| Field | Value |
|---|---|
| Avg. Total Time | 78.82s |
| Avg. TTFT | 12.28s |
| Avg. Prefill TPS | 3335.57 |
| Avg. Gen TPS | 22.64 |
| Context Size | 262144 |
| Quantization | r64 |
| Engine | vllm |
| Creation Method | LoRA |
| Model Type | Qwen35 |
| Chat Template | Qwen3.5 |
| Reasoning | Yes |
| Vision | Yes |
| Parameters | 27B |
| Added At | 4/27/2026 |
---
base_model: ToastyPigeon/Qwen3.5-27B-Marvin-V2
library_name: transformers
pipeline_tag: text-generation
tags:
---
A Qwen3.5-27B model fine-tuned for high-quality creative writing and roleplay, with DPO applied to reduce repetition, suppress AI-isms, and improve writing style.
Qwen/Qwen3.5-27B
→ ArliAI/Qwen3.5-27B-Derestricted (safety filter removal)
→ SFT: 5,974 samples (4,478 Marvin literary + 1,497 Seed RP)
= ToastyPigeon/Qwen3.5-27B-Marvin-V2
→ DPO: 402 combined preference pairs
= This model (Marvin-DPO-V2)
Combined DPO with three objectives trained simultaneously in a single run:
| Subset | Pairs | Purpose |
|---|---|---|
| Anti-repetition (rewritten) | 102 | Suppress sentence/paragraph-level repetition. Thinking traces rewritten from verbose (avg 704w) to concise (avg 91w). |
| Anti-repetition (RP context) | 100 | Anti-repetition in roleplay scenarios. 20 unique character/setting combinations. |
| Style cleanup | 200 | Improve prose quality. Chosen: Marvin literary corpus excerpts. Rejected: model-generated versions of the same scenes. 50% book-style / 30% asterisk-action / 20% mixed format. |
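All three subsets are trained with the same preference objective; the config below uses `loss_type: sigmoid` with `beta: 0.1`. As a minimal sketch, the per-pair sigmoid DPO loss compares how much more the policy prefers the chosen completion over the rejected one, relative to the reference model (the log-probabilities here are invented toy numbers, not values from this training run):

```python
import math

def dpo_sigmoid_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair sigmoid DPO loss:
    -log(sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r))))."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy log-probs: the policy already prefers the chosen completion more than
# the reference does, so the loss falls below log(2), the no-signal baseline.
loss = dpo_sigmoid_loss(pi_chosen=-12.0, pi_rejected=-15.0,
                        ref_chosen=-13.0, ref_rejected=-14.0, beta=0.1)
print(loss < math.log(2))  # → True
```

With a small `beta` like 0.1, large log-prob margins translate into gentle gradients, which suits a short 402-pair run on top of an already-tuned base.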
Think masking: DPO loss is computed only on the response content after </think>, not on the thinking traces themselves. This prevents the DPO signal from accidentally training away the model's ability to think.
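A minimal sketch of that masking on a flat token-id sequence (the token ids and the `</think>` id are invented for illustration; the real trainer applies this to the tokenized completions inside the loss computation):

```python
def think_mask(token_ids, think_close_id):
    """Per-token loss weights: 0 up to and including the first </think>, 1 after.
    If no </think> token is present, train on the whole completion."""
    if think_close_id not in token_ids:
        return [1] * len(token_ids)
    cut = token_ids.index(think_close_id)  # first </think> occurrence
    return [0] * (cut + 1) + [1] * (len(token_ids) - cut - 1)

# Toy sequence: [thinking tokens..., </think> (id 99), response tokens...]
seq = [5, 7, 7, 99, 42, 43, 44]
print(think_mask(seq, 99))  # → [0, 0, 0, 0, 1, 1, 1]
```

Only the `1`-weighted response tokens contribute to the DPO margin, so the preference signal never pushes the thinking trace toward or away from either completion.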
Tested across 5 scenarios (temp=0.8, top_p=0.9):
| Test | -ing patterns | Slop phrases | Notes |
|---|---|---|---|
| RP coffeeshop scene | 0 | 0 | Natural dialogue, good pacing |
| Hemingway style transfer | 1 | 0 | Short declarative sentences, understated |
| Chandler noir style | 4 | 0 | Vivid metaphors, atmospheric |
| Emotional scene (slop trap) | 2 | 0 | Grounded, no AI-isms |
| Instruction following | 1 | 0 | Doesn't write for user |
| Model | Repeated 4/5-grams (3+) | Total -ing |
|---|---|---|
| V2 Base (no DPO) | 6× "corner of my", 4× "tugging at the", 3× "loose strand" | 4 |
| DPO-V2 (this model) | one pattern: "corner of her" (3×) | 9 |
Zero sentence-level repetition across 8 turns of conversation, compared to significant repetition in the base model by turn 6.
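Repeated n-gram counts like those above can be reproduced with a simple word-window counter. This is an illustrative sketch, not the exact script used for this card:

```python
from collections import Counter

def repeated_ngrams(text, n, min_count=3):
    """Count word n-grams that appear at least `min_count` times."""
    words = text.lower().split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {" ".join(g): c for g, c in grams.items() if c >= min_count}

sample = ("the corner of my eye twitched and the corner of my mouth "
          "and the corner of my desk")
print(repeated_ngrams(sample, 4))  # → {'the corner of my': 3}
```

Running this with n=4 and n=5 over a generation, with a threshold of 3, matches the "Repeated 4/5-grams (3+)" column.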
"Quotation marks" for speech, plain text for narration, *italics* for inner thoughts. Prefill `<think>\n\n</think>\n\n` for non-thinking mode.

Training configuration:

```yaml
# Combined DPO V2: antirep + style + thinking — on Marvin V2 base
# 402 pairs, 1 epoch, beta=0.1, LR=5e-6
# V2: think masking enabled, 63% of pairs have think blocks
model_name_or_path: ToastyPigeon/Qwen3.5-27B-Marvin-V2
output_dir: runs/qwen35-27b-combined-dpo-v2
attn_implementation: flash_attention_2
bf16: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
model_parallel: true
max_memory:
  0: "18GiB"
  1: "18GiB"
chunked_mlp: true
chunked_mlp_chunks: 8
max_length: 2048
max_prompt_length: 512
max_completion_length: 1536
use_chunked_dpo: true
chunked_dpo_size: 4096
precompute_ref_log_probs: true
mask_thinking: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
use_peft: true
load_in_4bit: true
bnb_4bit_quant_type: nf4
lora_r: 32
lora_alpha: 16
lora_dropout: 0.0
use_rslora: true
lora_target_modules:
  - in_proj_qkv
  - in_proj_z
  - in_proj_a
  - in_proj_b
  - out_proj
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
beta: 0.1
loss_type: sigmoid
learning_rate: 5.0e-6
lr_scheduler_type: cosine
warmup_ratio: 0.1
weight_decay: 0.0
max_grad_norm: 1.0
optim: paged_adamw_8bit
num_train_epochs: 1
logging_steps: 1
save_strategy: epoch
save_total_limit: 1
report_to: none
```
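The non-thinking prefill mentioned above can be applied by appending an empty think block after the assistant header of the rendered prompt, so generation starts directly on the response. A hedged sketch; the assistant-header string shown is an assumption about the Qwen-style chat template, and in practice you would render the chat with the tokenizer's chat template first:

```python
# Illustrative only: append an empty think block so the model skips the
# reasoning phase. The <|im_start|>/<|im_end|> header format is assumed.
def non_thinking_prompt(rendered_chat: str) -> str:
    """rendered_chat: chat-template output ending at the assistant header."""
    return rendered_chat + "<think>\n\n</think>\n\n"

prompt = non_thinking_prompt(
    "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
)
print(prompt.endswith("<think>\n\n</think>\n\n"))  # → True
```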
Training code: strangedove/loft (transformers-5x branch)
Q4_K_M quantization available at ToastyPigeon/Qwen3.5-Test-GGUFs as Qwen3.5-27B-Marvin-DPO-V2-Q4_K_M.gguf.