Qwen3.5-27B-Marvin-DPO-V2-Derestricted

Creative model


Performance Metrics

| Metric | Value |
| --- | --- |
| Avg. Total Time | 78.82s |
| Avg. TTFT | 12.28s |
| Avg. Prefill TPS | 3335.57 |
| Avg. Gen TPS | 22.64 |

Model Information

| Field | Value |
| --- | --- |
| Context Size | 262144 |
| Quantization | r64 |
| Engine | vllm |
| Creation Method | LoRA |
| Model Type | Qwen35 |
| Chat Template | Qwen3.5 |
| Reasoning | Yes |
| Vision | Yes |
| Parameters | 27B |
| Added At | 4/27/2026 |


---
base_model: ToastyPigeon/Qwen3.5-27B-Marvin-V2
library_name: transformers
pipeline_tag: text-generation
tags:
  - dpo
  - creative-writing
  - roleplay
  - qwen3.5
  - 27b
license: other
language:
  - en
---

Qwen3.5-27B-Marvin-DPO-V2

A Qwen3.5-27B model fine-tuned for high-quality creative writing and roleplay, with DPO applied to reduce repetition, suppress AI-isms, and improve writing style.

Model Stack

Qwen/Qwen3.5-27B
  → ArliAI/Qwen3.5-27B-Derestricted (safety filter removal)
    → SFT: 5,974 samples (4,478 Marvin literary + 1,497 Seed RP)
      = ToastyPigeon/Qwen3.5-27B-Marvin-V2
        → DPO: 402 combined preference pairs
          = This model (Marvin-DPO-V2)

DPO Training Details

Combined DPO with three objectives trained simultaneously in a single run:

| Subset | Pairs | Purpose |
| --- | --- | --- |
| Anti-repetition (rewritten) | 102 | Suppress sentence/paragraph-level repetition. Thinking traces rewritten from verbose (avg. 704 words) to concise (avg. 91 words). |
| Anti-repetition (RP context) | 100 | Anti-repetition in roleplay scenarios. 20 unique character/setting combinations. |
| Style cleanup | 200 | Improve prose quality. Chosen: Marvin literary corpus excerpts. Rejected: model-generated versions of the same scenes. 50% book-style / 30% asterisk-action / 20% mixed format. |
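
Each subset reduces to ordinary DPO preference pairs. For the style-cleanup subset, one record would look roughly like this; field names and content are illustrative, not taken from the actual training files:

```python
# Illustrative style-cleanup pair (names and content assumed, not from the dataset).
pair = {
    "prompt": "Continue the scene: the detective waits out the rain ...",
    # chosen: a human-written excerpt from the Marvin literary corpus
    "chosen": "The rain had stopped an hour ago, but the gutters still talked about it.",
    # rejected: the model's own rendition of the same scene
    "rejected": "The rain fell, casting a melancholic glow, her heart pounding ...",
}
```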

Think masking: DPO loss is computed only on the response content after `</think>`, not on the thinking traces themselves. This prevents the DPO signal from accidentally training away the model's ability to think.
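
The masking code itself isn't shown in the card; a minimal sketch of the idea, assuming 1-D token labels and the token ids that `</think>` encodes to (`mask_thinking` is a hypothetical helper):

```python
import torch

def mask_thinking(labels: torch.Tensor, end_think_ids: list[int]) -> torch.Tensor:
    """Ignore everything up to and including the last </think> occurrence,
    so the DPO loss is computed only on the visible reply."""
    seq, pat = labels.tolist(), end_think_ids
    cut = -1
    for i in range(len(seq) - len(pat) + 1):
        if seq[i : i + len(pat)] == pat:
            cut = i + len(pat)  # remember the last match
    masked = labels.clone()
    if cut > 0:
        masked[:cut] = -100  # -100 is the standard ignore index for LM losses
    return masked
```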

Hyperparameters

  • DPO beta: 0.1
  • Loss type: Sigmoid
  • Learning rate: 5e-6 (cosine schedule, 10% warmup)
  • LoRA: r=32, alpha=16, RSLoRA, no dropout
  • Quantization: QLoRA (NF4)
  • Precision: bf16
  • Batch size: 1 × 4 grad accumulation = effective 4
  • Epochs: 1
  • Training time: ~68 minutes on 2× RTX 3090
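
Training used strangedove/loft rather than trl, but for orientation the hyperparameters above map onto trl's `DPOConfig` and peft's `LoraConfig` roughly as follows (a sketch under that assumption, not the actual trainer call; the full original YAML is further down):

```python
from trl import DPOConfig
from peft import LoraConfig

dpo_args = DPOConfig(
    beta=0.1,
    loss_type="sigmoid",
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size 1 x 4 = 4
    num_train_epochs=1,
    bf16=True,
    max_length=2048,
    max_prompt_length=512,
    max_completion_length=1536,
    output_dir="runs/qwen35-27b-combined-dpo-v2",
)

peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    use_rslora=True,
    task_type="CAUSAL_LM",
)
```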

Training Metrics

  • Train loss: 0.117 avg
  • Reward accuracy: 100%
  • Reward margins: 6-8 (strong chosen/rejected separation)

Evaluation

Tested across 5 scenarios (temp=0.8, top_p=0.9):

| Test | -ing patterns | Slop phrases | Notes |
| --- | --- | --- | --- |
| RP coffeeshop scene | 0 | 0 | Natural dialogue, good pacing |
| Hemingway style transfer | 1 | 0 | Short declarative sentences, understated |
| Chandler noir style | 4 | 0 | Vivid metaphors, atmospheric |
| Emotional scene (slop trap) | 2 | 0 | Grounded, no AI-isms |
| Instruction following | 1 | 0 | Doesn't write for user |

Anti-Repetition (8-turn multi-turn test)

| Model | Repeated 4/5-grams (3+) | Total -ing |
| --- | --- | --- |
| V2 Base (no DPO) | 6× "corner of my", 4× "tugging at the", 3× "loose strand" | 4 |
| DPO-V2 (this model) | 1× "corner of her" at 3× | 9 |

Zero sentence-level repetition across 8 turns of conversation, compared to significant repetition in the base model by turn 6.
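
The counting script isn't published with the card; a repeated-n-gram tally like the one in the table can be produced with a small helper such as this (hypothetical, whitespace tokenization assumed, threshold of 3 as in the table):

```python
from collections import Counter

def repeated_ngrams(text: str, n: int, min_count: int = 3) -> dict[str, int]:
    """Count word n-grams that occur at least min_count times."""
    words = text.lower().split()
    grams = Counter(" ".join(words[i : i + n]) for i in range(len(words) - n + 1))
    return {g: c for g, c in grams.items() if c >= min_count}

# Example over a concatenated 8-turn transcript:
# repeated_ngrams(transcript, 4) | repeated_ngrams(transcript, 5)
```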

Recommended Settings

  • Temperature: 0.8
  • Top-p: 0.9
  • Format: "Quotation marks" for speech, plain text for narration, `*italics*` for inner thoughts
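
The serving engine listed above is vLLM, so the recommended settings translate directly to `SamplingParams`; a minimal sketch (the repo id is assumed from the card title, the prompt is illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="ToastyPigeon/Qwen3.5-27B-Marvin-DPO-V2")  # repo id assumed
params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=512)

outputs = llm.generate(["Write the opening of a rainy noir scene."], params)
print(outputs[0].outputs[0].text)
```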

Limitations

  • Ethiopian Yirgacheffe appears disproportionately when the model discusses coffee (baked into base model training data)
  • Thinking mode is suppressed; the model produces empty think blocks. Prefill `<think>\n\n</think>\n\n` to run in non-thinking mode (see the sketch after this list).
  • Participial phrase patterns (-ing) are reduced but not eliminated
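
A minimal sketch of the prefill trick from the second bullet, using transformers' chat templating (repo id assumed from the card title):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ToastyPigeon/Qwen3.5-27B-Marvin-DPO-V2")  # repo id assumed
messages = [{"role": "user", "content": "Describe the harbor at dawn."}]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n\n</think>\n\n"  # prefill an empty think block -> non-thinking mode
# feed `prompt` to your completion endpoint as a raw (non-chat) prompt
```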

Training Config

train-v2.yaml:
# Combined DPO V2: antirep + style + thinking — on Marvin V2 base
# 402 pairs, 1 epoch, beta=0.1, LR=5e-6
# V2: think masking enabled, 63% of pairs have think blocks

model_name_or_path: ToastyPigeon/Qwen3.5-27B-Marvin-V2
output_dir: runs/qwen35-27b-combined-dpo-v2

attn_implementation: flash_attention_2
bf16: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

model_parallel: true
max_memory:
  0: "18GiB"
  1: "18GiB"

chunked_mlp: true
chunked_mlp_chunks: 8

max_length: 2048
max_prompt_length: 512
max_completion_length: 1536

use_chunked_dpo: true
chunked_dpo_size: 4096
precompute_ref_log_probs: true
mask_thinking: true

per_device_train_batch_size: 1
gradient_accumulation_steps: 4

use_peft: true
load_in_4bit: true
bnb_4bit_quant_type: nf4
lora_r: 32
lora_alpha: 16
lora_dropout: 0.0
use_rslora: true
lora_target_modules:
  - in_proj_qkv
  - in_proj_z
  - in_proj_a
  - in_proj_b
  - out_proj
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

beta: 0.1
loss_type: sigmoid

learning_rate: 5.0e-6
lr_scheduler_type: cosine
warmup_ratio: 0.1
weight_decay: 0.0
max_grad_norm: 1.0
optim: paged_adamw_8bit
num_train_epochs: 1

logging_steps: 1
save_strategy: epoch
save_total_limit: 1
report_to: none

Training code: strangedove/loft (transformers-5x branch)

Hardware

  • Training: 2× NVIDIA RTX 3090 (48GB total VRAM)
  • Inference: Fits in ~16GB VRAM at Q4_K_M quantization

GGUF

Q4_K_M quantization available at ToastyPigeon/Qwen3.5-Test-GGUFs as Qwen3.5-27B-Marvin-DPO-V2-Q4_K_M.gguf.
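
One way to run the Q4_K_M file locally is llama-cpp-python, shown here as a sketch with the recommended sampling settings (context size and layer offload values are illustrative):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-27B-Marvin-DPO-V2-Q4_K_M.gguf",
    n_ctx=8192,        # raise toward the 262144 maximum if memory allows
    n_gpu_layers=-1,   # offload all layers to the GPU (~16GB VRAM at Q4_K_M)
)

out = llm("Write a short noir opening.", temperature=0.8, top_p=0.9, max_tokens=256)
print(out["choices"][0]["text"])
```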