Qwen3.5-27B-Marvin-DPO-V2-Derestricted-Lite

Creative model


Model Information

  • Context Size: 262144
  • Quantization: r64
  • Engine: vllm
  • Creation Method: LoRA
  • Model Type: Qwen35
  • Chat Template: Qwen3.5
  • Reasoning: Yes
  • Vision: Yes
  • Parameters: 27B
  • Added At: 4/27/2026


---
base_model: ToastyPigeon/Qwen3.5-27B-Marvin-V2
library_name: transformers
pipeline_tag: text-generation
tags:
  - dpo
  - creative-writing
  - roleplay
  - qwen3.5
  - 27b
license: other
language:
  - en
---

Qwen3.5-27B-Marvin-DPO-V2

A Qwen3.5-27B model fine-tuned for high-quality creative writing and roleplay, with DPO applied to reduce repetition, suppress AI-isms, and improve writing style.

Model Stack

Qwen/Qwen3.5-27B
  → ArliAI/Qwen3.5-27B-Derestricted (safety filter removal)
    → SFT: 5,974 samples (4,478 Marvin literary + 1,497 Seed RP)
      = ToastyPigeon/Qwen3.5-27B-Marvin-V2
        → DPO: 402 combined preference pairs
          = This model (Marvin-DPO-V2)

DPO Training Details

Combined DPO with three objectives trained simultaneously in a single run:

| Subset | Pairs | Purpose |
|---|---|---|
| Anti-repetition (rewritten) | 102 | Suppress sentence/paragraph-level repetition. Thinking traces rewritten from verbose (avg 704w) to concise (avg 91w). |
| Anti-repetition (RP context) | 100 | Anti-repetition in roleplay scenarios. 20 unique character/setting combinations. |
| Style cleanup | 200 | Improve prose quality. Chosen: Marvin literary corpus excerpts. Rejected: model-generated versions of the same scenes. 50% book-style / 30% asterisk-action / 20% mixed format. |
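The three subsets above are ordinary chosen/rejected preference pairs. As a minimal sketch of what one pair might look like on disk (the field names follow the common prompt/chosen/rejected DPO convention and the text is invented for illustration; neither is taken from this repo):

```python
import json

# Hypothetical DPO preference pair; prompt/chosen/rejected field names are an
# assumption based on common DPO tooling, and the strings are illustrative.
pair = {
    "prompt": "Continue the coffeeshop scene from the barista's perspective.",
    "chosen": "She wiped the counter and said nothing.",  # literary-corpus style
    "rejected": "She wiped the counter, her fingers tugging at a loose strand.",
}

jsonl_line = json.dumps(pair)  # one pair per line in a .jsonl file
```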

Think masking: DPO loss is computed only on the response content after </think>, not on the thinking traces themselves. This prevents the DPO signal from accidentally training away the model's ability to think.
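A minimal sketch of the masking idea (my own illustrative helper, not the training repo's actual mask_thinking implementation): tokens up to and including </think> get loss weight 0, everything after gets weight 1, and completions without a think block are left fully weighted.

```python
def think_mask(token_strings):
    """Return per-token DPO loss weights: 0 for thinking-trace tokens
    (everything up to and including </think>), 1 for the response after it.
    Operates on decoded token strings for clarity; illustrative only."""
    # Completions without a think block contribute loss everywhere.
    if not any("</think>" in tok for tok in token_strings):
        return [1] * len(token_strings)
    mask, seen_close = [], False
    for tok in token_strings:
        mask.append(1 if seen_close else 0)
        if "</think>" in tok:
            seen_close = True
    return mask
```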

Hyperparameters

  • DPO beta: 0.1
  • Loss type: Sigmoid
  • Learning rate: 5e-6 (cosine schedule, 10% warmup)
  • LoRA: r=32, alpha=16, RSLoRA, no dropout
  • Quantization: QLoRA (NF4)
  • Precision: bf16
  • Batch size: 1 × 4 grad accumulation = effective 4
  • Epochs: 1
  • Training time: ~68 minutes on 2× RTX 3090
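The sigmoid loss with beta=0.1 can be written out directly. A per-example sketch (scalar log-probs stand in for summed token log-probs; this is the textbook DPO objective, not the training repo's code):

```python
import math

def dpo_sigmoid_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one preference pair, plus the reward margin.
    Inputs are sequence log-probs under the policy and the frozen reference."""
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward           # the "reward margin" metric
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return loss, margin
```

With margins in the 6-8 range reported below, the per-pair loss is already near zero (-log sigmoid(6) is about 0.0025), which is consistent with a 100% reward accuracy.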

Training Metrics

  • Train loss: 0.117 avg
  • Reward accuracy: 100%
  • Reward margins: 6-8 (strong chosen/rejected separation)

Evaluation

Tested across 5 scenarios (temp=0.8, top_p=0.9):

| Test | -ing patterns | Slop phrases | Notes |
|---|---|---|---|
| RP coffeeshop scene | 0 | 0 | Natural dialogue, good pacing |
| Hemingway style transfer | 1 | 0 | Short declarative sentences, understated |
| Chandler noir style | 4 | 0 | Vivid metaphors, atmospheric |
| Emotional scene (slop trap) | 2 | 0 | Grounded, no AI-isms |
| Instruction following | 1 | 0 | Doesn't write for user |

Anti-Repetition (8-turn multi-turn test)

| Model | Repeated 4/5-grams (3+) | Total -ing |
|---|---|---|
| V2 Base (no DPO) | 6× "corner of my", 4× "tugging at the", 3× "loose strand" | 4 |
| DPO-V2 (this model) | 1× "corner of her" at 3× | 9 |

Zero sentence-level repetition across 8 turns of conversation, compared to significant repetition in the base model by turn 6.

Recommended Settings

  • Temperature: 0.8
  • Top-p: 0.9
  • Format: "Quotation marks" for speech, plain text for narration, *italics* for inner thoughts
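These settings map directly onto an OpenAI-compatible request body (the model is served with vLLM per the info above; the model identifier here is a placeholder):

```python
# Hypothetical request body for an OpenAI-compatible /v1/chat/completions
# endpoint; the model name is a placeholder, the sampler values are the
# recommended settings above.
payload = {
    "model": "Qwen3.5-27B-Marvin-DPO-V2",
    "messages": [
        {
            "role": "user",
            "content": 'Narrate in plain text, put speech in "quotes", '
                       "and *italics* for inner thoughts.",
        },
    ],
    "temperature": 0.8,
    "top_p": 0.9,
}
```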

Limitations

  • Ethiopian Yirgacheffe appears disproportionately when the model discusses coffee (baked into base model training data)
  • Thinking mode is suppressed: the model produces empty think blocks. Use a <think>\n\n</think>\n\n prefill for non-thinking mode.
  • Participial phrase patterns (-ing) are reduced but not eliminated
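The non-thinking prefill above is simply appended as the opening of the assistant turn. A minimal sketch of the prompt assembly (the surrounding template text is up to your chat-template tooling and is not shown here):

```python
# Empty think block from the limitations above: starting the assistant turn
# with it makes generation proceed in non-thinking mode.
EMPTY_THINK = "<think>\n\n</think>\n\n"

def prefill_non_thinking(prompt_so_far):
    """Append an empty think block at the start of the assistant turn."""
    return prompt_so_far + EMPTY_THINK
```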

Training Config

train-v2.yaml:

```yaml
# Combined DPO V2: antirep + style + thinking — on Marvin V2 base
# 402 pairs, 1 epoch, beta=0.1, LR=5e-6
# V2: think masking enabled, 63% of pairs have think blocks

model_name_or_path: ToastyPigeon/Qwen3.5-27B-Marvin-V2
output_dir: runs/qwen35-27b-combined-dpo-v2

attn_implementation: flash_attention_2
bf16: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

model_parallel: true
max_memory:
  0: "18GiB"
  1: "18GiB"

chunked_mlp: true
chunked_mlp_chunks: 8

max_length: 2048
max_prompt_length: 512
max_completion_length: 1536

use_chunked_dpo: true
chunked_dpo_size: 4096
precompute_ref_log_probs: true
mask_thinking: true

per_device_train_batch_size: 1
gradient_accumulation_steps: 4

use_peft: true
load_in_4bit: true
bnb_4bit_quant_type: nf4
lora_r: 32
lora_alpha: 16
lora_dropout: 0.0
use_rslora: true
lora_target_modules:
  - in_proj_qkv
  - in_proj_z
  - in_proj_a
  - in_proj_b
  - out_proj
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

beta: 0.1
loss_type: sigmoid

learning_rate: 5.0e-6
lr_scheduler_type: cosine
warmup_ratio: 0.1
weight_decay: 0.0
max_grad_norm: 1.0
optim: paged_adamw_8bit
num_train_epochs: 1

logging_steps: 1
save_strategy: epoch
save_total_limit: 1
report_to: none
```

Training code: strangedove/loft (transformers-5x branch)

Hardware

  • Training: 2× NVIDIA RTX 3090 (48GB total VRAM)
  • Inference: Fits in ~16GB VRAM at Q4_K_M quantization

GGUF

Q4_K_M quantization available at ToastyPigeon/Qwen3.5-Test-GGUFs as Qwen3.5-27B-Marvin-DPO-V2-Q4_K_M.gguf.