Context Size: 262144
Quantization: r64
Engine: vllm
Creation Method: LoRA
Model Type: Qwen35
Chat Template: Qwen3.5
Reasoning: Yes
Vision: Yes
Parameters: 27B
Added At: 4/27/2026
---
license: mit
datasets:
---
Designed for RP and writing tasks.
Not sure if it's better than v2, but I like it. The main difference is the addition of some RP reasoning data from GLM5 & K2.5.
Both non-thinking and thinking modes are supported. If you want to use thinking, you must prefill `<think>\n`, as that is how the model was trained.
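A minimal sketch of the `<think>\n` prefill, assuming a ChatML-style prompt string (the exact template tokens here are an assumption; in practice apply the model's own chat template and append the prefill to the assistant turn):

```python
# Sketch: prefill "<think>\n" so generation starts inside the reasoning
# block, matching how the model was trained. The ChatML-style tags below
# are an assumption, not the model's verbatim template.

def build_prompt(user_message: str, thinking: bool = True) -> str:
    prompt = (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if thinking:
        # The opening tag is supplied by us; the model writes the
        # reasoning and the closing </think> itself.
        prompt += "<think>\n"
    return prompt

print(build_prompt("Hello!"))
```

Omitting the prefill (`thinking=False`) yields a plain assistant turn for non-thinking use.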
Creation Process: SFT
SFT on approximately 56 million tokens.
Mostly the same as v2, with one big difference: the Chub dataset was replaced with a version that includes reasoning, with training applied to the last turn only. This explodes the dataset out to 56 million tokens, but it means multi-turn reasoning gets trained correctly.
Also added a subset of 200 Gryphe RP samples that showed a high lexical difference from my current dataset.
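The last-turn-only training can be sketched as label masking, assuming token-level labels where ignored positions are set to -100 (the standard cross-entropy ignore index); the turn bookkeeping here is illustrative, not the actual preprocessing code:

```python
# Sketch: mask every assistant turn except the last, so only the final
# response (and its reasoning) contributes to the loss.
IGNORE_INDEX = -100  # standard ignore index for cross-entropy loss

def mask_all_but_last_turn(token_ids, turn_ids):
    """token_ids: list of ints; turn_ids: parallel list marking which
    assistant turn each token belongs to (-1 for non-assistant tokens)."""
    last_turn = max(t for t in turn_ids if t >= 0)
    return [
        tok if turn == last_turn else IGNORE_INDEX
        for tok, turn in zip(token_ids, turn_ids)
    ]

# Two assistant turns: only the second keeps its labels.
labels = mask_all_but_last_turn(
    [10, 11, 12, 13, 14, 15],
    [-1, 0, 0, -1, 1, 1],
)
print(labels)  # -> [-100, -100, -100, -100, 14, 15]
```

Because every multi-turn conversation is expanded so each final turn gets its own sample, the token count grows substantially, which matches the jump to ~56M tokens.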
Trained using Axolotl.
```yaml
base_model: Qwen/Qwen3.5-27B

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

strict: false

datasets:
  - path: ./data/bluestar_v4_sft_2_masked_20260402_120553.jsonl
val_set_size: 0.03
output_dir: ./Qwen3.5-27B-v3-SFT-2

sequence_len: 10756
sample_packing: true

load_in_8bit: true
adapter: lora
lora_r: 128
lora_alpha: 128
peft_use_rslora: true
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - down_proj
  - up_proj
  # Uncomment below to also target the linear attention projections.
  # These use separate in_proj_qkv / in_proj_z / out_proj (Qwen3.5-specific).
  # - linear_attn.in_proj_qkv
  # - linear_attn.in_proj_z
  # - linear_attn.out_proj

wandb_project: Qwen3.5-27B-SFT
wandb_name: Qwen3.5-27B-v3-SFT-2

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 1.2e-5
weight_decay: 0.01
warmup_ratio: 0.05

bf16: auto
tf32: true

resume_from_checkpoint:
logging_steps: 1
flash_attention: true

evals_per_epoch: 4
saves_per_epoch: 4

special_tokens:

fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3_5DecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: true
  activation_checkpointing: true
```
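For a rough sense of scale, the config implies the following back-of-envelope step count, assuming near-full sample packing on a single GPU (an idealized upper bound; actual counts depend on packing efficiency and world size):

```python
# Back-of-envelope: tokens per optimizer step and total optimizer steps,
# assuming fully packed sequences on one GPU.
sequence_len = 10756
micro_batch_size = 1
gradient_accumulation_steps = 4
num_epochs = 2
dataset_tokens = 56_000_000  # ~56M SFT tokens, per the card

tokens_per_step = sequence_len * micro_batch_size * gradient_accumulation_steps
total_steps = num_epochs * dataset_tokens // tokens_per_step

print(tokens_per_step)  # 43024 tokens per optimizer step
print(total_steps)      # roughly 2600 steps over 2 epochs
```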