---
license: apache-2.0
base_model: google/gemma-4-31B-it
---

| Field | Value |
|---|---|
| Avg. Total Time | 29.57 s |
| Avg. TTFT | 12.55 s |
| Avg. Prefill TPS | 445.10 |
| Avg. Gen TPS | 22.01 |
| Context Size | 262,144 tokens |
| Quantization | r64 |
| Engine | vLLM |
| Creation Method | LoRA Finetune |
| Model Type | Gemma31B |
| Chat Template | Gemma4 |
| Reasoning | Yes |
| Vision | Yes |
| Parameters | 31B |
| Added At | 5/1/2026 |

Full-parameter fine-tune of google/gemma-4-31B-it on 12,680 Claude Opus 4.6 reasoning traces.
This is the first full-parameter fine-tune of Gemma 4 31B.
| Detail | Value |
|---|---|
| Base | google/gemma-4-31B-it |
| Method | Full-parameter SFT (not LoRA) |
| Framework | TRL SFTTrainer + PyTorch FSDP |
| Hardware | 8x NVIDIA H200 (141 GB each) |
| Precision | bf16 |
| Total epochs | 4 (2 at lr=1e-5, then 2 more at lr=5e-6) |
| Sequence length | 8,192 tokens |
| Effective batch size | 10 |
Training used a two-phase learning-rate schedule:
| Phase | Epochs | Learning rate | Result |
|---|---|---|---|
| Initial | 2 | 1e-5 (cosine) | 80.8% accuracy |
| Continued | 2 | 5e-6 (cosine) | 89.7% accuracy |
Continuing from the warm phase-1 checkpoint at a lower learning rate improved token accuracy by roughly 9 percentage points.
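
A schedule like this can be expressed with TRL's SFTTrainer, as in the sketch below. Only the learning rates, epoch counts, cosine scheduler, bf16 precision, and 8,192-token sequence length come from the tables above; the dataset choice, output paths, and per-device batch size are placeholders, and the sequence-length field is named differently across TRL versions (`max_seq_length` here).

```python
# Hypothetical two-phase SFT sketch with TRL; in practice this would be
# launched under PyTorch FSDP across 8 GPUs via `accelerate launch`.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder: one of the three mixture datasets (full mixture shown in the
# dataset section below).
train_dataset = load_dataset("Crownelius/Opus-4.6-Reasoning-3300x", split="train")

def run_phase(model_or_ckpt: str, lr: float, output_dir: str) -> None:
    config = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=2,             # each phase ran for 2 epochs
        learning_rate=lr,
        lr_scheduler_type="cosine",     # cosine decay in both phases
        bf16=True,                      # training precision from the table
        max_seq_length=8192,            # sequence length from the table
        per_device_train_batch_size=1,  # assumption; effective batch was 10
        logging_steps=10,
    )
    trainer = SFTTrainer(model=model_or_ckpt, args=config, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(output_dir)

# Phase 1: cold start from the base model at lr=1e-5.
run_phase("google/gemma-4-31B-it", 1e-5, "ckpt-phase1")
# Phase 2: continue from the warm phase-1 checkpoint at lr=5e-6.
run_phase("ckpt-phase1", 5e-6, "ckpt-phase2")
```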
| Metric | After phase 1 | After phase 2 (final) |
|---|---|---|
| Loss | 27.5 | 13.6 |
| Token accuracy | 80.8% | 89.7% |
| Grad norm | 15.3 | 15.3 |
| Entropy | 0.69 | 0.34 |
All training data was generated by Claude Opus 4.6; no mixed-model data.
| Dataset | Samples | Description |
|---|---|---|
| Crownelius/Opus-4.6-Reasoning-3300x | 2,160 | Cleaned Claude Opus 4.6 reasoning — math, code, diverse |
| TeichAI/Claude-Opus-4.6-Reasoning-887x | 887 | Tool-use reasoning + vague prompt handling |
| Roman1111111/claude-opus-4.6-10000x | 9,633 | Math/logic reasoning with verified solutions |
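
A sketch of assembling the mixture with the datasets library follows; the split names and schema compatibility across the three sets are assumptions.

```python
# Hypothetical reconstruction of the 12,680-sample mixture; assumes all three
# datasets share a "train" split and compatible columns.
from datasets import concatenate_datasets, load_dataset

parts = [
    load_dataset("Crownelius/Opus-4.6-Reasoning-3300x", split="train"),
    load_dataset("TeichAI/Claude-Opus-4.6-Reasoning-887x", split="train"),
    load_dataset("Roman1111111/claude-opus-4.6-10000x", split="train"),
]
mixture = concatenate_datasets(parts).shuffle(seed=42)
print(len(mixture))  # 12,680 if the splits match the sample counts above
```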
```python
from transformers import AutoProcessor, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EganAI/gemma4-31b-opus-reasoning",
    torch_dtype="auto",   # loads in bf16, matching the training precision
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("EganAI/gemma4-31b-opus-reasoning")

messages = [
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
]

# Build the prompt with the chat template; enable_thinking=True requests a
# reasoning trace before the final answer.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,   # required for temperature/top_p/top_k to take effect
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
# Decode only the newly generated tokens; keep special tokens so the
# reasoning markers stay visible.
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
```
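
The throughput figures at the top of this card were measured with vLLM. Below is a minimal offline-inference sketch using vLLM's Python API, assuming vLLM supports this architecture; the sampling settings mirror the transformers example above.

```python
# Sketch of offline inference with vLLM; max_model_len matches the advertised
# 262,144-token context (reduce it if KV-cache memory is tight).
from vllm import LLM, SamplingParams

llm = LLM(model="EganAI/gemma4-31b-opus-reasoning", max_model_len=262144)
params = SamplingParams(temperature=1.0, top_p=0.95, top_k=64, max_tokens=2048)

messages = [
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```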
| Format | VRAM | Device |
|---|---|---|
| bf16 | ~62GB | 1x A100/H100 80GB |
| Q8 | ~31GB | 2x RTX 4090 |
| Q4_K_M | ~17GB | RTX 4090 |
| Q3_K_M | ~14GB | RTX 4080 |
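
The quantized rows refer to GGUF builds for llama.cpp-style runtimes. As a rough transformers-side analogue, a 4-bit bitsandbytes load looks like the sketch below; NF4 is not the same scheme as Q4_K_M, so quality and VRAM use will differ from the table.

```python
# Rough 4-bit loading sketch with bitsandbytes (not equivalent to GGUF Q4_K_M).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "EganAI/gemma4-31b-opus-reasoning",
    quantization_config=bnb,
    device_map="auto",
)
```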
Apache 2.0 (same as Gemma 4)