Llama-3.3+(3.1v3.3)-70B-Brinebreath

Creative Model


Performance Metrics

  • Avg. Total Time: 26.31s
  • Avg. TTFT: 60.79s
  • Avg. Prefill TPS: 189.90
  • Avg. Gen TPS: 18.16

Model Information

  • Context Size: 32768
  • Quantization: r64
  • Engine: aphrodite
  • Creation Method: Merge
  • Model Type: Llama70B
  • Chat Template: Llama 3
  • Reasoning: No
  • Vision: No
  • Parameters: 70B
  • Added At: 12/22/2024
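Since the listing reports the aphrodite engine with a 32768-token context and the Llama 3 chat template, the deployment can presumably be queried through an OpenAI-compatible endpoint, which Aphrodite exposes. A minimal sketch follows; the base URL, API key, and model ID are placeholders, not values taken from this page.

```python
# Minimal sketch of querying the deployment, assuming an OpenAI-compatible
# endpoint as exposed by the Aphrodite engine. Base URL, API key, and model ID
# are placeholders; substitute the actual values for the service.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:2242/v1",  # placeholder endpoint
    api_key="not-needed-for-local",       # placeholder key
)

response = client.chat.completions.create(
    model="Llama-3.3+(3.1v3.3)-70B-Brinebreath",  # placeholder model ID
    messages=[
        {"role": "system", "content": "You are a helpful creative writing assistant."},
        {"role": "user", "content": "Sketch an opening paragraph for a sea-faring story."},
    ],
    max_tokens=512,
    temperature=0.9,
)
print(response.choices[0].message.content)
```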


Brinebreath-Llama-3.1-70B

I made this after running into some problems with Cathallama. It has behaved well over a few days of testing.

Notable Performance

  • 7 percentage point increase in overall MMLU-PRO success rate over LLaMA 3.1 70B at Q4_0
  • Strong performance in MMLU-PRO categories overall
  • Great performance during manual testing

Creation workflow

Models merged

  • meta-llama/Meta-Llama-3.1-70B-Instruct
  • NousResearch/Hermes-3-Llama-3.1-70B
  • abacusai/Dracarys-Llama-3.1-70B-Instruct
  • VAGOsolutions/Llama-3.1-SauerkrautLM-70b-Instruct
flowchart TD
    A[Hermes 3] -->|Merge with| B[Meta-Llama-3.1]
    C[Dracarys] -->|Merge with| D[Meta-Llama-3.1]
    B -->| | E[Merge]
    D -->| | E[Merge]
    G[SauerkrautLM] -->|Merge with| E[Merge]
    E[Merge] -->| | F[Brinebreath]
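
The card does not say which tool or merge method was used, only that the model is a merge of the four parents above. As a rough, hypothetical illustration, the first pairwise step of the diagram (Hermes 3 merged with Meta-Llama-3.1) could be expressed as a mergekit config and run as follows; the method (linear), weights, and dtype are assumptions, not the author's actual recipe.

```python
# Hypothetical reproduction of ONE step of the merge tree with mergekit.
# The actual tool, merge method, weights, and dtype used for Brinebreath are
# not documented in this card; everything below is assumed for illustration.
import subprocess
from pathlib import Path

config = """\
# Hermes 3 merged with Meta-Llama-3.1 (assumed equal-weight linear merge)
merge_method: linear
models:
  - model: meta-llama/Meta-Llama-3.1-70B-Instruct
    parameters:
      weight: 0.5
  - model: NousResearch/Hermes-3-Llama-3.1-70B
    parameters:
      weight: 0.5
dtype: bfloat16
"""

Path("hermes_step.yaml").write_text(config)

# mergekit provides the `mergekit-yaml` CLI. The remaining steps of the tree
# (Dracarys + base, merging the two intermediates, then folding in SauerkrautLM)
# would each be another config run the same way on the previous output.
subprocess.run(["mergekit-yaml", "hermes_step.yaml", "./hermes-step-merge"], check=True)
```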


Testing

Hyperparameters

  • Temperature: 0.0 for automated, 0.9 for manual
  • Penalize repeat sequence: 1.05
  • Consider N tokens for penalize: 256
  • Penalize repetition of newlines
  • Top-K sampling: 40
  • Top-P sampling: 0.95
  • Min-P sampling: 0.05
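
These tests were run with llama.cpp directly; purely for illustration, the same sampling settings could be applied through the llama-cpp-python bindings roughly as below. The prompt is a placeholder, and the flash_attn flag assumes a reasonably recent version of the bindings.

```python
# Illustrative only: the card ran llama.cpp directly; this sketch applies the
# same sampling settings through the llama-cpp-python bindings instead.
from llama_cpp import Llama

llm = Llama(
    model_path="Brinebreath-Llama-3.1-70B.Q4_0.gguf",
    n_gpu_layers=-1,          # offload all layers, mirroring `-ngl -1`
    flash_attn=True,          # mirroring `-fa`
    use_mmap=False,           # mirroring `--no-mmap`
    last_n_tokens_size=256,   # tokens considered for the repeat penalty
)

out = llm.create_completion(
    "Write a short poem about the sea.",  # placeholder prompt
    temperature=0.9,          # 0.0 was used for automated tests, 0.9 for manual
    repeat_penalty=1.05,
    top_k=40,
    top_p=0.95,
    min_p=0.05,
    max_tokens=256,
)
print(out["choices"][0]["text"])
```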

llama.cpp Version

  • b3600-1-g2339a0be
  • -fa -ngl -1 -ctk f16 --no-mmap

Tested Files

  • Brinebreath-Llama-3.1-70B.Q4_0.gguf
  • Meta-Llama-3.1-70B-Instruct.Q4_0.gguf

Manual testing

| Category | Test Case | Brinebreath-Llama-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
|---|---|---|---|
| Common Sense | Ball on cup | OK | OK |
| | Big duck small horse | OK | OK |
| | Killers | OK | OK |
| | Strawberry r's | KO | KO |
| | 9.11 or 9.9 bigger | KO | KO |
| | Dragon or lens | KO | KO |
| | Shirts | OK | KO |
| | Sisters | OK | KO |
| | Jane faster | OK | OK |
| Programming | JSON | OK | OK |
| | Python snake game | OK | KO |
| Math | Door window combination | OK | KO |
| Smoke | Poem | OK | OK |
| | Story | OK | OK |

Note: See sample_generations.txt in the main folder of the repo for the raw generations.

MMLU-PRO

| Model | Success % |
|---|---|
| Brinebreath-3.1-70B.Q4_0.gguf | 49.0% |
| Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 42.0% |

| MMLU-PRO category | Brinebreath-3.1-70B.Q4_0.gguf | Meta-Llama-3.1-70B-Instruct.Q4_0.gguf |
|---|---|---|
| Business | 45.0% | 40.0% |
| Law | 40.0% | 35.0% |
| Psychology | 85.0% | 80.0% |
| Biology | 80.0% | 75.0% |
| Chemistry | 50.0% | 45.0% |
| History | 65.0% | 60.0% |
| Other | 55.0% | 50.0% |
| Health | 70.0% | 65.0% |
| Economics | 80.0% | 75.0% |
| Math | 35.0% | 30.0% |
| Physics | 45.0% | 40.0% |
| Computer Science | 60.0% | 55.0% |
| Philosophy | 50.0% | 45.0% |
| Engineering | 45.0% | 40.0% |

Note: Overall MMLU-PRO was tested with 100 questions; each category was tested with 20 questions, so per-category scores move in 5-point steps.

PubmedQA

| Model Name | Success % |
|---|---|
| Brinebreath-3.1-70B.Q4_0.gguf | 71.00% |
| Meta-Llama-3.1-70B-Instruct.Q4_0.gguf | 68.00% |

Note: PubmedQA tested with 100 questions.

Request

If you are hiring in the EU or can sponsor a visa, PM me :D

PS. Thank you mradermacher for the GGUFs!

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 36.29 |
| IFEval (0-Shot) | 55.33 |
| BBH (3-Shot) | 55.46 |
| MATH Lvl 5 (4-Shot) | 29.98 |
| GPQA (0-shot) | 12.86 |
| MuSR (0-shot) | 17.49 |
| MMLU-PRO (5-shot) | 46.62 |
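
The reported Avg. is consistent with the arithmetic mean of the six benchmark scores above; a quick sanity check using the values from the table:

```python
# Quick check that the reported Avg. matches the mean of the six scores above.
scores = {
    "IFEval (0-Shot)": 55.33,
    "BBH (3-Shot)": 55.46,
    "MATH Lvl 5 (4-Shot)": 29.98,
    "GPQA (0-shot)": 12.86,
    "MuSR (0-shot)": 17.49,
    "MMLU-PRO (5-shot)": 46.62,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 36.29, matching the Avg. row
```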