Benchmarking Llama, Mistral, Gemma, DeepSeek, and GPT for factuality, toxicity, bias, instruction following, resistance to jailbreaks, and propensity for hallucinations
Note
UPDATED February 10th, 2025:
- Benchmarks the latest open-source LLMs, including DeepSeek, OLMo-2, and others.
- All datasets revised: hundreds of ground-truth corrections.
- Models tested are all 'small LLMs' (7B-12B parameters), except for GPT-4o, which is included in the benchmark as an 'upper limit'.
Links: Extended Results (Feb. 10th 2025) | Paper | Datasets | Red teaming tool
We ran the benchmark on a server with 1 x NVIDIA A100 80GB.
Llama2, Mistral, and Gemma are downloaded and run locally, requiring approximately 90 GB of disk space.
python3.11 -m venv .venv
. .venv/bin/activate
pip install wheel pip -U
pip install -r requirements.txt
(Python 3.10 works as well.)
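Since the local models run on a GPU, a quick sanity check that PyTorch can see the device may save time later (this assumes torch is pulled in via requirements.txt; adjust if your setup differs):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"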
In order to download Hugging Face datasets and models, you need a token.
The benchmark uses 14 datasets, 3 of which are gated; you need to request access here, here, and here.
Llama2 is a gated model; you need to request access.
Gemma is a gated model; you need to request access.
In order to call the OpenAI API, you need a key.
Export secret keys to environment variables:
export HF_TOKEN=xyz
export OPENAI_API_KEY=xyz
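Optionally, you can sanity-check both secrets before starting a long run (this assumes the huggingface_hub CLI is available in the environment; curl is used here only for illustration):

huggingface-cli whoami
curl -s https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY" | head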
When running a benchmark, first declare the folder where the data will be stored, for instance:
export REDLITE_DATA_DIR=./data
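If you prefer, create the folder up front and point the variable at an absolute path (a generic shell pattern, not a requirement of the tool):

mkdir -p ./data
export REDLITE_DATA_DIR=$(pwd)/data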
The following script does it all:
python run_all.py
The original benchmark run takes ~24 hours on a GPU server.
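Because the run takes on the order of a day, you may want to detach it from your terminal, for example with nohup (a generic shell pattern; run_all.py itself does not require it):

nohup python run_all.py > run_all.log 2>&1 &
tail -f run_all.log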
Once the run completes, you can launch a local web app to visualize the results:
redlite server