Benchmarking Llama, Mistral, Gemma, DeepSeek, and GPT for factuality, toxicity, bias, instruction following, resistance to jailbreaks, and propensity for hallucinations
Note
UPDATED February 10th, 2025:
- Benchmarks the latest open-source LLMs, including DeepSeek, OLMo-2, and others.
- All datasets revised: hundreds of ground-truth corrections.
- Models tested are all 'small LLMs' (7B-12B parameters), except for GPT-4o, which is included in the benchmark as an 'upper limit'.
Links: Extended Results (Feb. 10th 2025) | Paper | Datasets | Red teaming tool
We ran the benchmark on a server with 1 x NVIDIA A100 80GB.
Llama2, Mistral, and Gemma are downloaded and run locally, requiring approximately 90 GB of disk space.
python3.11 -m venv .venv
. .venv/bin/activate
pip install wheel pip -U
pip install -r requirements.txt
(Python 3.10 works as well.)
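Since the local models run on a GPU, a quick sanity check that PyTorch can see the device may save time later (this assumes torch is pulled in via requirements.txt; adjust if your setup differs):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"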
In order to download Hugging Face datasets and models, you need a token.
The benchmark uses 14 datasets, 3 of which are gated; you need to request access here, here, and here.
Llama2 is a gated model; you need to request access.
Gemma is a gated model; you need to request access.
In order to call the OpenAI API, you need a key.
Export secret keys to environment variables:
export HF_TOKEN=xyz
export OPENAI_API_KEY=xyz
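Optionally, you can sanity-check both secrets before starting a long run (this assumes the huggingface_hub CLI is available in the environment; curl is used here only for illustration):

huggingface-cli whoami
curl -s https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY" | head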
When running a benchmark, first declare the folder where the data will be stored, for instance:
export REDLITE_DATA_DIR=./data
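If you prefer, create the folder up front and point the variable at an absolute path (a generic shell pattern, not a requirement of the tool):

mkdir -p ./data
export REDLITE_DATA_DIR=$(pwd)/data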
The following script does it all:
python run_all.py
The original benchmark run takes ~24 hours on a GPU server.
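Because the run takes on the order of a day, you may want to detach it from your terminal, for example with nohup (a generic shell pattern; run_all.py itself does not require it):

nohup python run_all.py > run_all.log 2>&1 &
tail -f run_all.log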
Once the run completes, you can launch a local web app to visualize the results:
redlite server