Phare is a multilingual benchmark that measures LLM safety across multiple categories of vulnerabilities, including hallucination, biases & stereotypes, and harmful content.
Large Language Models (LLMs) have rapidly advanced in capability and adoption, becoming essential tools across a wide spectrum of natural language processing applications. While existing benchmarks have focused primarily on general performance metrics, such as accuracy and task completion, there is a growing recognition of the need to evaluate these models through the lens of safety and robustness. Concerns over hallucination, harmful outputs, and social bias have escalated alongside model deployment in sensitive and high-impact settings.
However, current safety evaluations tend to be fragmented or limited in scope, lacking unified diagnostic tools that deeply probe model behavior. In response to this gap, we introduce Phare, a multilingual and multi-faceted evaluation framework designed specifically to diagnose and analyze LLM failure modes across hallucination, bias, and harmful content. Phare aims to contribute a tool for the development of safer and more trustworthy AI systems.
Phare is easy to use and reproducible. You can set up and run the benchmark with just a few commands using the uv package manager.
- Install uv
- Clone this repo:

  ```bash
  git clone https://github.com/Giskard-AI/phare
  ```

- Install the requirements:

  ```bash
  uv sync
  source .venv/bin/activate
  ```

- Set up secrets: running the benchmark requires API tokens for calling the different models. Here is the list of expected environment variables:
  - `OPENAI_API_KEY`
  - `GEMINI_API_KEY`
  - `ANTHROPIC_API_KEY`
  - `OPENROUTER_API_KEY`
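For example, you can export them in your shell before running the benchmark (the values below are placeholders):

```bash
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export OPENROUTER_API_KEY="..."
```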
To set up the benchmark, simply run:

```bash
python 01_setup_benchmark.py --config_path <path_to_config>.yaml --save_path <path_to_save_benchmark>.db
```

The Hugging Face repository and the path to the files for each submodule should be set in `benchmark_config.yaml`, under the `hf_dataset` and `data_path` keys.
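For instance, using the config shipped with the repo and an illustrative database name:

```bash
python 01_setup_benchmark.py --config_path benchmark_config.yaml --save_path phare.db
```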
Each category should have the following structure:

```yaml
name: <category_name>
hf_dataset: giskardai/phare
data_path: <path_to_data>
tasks:
  - name: <task_name>
    scorer: <scorer_name>
    type: <task_type>
    description: <task_description>
```
Each task should provide a name, type, description, and its associated scorer. The data path should point to the folder under the Hugging Face repository containing the JSONL files for each task.
For example, in the giskardai/phare repository, using `hallucination/debunking` as the `<path_to_data>` with `misconceptions` as the `<task_name>` refers to the `hallucination/debunking/misconceptions.jsonl` file.
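Putting this together, a category entry for that file could look as follows (the scorer name, type, and description are placeholders):

```yaml
name: hallucination
hf_dataset: giskardai/phare
data_path: hallucination/debunking
tasks:
  - name: misconceptions
    scorer: <scorer_name>
    type: <task_type>
    description: <task_description>
```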
Inside the JSONL files, each line should have the following format:

```json
{
  "id": "question_uuid",
  "messages": [{"role": "user", "content": "..."}, ...],
  "metadata": {
    "task": "category_name/task_name",
    "language": "en"
  },
  "evaluation_data": {
    ...
  }
}
```
To run the benchmark, simply run:
```bash
python 02_run_benchmark.py <path_to_benchmark.db> --max_evaluations_per_task <int>
```

The `max_evaluations_per_task` argument is optional; it sets the maximum number of evaluations per task.
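For example, to run at most 50 evaluations per task on a database saved as `phare.db` (file name illustrative):

```bash
python 02_run_benchmark.py phare.db --max_evaluations_per_task 50
```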
The Phare benchmark is also designed to be easily extensible, thanks to LM-eval. You can add new categories, tasks, and models by following the instructions below.
To add a new task, follow these steps:

- Add it to the `benchmark_config.yaml` file, with the correct `data_path` and a list of tasks.
- Implement the required scorers used by the category's tasks in the `scorers` folder and add them to the `SCORERS` registry inside `scorers/get_scorer.py` (see the sketch after this list).
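As a rough sketch of what a new scorer could look like (the class name, method signature, and `evaluation_data` key below are assumptions for illustration; the real interface is whatever the existing scorers in the `scorers` folder follow):

```python
# scorers/exact_match.py — hypothetical example; mirror the structure
# of the existing scorers in the `scorers` folder.

class ExactMatchScorer:
    """Scores 1.0 when the model answer exactly matches the reference, else 0.0."""

    def score(self, model_output: str, evaluation_data: dict) -> float:
        # `evaluation_data` comes from the task's JSONL entries (see above);
        # the `reference_answer` key is an assumption for this example.
        reference = evaluation_data.get("reference_answer", "")
        return 1.0 if model_output.strip().lower() == reference.strip().lower() else 0.0


# In scorers/get_scorer.py, register the scorer under the name referenced
# by the `scorer` field in benchmark_config.yaml (registry shape assumed):
# SCORERS = {
#     ...,
#     "exact_match": ExactMatchScorer,
# }
```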
To add a new model, simply add it to the `benchmark_config.yaml` file, under the `models` key. You can also change the evaluation models via the `evaluation_models` key.
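For instance (the model identifiers are illustrative, and the exact entry format may differ; check the config shipped with the repo):

```yaml
models:
  - gpt-4o
  - claude-3-7-sonnet
evaluation_models:
  - gpt-4o-mini
```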
