RISE Humanities Data Benchmark, 0.4.0

Search Test Runs

A test run is a single execution of a benchmark test using a defined model configuration.
Each run represents how a particular large language model (LLM) — such as GPT-4, Claude-3, or Gemini — performed on a given task at a specific time, with specific settings.

A test run includes the following (see the sketch after this list):

  • Prompt and role definition – what the model was asked to do and from what perspective (e.g. “as a historian”).
  • Model configuration – provider, model version, temperature, and other generation parameters.
  • Results – the model’s actual response and its evaluation (scores such as F1 or accuracy).
  • Usage and cost data – token counts and calculated API costs.
  • Metadata – information like the test date, benchmark name, and person who executed it.
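To make these pieces concrete, the sketch below shows how a single test run could be represented as a record. This is a minimal illustration with hypothetical field names; it is not the actual RISE schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical sketch of a test run record. Field names are
# illustrative assumptions, not the actual RISE schema.
@dataclass
class TestRun:
    benchmark: str        # benchmark name
    prompt: str           # the full prompt sent to the model
    role: str             # system instruction, e.g. "You are a historian"
    provider: str         # LLM service provider
    model: str            # model version, e.g. "gpt-4"
    temperature: float    # sampling temperature (0 = deterministic)
    response: str         # the model's actual answer
    score: float          # normalized evaluation score, 0-100
    input_tokens: int     # tokens in the prompt
    output_tokens: int    # tokens in the response
    cost_usd: float       # calculated API cost in USD
    executed_on: date     # date the run was executed
    executed_by: str      # person who executed the run
    tags: list[str] = field(default_factory=list)  # benchmark tags
```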

Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.

Select the benchmark to search within. Leave blank to include all benchmarks.
Choose a specific test within the selected benchmark. Each test defines a concrete prompt, temperature, dataclass, and so on.
Search for a phrase or keyword occurring in the test prompt. Useful for finding runs that involve specific instructions.
Select one or more LLM service providers.
Filter runs by the role or system instruction assigned to the model (e.g. "You are a historian").
Choose the specific model used in the run.
Search for tags or labels associated with the benchmark (and thus with the test).
Filter by the person who designed, annotated, scored, executed, or uploaded the test run.
Filter by the normalized evaluation score (0–100). Use comparison operators (=, >, <) to set thresholds; a sketch of this filtering pattern appears at the end of this page.
Comparison operator
Restrict results to test runs executed on a specific date or within a date range.
Date comparison
Filter by the sampling temperature used during generation (0 = deterministic, 1 = creative).
Comparison operator
Filter by the total API cost (USD) of a test run. Use comparison operators to set limits.
Comparison operator
Filter by the total number of tokens processed (input + output). Useful for performance or cost analysis.
Comparison operator
If enabled, hidden test runs are also shown: runs from test benchmarks and legacy tests.
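The numeric filters above (score, temperature, cost, and token count) all follow the same pattern: a comparison operator plus a threshold. Below is a minimal sketch of how such a filter could be applied, assuming the hypothetical TestRun record sketched earlier; the operator mapping and function name are illustrative assumptions, not the actual search implementation.

```python
import operator

# Map the comparison operators offered in the search form to Python
# comparison functions; this mapping is an illustrative assumption.
OPERATORS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def filter_runs(runs, attr, op, threshold):
    """Keep only runs whose numeric attribute satisfies `<attr> <op> <threshold>`."""
    compare = OPERATORS[op]
    return [run for run in runs if compare(getattr(run, attr), threshold)]

# Example: all runs with a normalized score above 80.
# high_scoring = filter_runs(runs, "score", ">", 80)
```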