A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
A test run includes:
- Prompt and role definition – what the model was asked to do and from what perspective (e.g. “as a historian”).
- Model configuration – provider, model version, temperature, and other generation parameters.
- Results – the model’s actual response and its evaluation (scores such as F1 or accuracy).
- Usage and cost data – token counts and calculated API costs.
- Metadata – information such as the test date, the benchmark name, and the person who executed the run.
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
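As a rough illustration, a single test run can be thought of as a structured record containing all of the fields listed above. The sketch below is an assumption for clarity only; the class and field names (`TestRun`, `ModelConfig`, `scores`, `cost_usd`, etc.) are illustrative and not part of any specific tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ModelConfig:
    """Model configuration: provider, model version, and generation parameters."""
    provider: str                 # e.g. "openai", "anthropic", "google"
    model: str                    # e.g. "gpt-4"
    temperature: float = 0.0
    extra_params: dict = field(default_factory=dict)  # other generation settings


@dataclass
class TestRun:
    """A single execution of a benchmark test with a defined model configuration."""
    # Prompt and role definition
    prompt: str                   # what the model was asked to do
    role: str                     # perspective, e.g. "as a historian"
    # Model configuration
    config: ModelConfig
    # Results
    response: str                 # the model's actual output
    scores: dict[str, float]      # evaluation metrics, e.g. {"f1": 0.82, "accuracy": 0.90}
    # Usage and cost data
    input_tokens: int
    output_tokens: int
    cost_usd: float               # API cost calculated from token counts and pricing
    # Metadata
    benchmark: str                # benchmark name
    executed_by: str              # person who executed the run
    executed_at: datetime         # test date
```

Keeping each run as a self-contained record like this is what allows runs from different models, providers, and configurations to be compared side by side and re-executed later under the same conditions.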