RISE Humanities Data Benchmark, 0.4.0

Search Test Runs

A test run is a single execution of a benchmark test using a defined model configuration.
Each run represents how a particular large language model (LLM) — such as GPT-4, Claude-3, or Gemini — performed on a given task at a specific time, with specific settings.

A test run includes the following (see the sketch after this list):

  • Prompt and role definition – what the model was asked to do and from what perspective (e.g. “as a historian”).
  • Model configuration – provider, model version, temperature, and other generation parameters.
  • Results – the model’s actual response and its evaluation (scores such as F1 or accuracy).
  • Usage and cost data – token counts and calculated API costs.
  • Metadata – information like the test date, benchmark name, and person who executed it.
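To make these pieces concrete, the sketch below shows how a single test run could be represented as a record. This is a minimal illustration with hypothetical field names; it is not the actual RISE schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical sketch of a test run record. Field names are
# illustrative assumptions, not the actual RISE schema.
@dataclass
class TestRun:
    benchmark: str        # benchmark name
    prompt: str           # the full prompt sent to the model
    role: str             # system instruction, e.g. "You are a historian"
    provider: str         # LLM service provider
    model: str            # model version, e.g. "gpt-4"
    temperature: float    # sampling temperature (0 = deterministic)
    response: str         # the model's actual answer
    score: float          # normalized evaluation score, 0-100
    input_tokens: int     # tokens in the prompt
    output_tokens: int    # tokens in the response
    cost_usd: float       # calculated API cost in USD
    executed_on: date     # date the run was executed
    executed_by: str      # person who executed the run
    tags: list[str] = field(default_factory=list)  # benchmark tags
```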

Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.

Select the benchmark to search within. Leave blank to include all benchmarks.
Choose a specific test within the selected benchmark. Each test defines a concrete prompt, temperature, dataclass, and so on.
Search for a phrase or keyword occurring in the test prompt. Useful for finding runs that involve specific instructions.
Select one or more LLM service providers.
Filter runs by the role or system instruction assigned to the model (e.g. "You are a historian").
Choose the specific model used in the run.
Search for tags or labels associated with the benchmark (and thus with the test).
Filter by the person who designed, annotated, scored, executed, or uploaded the test run.
Filter by the normalized evaluation score (0–100). Use comparison operators (=, >, <) to set thresholds; a sketch of this filtering pattern appears at the end of this page.
Comparison operator
Restrict results to test runs executed on a specific date or within a date range.
Date comparison
Filter by the sampling temperature used during generation (0 = deterministic, 1 = creative).
Comparison operator
Filter by the total API cost (USD) of a test run. Use comparison operators to set limits.
Comparison operator
Filter by the total number of tokens processed (input + output). Useful for performance or cost analysis.
Comparison operator
If enabled, hidden test runs are also shown: runs from test benchmarks and legacy tests.
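The numeric filters above (score, temperature, cost, and token count) all follow the same pattern: a comparison operator plus a threshold. Below is a minimal sketch of how such a filter could be applied, assuming the hypothetical TestRun record sketched earlier; the operator mapping and function name are illustrative assumptions, not the actual search implementation.

```python
import operator

# Map the comparison operators offered in the search form to Python
# comparison functions; this mapping is an illustrative assumption.
OPERATORS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def filter_runs(runs, attr, op, threshold):
    """Keep only runs whose numeric attribute satisfies `<attr> <op> <threshold>`."""
    compare = OPERATORS[op]
    return [run for run in runs if compare(getattr(run, attr), threshold)]

# Example: all runs with a normalized score above 80.
# high_scoring = filter_runs(runs, "score", ">", 80)
```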