A test run is a single execution of a benchmark test using a defined model configuration.
Each run represents how a particular large language model (LLM) — such as GPT-4, Claude-3, or Gemini — performed on a given task at a specific time, with specific settings.
A test run includes:

- the provider and model identifier (e.g. `openai` / `gpt-5-mini-2025-08-07`),
- the sampling temperature and the target dataclass,
- the normalized score and, where available, detailed metrics (fuzzy score, F1 micro/macro, micro precision/recall, true/false positives),
- token usage (input, output, total) and the resulting cost.
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
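As a minimal sketch of what such a run record could look like in code (the field and class names are assumptions, chosen to mirror the fields reported in the result blocks below):

```python
from dataclasses import dataclass


@dataclass
class TestRun:
    """One benchmark execution of a model with fixed settings (hypothetical schema)."""
    provider: str            # e.g. "openai", "mistral", "openrouter"
    model: str               # e.g. "gpt-5-mini-2025-08-07"
    temperature: float       # sampling temperature; 0.0 in all runs below
    dataclass_name: str      # extraction target, e.g. "Document"
    normalized_score: float  # percentage in [0, 100]
    input_tokens: int
    output_tokens: int
    cost_usd: float          # total cost of the run

    @property
    def total_tokens(self) -> int:
        # Total usage is simply input plus output tokens.
        return self.input_tokens + self.output_tokens


# Example values taken from the first run reported below.
run = TestRun("openai", "gpt-5-mini-2025-08-07", 0.0, "Document",
              67.67, 12_300, 23_200, 0.050)
```

A record like this carries everything needed to reproduce or compare a run: who served the model, how it was configured, and what it scored and cost.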
Test set tags: document-type = book-page; writing = printed; century = 20; language = en; layout = list; entry-type = bibliographic; task = information-extraction.
| Field | Value |
|---|---|
| Provider | openai |
| Model | gpt-5-mini-2025-08-07 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 67.67 % |
| Test time | unknown |

No valid structured result; only the fuzzy score is available.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.68 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 12.3K input + 23.2K output = 35.5K total. Cost: $0.003 + $0.046 = $0.050.
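The token and cost lines above are simple sums of the input and output components. A sketch of that bookkeeping (the helper name is an assumption; note that the report displays rounded components, so a recomputed total can differ from the printed one by a rounding cent):

```python
def summarize_usage(input_tokens: int, output_tokens: int,
                    input_cost: float, output_cost: float) -> str:
    """Format a usage line like the pricing rows in this report."""
    total_tokens = input_tokens + output_tokens
    total_cost = input_cost + output_cost
    return (f"Tokens: {input_tokens / 1000:.1f}K IT + {output_tokens / 1000:.1f}K OT "
            f"= {total_tokens / 1000:.1f}K TT | "
            f"Cost: ${input_cost:.3f} + ${output_cost:.3f} = ${total_cost:.3f}")


# Values from the run above; the recomputed total is $0.049 vs. the printed $0.050.
line = summarize_usage(12_300, 23_200, 0.003, 0.046)
```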
| Field | Value |
|---|---|
| Provider | mistral |
| Model | ministral-14b-2512 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 5.56 % |
| Test time | unknown |

No valid structured result; only the fuzzy score is available.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.06 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 4.9K input + 3.5K output = 8.5K total. Cost: $0.001 + $0.001 = $0.002.
| Field | Value |
|---|---|
| Provider | openrouter |
| Model | x-ai/grok-4 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 30.71 % |
| Test time | unknown |

No valid structured result; only the fuzzy score is available.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.31 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 8.3K input + 103.9K output = 112.1K total. Cost: $0.025 + $1.558 = $1.583.
| Field | Value |
|---|---|
| Provider | openrouter |
| Model | qwen/qwen3-vl-8b-instruct |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 53.87 % |
| Test time | unknown |

No valid structured result; only the fuzzy score is available.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.54 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 12.8K input + 13.7K output = 26.4K total. Cost: $0.001 + $0.007 = $0.008.
| Field | Value |
|---|---|
| Provider | mistral |
| Model | ministral-8b-2512 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 4.61 % |
| Test time | unknown |

No valid structured result; only the fuzzy score is available.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.05 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 4.9K input + 2.9K output = 7.8K total. Cost: $0.001 + $0.000 = $0.001.
| Field | Value |
|---|---|
| Provider | mistral |
| Model | magistral-small-2509 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 0.00 % |
| Test time | unknown |

No valid structured result.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.00 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 4.9K input + 5.1K output = 10.1K total. Cost: $0.002 + $0.003 = $0.005.
| Field | Value |
|---|---|
| Provider | openai |
| Model | gpt-4o-mini-2024-07-18 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 28.95 % |
| Test time | unknown |

No valid structured result; only the fuzzy score is available.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.29 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 187.5K input + 9.4K output = 196.9K total. Cost: $0.028 + $0.006 = $0.034.
| Field | Value |
|---|---|
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 66.51 % |
| Test time | unknown |

No valid structured result; only the fuzzy score is available.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.67 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 1.7K input + 14.8K output = 16.6K total. Cost: $0.001 + $0.037 = $0.038.
| Field | Value |
|---|---|
| Provider | mistral |
| Model | magistral-medium-2509 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 0.00 % |
| Test time | unknown |

No valid structured result.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.00 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 4.9K input + 116 output = 5.1K total. Cost: $0.010 + $0.001 = $0.010.
| Field | Value |
|---|---|
| Provider | mistral |
| Model | mistral-large-2411 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 5.46 % |
| Test time | unknown |

No valid structured result; only the fuzzy score is available.

| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.05 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Pricing date: n/a, n/a. Tokens: 5.1K input + 2.5K output = 7.5K total. Cost: $0.010 + $0.015 = $0.025.
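To illustrate the cross-model comparison the introduction describes, a minimal sketch ranking the runs in this section by normalized score (the tuples are copied from the tables above; variable names are assumptions):

```python
# (provider/model, normalized score in %, total cost in USD), from the runs above
runs = [
    ("openai/gpt-5-mini-2025-08-07", 67.67, 0.050),
    ("mistral/ministral-14b-2512", 5.56, 0.002),
    ("openrouter/x-ai/grok-4", 30.71, 1.583),
    ("openrouter/qwen/qwen3-vl-8b-instruct", 53.87, 0.008),
    ("mistral/ministral-8b-2512", 4.61, 0.001),
    ("mistral/magistral-small-2509", 0.00, 0.005),
    ("openai/gpt-4o-mini-2024-07-18", 28.95, 0.034),
    ("genai/gemini-2.5-flash", 66.51, 0.038),
    ("mistral/magistral-medium-2509", 0.00, 0.010),
    ("mistral/mistral-large-2411", 5.46, 0.025),
]

# Rank by score, best first; cost stays visible so score/cost trade-offs are obvious.
ranked = sorted(runs, key=lambda r: r[1], reverse=True)
best = ranked[0]
```

Sorting the same records by cost instead of score (or by score per dollar) is a one-line change to the `key`, which is the kind of transparent comparison the run records are meant to enable.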