A test run is a single execution of a benchmark test using a defined model configuration.
Each run represents how a particular large language model (LLM) — such as GPT-4, Claude-3, or Gemini — performed on a given task at a specific time, with specific settings.
A test run includes:
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-pro |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 66.80 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.67 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 1.7K IT + 10.1K OT = 11.8K TT | Cost: 0.002$ + 0.101$ = 0.103$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | pixtral-large-2411 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 1.64 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.02 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 5.1K IT + 5.9K OT = 11.0K TT | Cost: 0.010$ + 0.036$ = 0.046$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | mistral-small-2506 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 3.59 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.04 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 4.9K IT + 2.2K OT = 7.1K TT | Cost: 0.000$ + 0.001$ = 0.001$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.0-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 49.75 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.50 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 10.3K IT + 8.7K OT = 19.0K TT | Cost: 0.001$ + 0.003$ = 0.004$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 66.51 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.67 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 1.7K IT + 14.8K OT = 16.6K TT | Cost: 0.001$ + 0.037$ = 0.038$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | anthropic |
| Model | claude-opus-4-1-20250805 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 65.97 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.66 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 14.8K IT + 9.3K OT = 24.1K TT | Cost: 0.222$ + 0.696$ = 0.918$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash-lite-preview-09-2025 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 50.36 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.50 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 1.7K IT + 9.6K OT = 11.3K TT | Cost: 0.000$ + 0.004$ = 0.004$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | mistral-small-2506 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 3.48 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.03 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 4.9K IT + 1.9K OT = 6.9K TT | Cost: 0.000$ + 0.001$ = 0.001$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5.1-2025-11-13 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 52.22 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.52 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 7.9K IT + 8.3K OT = 16.2K TT | Cost: 0.010$ + 0.083$ = 0.093$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 66.92 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.67 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 1.7K IT + 13.2K OT = 15.0K TT | Cost: 0.001$ + 0.033$ = 0.034$ |