A test run is a single execution of a benchmark test using a defined model configuration.
Each run represents how a particular large language model (LLM) — such as GPT-4, Claude-3, or Gemini — performed on a given task at a specific time, with specific settings.
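As the records below show, a test run includes the provider, model, temperature, dataclass, normalized score, detailed metrics, token usage, and cost.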
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
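To make the notion of a test run concrete, here is a minimal sketch of one run as a Python record. The class name `RunRecord`, its field names, and the `config_key` helper are assumptions made for illustration; they are not the benchmark's actual schema. The values are copied from the first record listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One test run; field names are illustrative, not the benchmark's schema."""
    provider: str
    model: str
    temperature: float
    dataclass_name: str       # the "Dataclass" row, e.g. "Document"
    normalized_score: float   # percent, 0-100
    input_tokens_k: float     # "IT", in thousands
    output_tokens_k: float    # "OT", in thousands
    total_cost_usd: float

    def config_key(self) -> tuple:
        """Identify the model configuration so runs can be grouped or compared."""
        return (self.provider, self.model, self.temperature)


# Values copied from the first record on this page.
run = RunRecord(
    provider="mistral",
    model="mistral-medium-3.5",
    temperature=0.0,
    dataclass_name="Document",
    normalized_score=64.22,
    input_tokens_k=15.0,
    output_tokens_k=8.6,
    total_cost_usd=0.087,
)
print(run.config_key())
```

Freezing the record reflects the idea that a run is a fixed observation of one configuration at one point in time, so it should not be edited after the fact.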
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | mistral-medium-3.5 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 64.22 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.64 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 15.0K IT + 8.6K OT = 23.6K TT | Cost: 0.023$ + 0.065$ = 0.087$ |
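The token and cost row above follows a fixed pattern (input plus output equals total), so it can be read programmatically. The sketch below is one way to do that, assuming the row format seen in these records; `parse_usage_row` and `is_consistent` are hypothetical helpers, not part of any published tool. Because every displayed figure is rounded, the parts may differ from the printed total by about one unit in the last digit, which the tolerance check allows for.

```python
import re

# Pattern inferred from the usage rows on this page (an assumption, not a spec).
USAGE_ROW = re.compile(
    r"Tokens:\s*([\d.]+)K IT \+ ([\d.]+)K OT = ([\d.]+)K TT\s*\|\s*"
    r"Cost:\s*([\d.]+)\$ \+ ([\d.]+)\$ = ([\d.]+)\$"
)


def parse_usage_row(row: str) -> dict:
    """Extract token counts (in thousands) and costs (in dollars) from a usage row."""
    match = USAGE_ROW.search(row)
    if match is None:
        raise ValueError(f"unrecognized usage row: {row!r}")
    keys = ("input_k", "output_k", "total_k",
            "input_cost", "output_cost", "total_cost")
    return dict(zip(keys, map(float, match.groups())))


def is_consistent(first: float, second: float, total: float, step: float) -> bool:
    """Check first + second == total, allowing for rounding of all three figures."""
    return abs(first + second - total) <= 1.5 * step + 1e-9


row = ("| Pricing Date: n/a, n/a. | Tokens: 15.0K IT + 8.6K OT = 23.6K TT "
       "| Cost: 0.023$ + 0.065$ = 0.087$ |")
usage = parse_usage_row(row)

tokens_ok = is_consistent(usage["input_k"], usage["output_k"], usage["total_k"], 0.1)
cost_ok = is_consistent(usage["input_cost"], usage["output_cost"], usage["total_cost"], 0.001)
print(usage, tokens_ok, cost_ok)
```

Here the displayed cost parts sum to 0.088 while the printed total is 0.087, a gap that is consistent with rounding each figure to three decimals.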
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | mistral-large-2512 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 61.50 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.62 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 15.0K IT + 8.7K OT = 23.7K TT | Cost: 0.008$ + 0.013$ = 0.020$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | mistral-medium-2505 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 67.32 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.67 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 15.0K IT + 8.9K OT = 23.9K TT | Cost: 0.006$ + 0.018$ = 0.024$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | mistral-medium-2508 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 77.10 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.77 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 15.0K IT + 9.3K OT = 24.3K TT | Cost: 0.006$ + 0.019$ = 0.025$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | magistral-medium-2509 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 53.82 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.54 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 15.0K IT + 5.4K OT = 20.4K TT | Cost: 0.030$ + 0.027$ = 0.057$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-3.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 62.99 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.63 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 5.8K IT + 10.7K OT = 16.5K TT | Cost: 0.009$ + 0.096$ = 0.105$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5.5-2026-04-23 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 71.55 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.72 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 18.1K IT + 15.2K OT = 33.3K TT | Cost: 0.091$ + 0.456$ = 0.547$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | qwen/qwen3.5-9b-20260310 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 0.00 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.00 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 16.5K IT + 71.9K OT = 88.4K TT | Cost: 0.002$ + 0.011$ = 0.012$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | qwen/qwen3.5-397b-a17b-20260216 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 59.79 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.60 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 15.1K IT + 33.7K OT = 48.9K TT | Cost: 0.006$ + 0.079$ = 0.085$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | anthropic |
| Model | claude-opus-4-7 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 69.89 % |
| Test time | unknown |
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 0.70 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 28.6K IT + 9.5K OT = 38.0K TT | Cost: 0.143$ + 0.237$ = 0.380$ |
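Since all ten runs above share the same task and temperature, they can be compared directly. The following sketch ranks them by normalized score and picks the run with the most score points per dollar. The tuples are copied from the records above; the "score per dollar" ratio is an illustrative measure chosen here, not a metric the benchmark reports, and it is only approximate because each figure comes from a single run with rounded costs.

```python
# (model, normalized score in %, total cost in USD), copied from the records above.
runs = [
    ("mistral-medium-3.5", 64.22, 0.087),
    ("mistral-large-2512", 61.50, 0.020),
    ("mistral-medium-2505", 67.32, 0.024),
    ("mistral-medium-2508", 77.10, 0.025),
    ("magistral-medium-2509", 53.82, 0.057),
    ("gemini-3.5-flash", 62.99, 0.105),
    ("gpt-5.5-2026-04-23", 71.55, 0.547),
    ("qwen/qwen3.5-9b-20260310", 0.00, 0.012),
    ("qwen/qwen3.5-397b-a17b-20260216", 59.79, 0.085),
    ("claude-opus-4-7", 69.89, 0.380),
]


def score_per_dollar(run: tuple) -> float:
    """Illustrative value measure: score points obtained per dollar spent."""
    _, score, cost = run
    return score / cost  # every cost listed above is non-zero


# Highest normalized score first.
ranked = sorted(runs, key=lambda run: run[1], reverse=True)
best_value = max(runs, key=score_per_dollar)

for model, score, cost in ranked:
    print(f"{model:35s} {score:6.2f} %  {cost:.3f} $")
print("best score per dollar:", best_value[0])
```

A ratio like this rewards cheap runs heavily, so it is best read alongside the raw score rather than in place of it.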