A test run is a single execution of a benchmark test using a defined model configuration.
Each run represents how a particular large language model (LLM) — such as GPT-4, Claude-3, or Gemini — performed on a given task at a specific time, with specific settings.
A test run includes:
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5-mini |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 64.94 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.65 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-4.1-nano |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 32.97 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.33 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-4.1-mini |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 61.46 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.61 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-4.1 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 54.44 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.54 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | anthropic |
| Model | claude-opus-4-20250514 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 32.67 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.33 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | o3 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 64.61 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.65 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5-nano |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 33.02 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.33 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | anthropic |
| Model | claude-opus-4-1-20250805 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 29.62 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.30 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 62.65 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.63 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5-mini |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 63.76 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | Tue/False Positives | |||||
| 0.64 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |