A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
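A test run includes the provider, model, temperature, dataclass, scores, test time, and token and cost figures shown in each entry below.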
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
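As a rough illustration of the idea, a test run can be modelled as a small record whose fields mirror the run tables on this page. The sketch below is an assumption for illustration only: the class name `RunRecord`, its field names, and the helper `best_run` are hypothetical and are not taken from the benchmark's own code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record for one test run; fields mirror the run tables."""
    provider: str
    model: str
    temperature: float
    dataclass_name: str
    normalized_score: float  # percent, 0 to 100
    fuzzy_score: Optional[float] = None  # 0 to 1, None when not reported
    tags: dict = field(default_factory=dict)


def best_run(runs):
    """Return the run with the highest normalized score."""
    return max(runs, key=lambda r: r.normalized_score)


# Two runs copied from the entries on this page
runs = [
    RunRecord("genai", "gemini-2.5-pro", 0.0, "Document", 25.79, 0.26),
    RunRecord("anthropic", "claude-opus-4-20250514", 0.0, "Document", 27.41, 0.27),
]
print(best_run(runs).model)
```

Keeping the provider, model, and temperature on every record is what makes two runs comparable: a difference in score can then be traced to a difference in configuration.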
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-pro |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 25.79 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.26 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | anthropic |
| Model | claude-opus-4-20250514 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 27.41 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.27 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | anthropic |
| Model | claude-sonnet-4-20250514 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 15.65 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.16 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | pixtral-large-latest |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 4.37 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.04 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.0-flash-lite |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 12.93 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.13 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.0-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 8.44 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.08 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-4o-mini |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 53.94 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.54 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-4o |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 52.04 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.52 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.0-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 0.00 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.00 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
{'document-type': ['book-page'], 'writing': ['printed'], 'century': [20], 'language': ['en'], 'layout': ['list'], 'entry-type': ['bibliographic'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-4o |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 61.31 % |
| Test time | unknown seconds |
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
|---|---|---|---|---|---|---|---|---|
| 0.61 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: n/a IT + n/a OT = n/a TT | Cost: n/a$ + n/a$ = n/a$ |
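To compare the runs listed above at a glance, they can be sorted by normalized score. The following is a minimal sketch, assuming the scores are transcribed by hand from the run tables; the tuple layout and the variable names `ranked` and `best_per_model` are illustrative choices, not part of the benchmark tooling.

```python
# (provider, model, normalized score in percent), copied from the run tables above
runs = [
    ("genai", "gemini-2.5-pro", 25.79),
    ("anthropic", "claude-opus-4-20250514", 27.41),
    ("anthropic", "claude-sonnet-4-20250514", 15.65),
    ("mistral", "pixtral-large-latest", 4.37),
    ("genai", "gemini-2.0-flash-lite", 12.93),
    ("genai", "gemini-2.0-flash", 8.44),
    ("openai", "gpt-4o-mini", 53.94),
    ("openai", "gpt-4o", 52.04),
    ("genai", "gemini-2.0-flash", 0.00),
    ("openai", "gpt-4o", 61.31),
]

# Highest normalized score first
ranked = sorted(runs, key=lambda r: r[2], reverse=True)

# Best score per model, since some models appear in more than one run
best_per_model = {}
for provider, model, score in runs:
    if score > best_per_model.get(model, -1.0):
        best_per_model[model] = score

for provider, model, score in ranked:
    print(f"{score:6.2f} %  {provider:<10} {model}")
```

Two models (`gpt-4o` and `gemini-2.0-flash`) appear in more than one run with different scores, so a per-model summary has to choose an aggregation; the sketch takes the maximum, which is only one of several reasonable choices.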