A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
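A test run includes its tags, the provider, model, temperature, and dataclass used, the prompt, the resulting scores, and the token usage and cost.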
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
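As a rough illustration of how such runs might be compared, the sketch below models a run as a small Python record and ranks a few of the runs listed on this page by score. The class name `BenchmarkRun`, its field names, and the helper `rank_runs` are hypothetical and chosen for this example only; they are not the project's actual schema. The values are copied from the records on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """One benchmark execution. Field names are illustrative assumptions."""
    provider: str
    model: str
    temperature: float
    dataclass_name: str
    normalized_score: float  # percent, as shown in the "Normalized Score" row
    total_cost: float        # total cost as shown in the "Cost" summary


def rank_runs(runs: list[BenchmarkRun]) -> list[BenchmarkRun]:
    """Order runs from highest to lowest score, breaking ties by lower cost."""
    return sorted(runs, key=lambda r: (-r.normalized_score, r.total_cost))


# Three runs taken from the records listed on this page.
runs = [
    BenchmarkRun("mistral", "mistral-large-2411", 0.0, "MagazinePage", 0.00, 0.178),
    BenchmarkRun("openai", "gpt-5-nano-2025-08-07", 1.0, "MagazinePage", 50.20, 0.039),
    BenchmarkRun("openai", "gpt-4o-mini-2024-07-18", 1.0, "MagazinePage", 7.60, 0.256),
]

for run in rank_runs(runs):
    print(f"{run.model}: {run.normalized_score:.2f} % at a cost of {run.total_cost:.3f}")
```

Keeping the record immutable (`frozen=True`) is a design choice for this sketch: a finished run is a fixed observation, so nothing should modify it after the fact.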
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | gpt-5-nano-2025-08-07 |
| Temperature | 1.0 |
| Dataclass | MagazinePage |
| Normalized Score | 50.20 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 110.7K IT + 82.6K OT = 193.3K TT | Cost: 0.006$ + 0.033$ = 0.039$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | mistral-large-2411 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 12.3K IT + 25.7K OT = 37.9K TT | Cost: 0.025$ + 0.154$ = 0.178$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | ministral-14b-2512 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 11.7K IT + 506 OT = 12.2K TT | Cost: 0.002$ + 0.000$ = 0.002$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | genai |
| Model | gemini-3.1-flash-lite-preview |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 52.3K IT + 5.1K OT = 57.4K TT | Cost: 0.013$ + 0.008$ = 0.021$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | gpt-4o-mini-2024-07-18 |
| Temperature | 1.0 |
| Dataclass | MagazinePage |
| Normalized Score | 7.60 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 1.7M IT + 1.5K OT = 1.7M TT | Cost: 0.256$ + 0.001$ = 0.256$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | anthropic |
| Model | claude-haiku-4-5-20251001 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 111.4K IT + 4.3K OT = 115.7K TT | Cost: 0.111$ + 0.022$ = 0.133$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | mistral-medium-2508 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.30 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 11.7K IT + 13.8K OT = 25.5K TT | Cost: 0.005$ + 0.028$ = 0.032$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | anthropic |
| Model | claude-opus-4-20250514 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 2.20 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 98.8K IT + 4.5K OT = 103.3K TT | Cost: 1.482$ + 0.335$ = 1.817$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | genai |
| Model | gemini-2.5-flash-lite-preview-09-2025 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 13.9K IT + 5.4K OT = 19.4K TT | Cost: 0.001$ + 0.002$ = 0.004$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | genai |
| Model | gemini-2.5-flash-lite |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 13.9K IT + 397.9K OT = 411.8K TT | Cost: 0.001$ + 0.159$ = 0.161$ |
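The token summaries in the records above use abbreviated counts with `K` and `M` suffixes, and the labels `IT`, `OT`, and `TT` appear to stand for input, output, and total tokens (an assumption; the page does not define them). The sketch below shows one possible way to turn such a summary row into plain integers for further analysis. The function names `parse_count` and `parse_token_summary` are hypothetical helpers written for this example, not part of any benchmark tooling.

```python
import re

# Multipliers for the abbreviated suffixes seen in the token summaries.
_SUFFIXES = {"K": 1_000, "M": 1_000_000}

# Matches e.g. "Tokens: 11.7K IT + 506 OT = 12.2K TT".
_TOKEN_LINE = re.compile(
    r"Tokens:\s*(?P<it>[\d.]+[KM]?) IT \+ (?P<ot>[\d.]+[KM]?) OT = (?P<tt>[\d.]+[KM]?) TT"
)


def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '110.7K' or '506' to an integer."""
    text = text.strip()
    if text[-1] in _SUFFIXES:
        return round(float(text[:-1]) * _SUFFIXES[text[-1]])
    return round(float(text))


def parse_token_summary(line: str) -> dict[str, int]:
    """Extract the three token counts (keys 'it', 'ot', 'tt') from a summary row."""
    match = _TOKEN_LINE.search(line)
    if match is None:
        raise ValueError(f"no token summary found in: {line!r}")
    return {key: parse_count(value) for key, value in match.groupdict().items()}


row = "| Pricing Date: n/a, n/a. | Tokens: 11.7K IT + 506 OT = 12.2K TT | Cost: 0.002$ + 0.000$ = 0.002$ |"
counts = parse_token_summary(row)
print(counts)
```

Because the displayed values are rounded, the parsed input and output counts will not always add up exactly to the parsed total; any consistency check on these figures should allow for that rounding.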