A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
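A test run includes the provider, model, temperature, dataclass, prompt, scores, and token usage and cost, as listed in each record below.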
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
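As a rough illustration of how such runs could be represented and compared, here is a minimal Python sketch. The class `BenchmarkRun`, its field names, and the helper `rank_runs` are hypothetical and not part of the benchmark's own code. The sample values are copied from three of the records below, with token counts taken from the rounded figures as displayed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkRun:
    """Hypothetical record of one test run; field names are illustrative only."""
    provider: str
    model: str
    temperature: float
    dataclass_name: str                  # expected output schema, e.g. "MagazinePage"
    normalized_score: float              # percent, 0-100
    input_tokens: int                    # approximate, from the rounded display value
    output_tokens: int                   # approximate, from the rounded display value
    cost_usd: float
    test_time_s: Optional[float] = None  # None when the duration was not recorded

    @property
    def total_tokens(self) -> int:
        # Total tokens are the sum of input and output tokens.
        return self.input_tokens + self.output_tokens


def rank_runs(runs: List[BenchmarkRun]) -> List[BenchmarkRun]:
    """Return the runs ordered from highest to lowest normalized score."""
    return sorted(runs, key=lambda r: r.normalized_score, reverse=True)


runs = [
    BenchmarkRun("mistral", "magistral-small-2509", 0.0, "MagazinePage", 0.0, 5100, 180, 0.003),
    BenchmarkRun("openai", "gpt-5.1-2025-11-13", 1.0, "MagazinePage", 56.3, 51000, 2500, 0.089),
    BenchmarkRun("x-ai", "grok-4.20-0309-reasoning", 0.0, "MagazinePage", 49.1, 24400, 2700, 0.065),
]

best = rank_runs(runs)[0]
print(best.provider, best.model, best.normalized_score)
```

Keeping the configuration (provider, model, temperature) in the same record as the outcome (score, tokens, cost) is what lets two runs be compared like for like.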
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | magistral-small-2509 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 5.1K input + 180 output = 5.3K total | Cost: $0.003 + $0.000 = $0.003 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | anthropic |
| Model | claude-sonnet-4-5-20250929 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 2.20 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 111.4K input + 4.4K output = 115.8K total | Cost: $0.334 + $0.066 = $0.400 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | genai |
| Model | gemini-2.0-flash-lite |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 9.80 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 88.9K input + 42.3K output = 131.2K total | Cost: $0.007 + $0.013 = $0.019 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | mistral-medium-2505 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 11.7K input + 10.5K output = 22.2K total | Cost: $0.005 + $0.021 = $0.026 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | x-ai |
| Model | grok-4.20-0309-reasoning |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 49.10 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 24.4K input + 2.7K output = 27.1K total | Cost: $0.049 + $0.016 = $0.065 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | ministral-8b-2512 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.30 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 11.7K input + 24.1K output = 35.7K total | Cost: $0.002 + $0.004 = $0.005 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | gpt-5.1-2025-11-13 |
| Temperature | 1.0 |
| Dataclass | MagazinePage |
| Normalized Score | 56.30 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 51.0K input + 2.5K output = 53.5K total | Cost: $0.064 + $0.025 = $0.089 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | mistral-large-2512 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 2.10 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 11.7K input + 14.6K output = 26.3K total | Cost: $0.006 + $0.022 = $0.028 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | meta-llama/llama-4-maverick-17b-128e-instruct |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 2.20 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 121.5K input + 3.1K output = 124.6K total | Cost: $0.000 + $0.000 = $0.002 |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | gpt-4.1-2025-04-14 |
| Temperature | 1.0 |
| Dataclass | MagazinePage |
| Normalized Score | 9.80 % |
| Test time | unknown |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 60.0K input + 1.8K output = 61.8K total | Cost: $0.120 + $0.014 = $0.134 |
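The metric columns reported for each run (micro precision, micro recall, F1, TP, FP, FN) are conventionally related as in the sketch below. It uses the standard textbook definitions and assumes the benchmark follows them, which the records above do not confirm. The function name `micro_scores` and the example counts are hypothetical and not taken from any run.

```python
from typing import Tuple


def micro_scores(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from pooled true/false positive and false negative counts."""
    # Precision: share of predicted instances that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of ground-truth instances that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall (0.0 when both are zero).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = micro_scores(tp=3, fp=1, fn=2)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With zero counts everywhere the function returns 0.0 for all three values rather than raising a division error. That is one reasonable convention, whereas the tables above show n/a when no valid result was produced.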