A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
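A test run includes the provider and model, the temperature, the dataclass, the prompt, the normalized score, the evaluation metrics, and the token usage and cost.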
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
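To make the idea of a run as a comparable record concrete, here is a minimal sketch of how one run's settings could be held in code. The class name `RunRecord` and its field names are illustrative assumptions, not the benchmark project's actual schema; the values are taken from the first run listed on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """One benchmark execution (hypothetical structure, for illustration only)."""
    provider: str
    model: str
    temperature: float
    dataclass_name: str                  # the "Dataclass" row, e.g. "MagazinePage"
    normalized_score: float              # percent, in the range 0-100
    test_time_s: Optional[float] = None  # None when the page reports "unknown"


# Values copied from the first run on this page.
run = RunRecord(
    provider="mistral",
    model="pixtral-large-2411",
    temperature=0.0,
    dataclass_name="MagazinePage",
    normalized_score=7.90,
)
```

Freezing the record is a design choice in this sketch: a run describes something that already happened, so its fields should not change after creation.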
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | pixtral-large-2411 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 7.90 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
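The `{width}` and `{height}` placeholders suggest the prompt is a template that is filled in per page. A minimal sketch of that substitution, assuming plain Python string formatting (the function name `build_prompt` and the page dimensions are hypothetical):

```python
PROMPT_TEMPLATE = (
    "Extract all advertisements and return their bounding boxes.\n"
    "The original size of the page is {width} x {height} pixels."
)


def build_prompt(width: int, height: int) -> str:
    """Fill the page dimensions into the prompt template."""
    return PROMPT_TEMPLATE.format(width=width, height=height)


# Hypothetical page size, chosen only for illustration.
prompt = build_prompt(width=2000, height=3000)
```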
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 12.3K IT + 4.8K OT = 17.1K TT | Cost: 0.025$ + 0.029$ = 0.053$ |
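The metrics table lists micro precision, micro recall, and F1 alongside TP, FP, and FN counts. As a rough sketch, these follow from the counts by the standard definitions below. How a predicted bounding box is matched to a ground-truth box (and therefore counted as a TP) is not specified on this page and is left out; the function name `micro_metrics` and the example counts are hypothetical.

```python
def micro_metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard micro-averaged precision, recall, and F1 from pooled counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 8 correct boxes, 2 spurious, 4 missed.
m = micro_metrics(tp=8, fp=2, fn=4)
```

The zero-denominator guards matter here because a run with no valid result has no predictions at all.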
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | genai |
| Model | gemini-2.5-flash-lite-preview-09-2025 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 13.9K IT + 5.4K OT = 19.4K TT | Cost: 0.001$ + 0.002$ = 0.004$ |
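In the token line, IT, OT, and TT appear to stand for input, output, and total tokens. The displayed parts do not always add up exactly (13.9K + 5.4K is shown as 19.4K), which would be consistent with each figure being rounded separately from exact counts. The sketch below illustrates that effect; the raw counts and the helper `fmt_k` are assumptions, chosen only so that the rounded values match the line above.

```python
def fmt_k(tokens: int) -> str:
    """Format a token count in thousands with one decimal place."""
    return f"{tokens / 1000:.1f}K"


# Hypothetical exact counts; the page only shows rounded values.
input_tokens = 13_940
output_tokens = 5_420
total_tokens = input_tokens + output_tokens  # 19_360

line = (
    f"Tokens: {fmt_k(input_tokens)} IT + "
    f"{fmt_k(output_tokens)} OT = {fmt_k(total_tokens)} TT"
)
```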
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | anthropic |
| Model | claude-opus-4-6 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 21.50 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 111.4K IT + 3.8K OT = 115.2K TT | Cost: 0.557$ + 0.096$ = 0.653$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openrouter |
| Model | qwen/qwen3-vl-8b-thinking |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 128.5K IT + 4.8K OT = 133.3K TT | Cost: 0.015$ + 0.006$ = 0.022$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | mistral-small-2506 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 6.60 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 11.7K IT + 8.7K OT = 20.4K TT | Cost: 0.001$ + 0.003$ = 0.004$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | gpt-4.1-nano-2025-04-14 |
| Temperature | 1.0 |
| Dataclass | MagazinePage |
| Normalized Score | 2.20 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 175.9K IT + 1.4K OT = 177.2K TT | Cost: 0.018$ + 0.001$ = 0.018$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | genai |
| Model | gemini-3-flash-preview |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 84.80 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 52.3K IT + 3.2K OT = 55.5K TT | Cost: 0.026$ + 0.010$ = 0.036$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | qwen/qwen3-vl-8b-instruct |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 131.2K IT + 4.4K OT = 135.6K TT | Cost: 0.000$ + 0.000$ = 0.004$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | qwen/qwen3-vl-30b-a3b-instruct |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 131.2K IT + 39.4K OT = 170.6K TT | Cost: 0.000$ + 0.000$ = 0.021$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | o3-2025-04-16 |
| Temperature | 1.0 |
| Dataclass | MagazinePage |
| Normalized Score | 75.40 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 54.0K IT + 31.9K OT = 85.9K TT | Cost: 0.108$ + 0.255$ = 0.363$ |
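Because every run above shares the same task and prompt, the runs can be compared directly. The following sketch ranks them by normalized score, using the model names, scores, and total costs exactly as listed on this page; the tuple layout and variable names are illustrative choices, not part of the benchmark tooling.

```python
# (model, normalized score in %, total cost in $), copied from the runs above.
runs = [
    ("pixtral-large-2411", 7.90, 0.053),
    ("gemini-2.5-flash-lite-preview-09-2025", 0.00, 0.004),
    ("claude-opus-4-6", 21.50, 0.653),
    ("qwen/qwen3-vl-8b-thinking", 0.00, 0.022),
    ("mistral-small-2506", 6.60, 0.004),
    ("gpt-4.1-nano-2025-04-14", 2.20, 0.018),
    ("gemini-3-flash-preview", 84.80, 0.036),
    ("qwen/qwen3-vl-8b-instruct", 0.00, 0.004),
    ("qwen/qwen3-vl-30b-a3b-instruct", 0.00, 0.021),
    ("o3-2025-04-16", 75.40, 0.363),
]

# Highest score first; sorted() is stable, so ties keep their page order.
ranked = sorted(runs, key=lambda r: r[1], reverse=True)

for model, score, cost in ranked:
    print(f"{score:6.2f} %  {cost:.3f}$  {model}")
```

Note that most runs use temperature 0.0 while the two OpenAI-hosted models use 1.0, so the ranking compares configurations, not models in isolation.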