A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
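A test run includes the benchmark's tags, the provider, model, temperature, and dataclass, the prompt, the resulting scores, and the token usage and cost.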
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
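As a rough illustration of such a comparison, the sketch below stores a few of the runs listed on this page in a small record type and ranks them by normalized score. The `RunRecord` class and the `rank_runs` helper are hypothetical names, not part of the benchmark's own code; the values are copied from the run entries on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record for one test run; field names are illustrative."""
    provider: str
    model: str
    temperature: float
    normalized_score: float  # percent, as reported on this page


def rank_runs(runs):
    """Return the runs sorted from highest to lowest normalized score."""
    return sorted(runs, key=lambda r: r.normalized_score, reverse=True)


runs = [
    RunRecord("openrouter", "qwen/qwen3.7-plus-20260602", 0.0, 10.90),
    RunRecord("mistral", "ministral-14b-2512", 0.0, 36.40),
    RunRecord("genai", "gemini-3.5-flash", 0.0, 68.30),
    RunRecord("openai", "gpt-5.5-2026-04-23", 1.0, 95.60),
]

for run in rank_runs(runs):
    print(f"{run.normalized_score:6.2f} %  {run.provider}  {run.model}")
```

Note that the ranking compares scores only; the runs differ in temperature, so it is not a strictly like-for-like comparison.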
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openrouter |
| Model | qwen/qwen3.7-plus-20260602 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 10.90 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
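The `{width}` and `{height}` placeholders suggest that the prompt is a template filled in per page. A minimal sketch, assuming Python's `str.format` is used (the benchmark's actual templating mechanism is not shown on this page) and using made-up page dimensions:

```python
# Prompt text as shown in the run entry; the placeholders are filled per page.
PROMPT_TEMPLATE = (
    "Extract all advertisements and return their bounding boxes.\n"
    "The original size of the page is {width} x {height} pixels."
)


def build_prompt(width: int, height: int) -> str:
    """Fill the page dimensions into the prompt template."""
    return PROMPT_TEMPLATE.format(width=width, height=height)


# Hypothetical page size, chosen only for illustration.
prompt = build_prompt(2480, 3508)
print(prompt)
```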
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 128.2K IT + 7.8K OT = 136.0K TT | Cost: 0.051$ + 0.012$ = 0.064$ |
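The metric columns listed for each run (micro precision, micro recall, F1 micro, TP, FP, FN) are related by the standard micro-averaging definitions. The sketch below shows that arithmetic under stated assumptions: the counts are invented, since the page reports `n/a`, and how a predicted bounding box is matched to a ground-truth box to count as a true positive is not specified here.

```python
def micro_scores(counts):
    """Compute micro precision, recall and F1 from per-page (tp, fp, fn) counts.

    Micro averaging sums the counts over all pages first and then
    computes the ratios once on the totals.
    """
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for two pages: (TP, FP, FN).
page_counts = [(3, 1, 1), (1, 0, 3)]
precision, recall, f1 = micro_scores(page_counts)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.500 f1=0.615
```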
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | ministral-14b-2512 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 36.40 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 112.9K IT + 4.8K OT = 117.8K TT | Cost: 0.023$ + 0.001$ = 0.024$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | magistral-medium-2509 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 7.60 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 49.1K IT + 4.0K OT = 53.1K TT | Cost: 0.098$ + 0.020$ = 0.118$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | pixtral-large-2411 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 1.10 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 29.5K IT + 1.2K OT = 30.7K TT | Cost: 0.059$ + 0.007$ = 0.066$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | mistral |
| Model | ministral-8b-2512 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 23.50 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 112.9K IT + 9.8K OT = 122.7K TT | Cost: 0.017$ + 0.001$ = 0.018$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | genai |
| Model | gemini-3.5-flash |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 68.30 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 52.3K IT + 5.4K OT = 57.7K TT | Cost: 0.078$ + 0.048$ = 0.127$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openai |
| Model | gpt-5.5-2026-04-23 |
| Temperature | 1.0 |
| Dataclass | MagazinePage |
| Normalized Score | 95.60 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 158.1K IT + 52.5K OT = 210.7K TT | Cost: 0.791$ + 1.575$ = 2.366$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openrouter |
| Model | qwen/qwen3.6-plus-04-02 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 27.20 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 128.5K IT + 101.0K OT = 229.4K TT | Cost: 0.042$ + 0.197$ = 0.239$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openrouter |
| Model | qwen/qwen3.5-plus-20260216 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 33.20 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 128.8K IT + 28.6K OT = 157.3K TT | Cost: 0.033$ + 0.045$ = 0.078$ |
{'century': [20], 'document-type': ['newspaper-page'], 'language': ['en'], 'layout': ['prose', 'columns'], 'script': ['latin'], 'task': ['document-understanding'], 'writing': ['printed']}
| Provider | openrouter |
| Model | qwen/qwen3.5-122b-a10b-20260224 |
| Temperature | 0.0 |
| Dataclass | MagazinePage |
| Normalized Score | 0.00 % |
| Test time | unknown seconds |
Extract all advertisements and return their bounding boxes.
The original size of the page is {width} x {height} pixels.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 128.7K IT + 4.4K OT = 133.1K TT | Cost: 0.033$ + 0.009$ = 0.043$ |
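The token and cost lines at the end of each run follow the pattern input + output = total (IT, OT, TT). The sketch below shows one way such a cost line could be derived, assuming per-million-token pricing; the `run_cost` helper and the prices used are hypothetical, since the page lists the pricing date as `n/a` and does not state the rates applied.

```python
def run_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input cost, output cost, total cost) in dollars.

    Prices are given per one million tokens (an assumption for this sketch).
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost, input_cost + output_cost


# Hypothetical token counts and prices, chosen only for illustration.
in_cost, out_cost, total = run_cost(100_000, 10_000, 1.00, 2.00)
print(f"Cost: {in_cost:.3f}$ + {out_cost:.3f}$ = {total:.3f}$")
# prints: Cost: 0.100$ + 0.020$ = 0.120$
```

Some of the totals on this page differ from the sum of their displayed parts by one unit in the last digit (for example 0.051$ + 0.012$ shown as 0.064$), presumably because each figure is rounded separately for display.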