A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
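A test run includes the provider, model, temperature, dataclass, and prompt, along with the resulting scores, token usage, and cost.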
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
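As a rough illustration of such a comparison, the sketch below stores a few runs as records and ranks them by fuzzy score. The `RunRecord` class, its field names, and the `rank_by_score` helper are assumptions made for this example, not part of the benchmark's own code; the values are taken from three of the runs listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical container for the fields shown for each test run."""
    provider: str
    model: str
    temperature: float
    fuzzy_score: float  # as reported in the "Fuzzy Score" column
    cost_usd: float     # total cost as reported on the pricing line


# Values copied from three runs on this page.
runs = [
    RunRecord("mistral", "pixtral-large-2411", 0.0, 86.42, 0.368),
    RunRecord("genai", "gemini-2.5-flash", 0.0, 90.78, 0.196),
    RunRecord("anthropic", "claude-opus-4-1-20250805", 0.0, 96.62, 3.592),
]


def rank_by_score(records):
    """Return the records sorted from highest to lowest fuzzy score."""
    return sorted(records, key=lambda r: r.fuzzy_score, reverse=True)


for run in rank_by_score(runs):
    print(f"{run.model:<28} {run.fuzzy_score:6.2f}  {run.cost_usd:.3f}$")
```

Sorting on `cost_usd` instead, or on a ratio of score to cost, would give a different view of the same records.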
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | pixtral-large-2411 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
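The prompt asks for a JSON object with three keys, and the run names `CorrectedAdvert` as its dataclass. A minimal sketch of how such a response might be parsed is shown below. The field types, the `parse_response` helper, and the sample response string are assumptions for illustration; the actual `CorrectedAdvert` definition is not shown on this page.

```python
import json
from dataclasses import dataclass


@dataclass
class CorrectedAdvert:
    """Assumed shape, derived only from the keys named in the prompt."""
    fixed_xml: str
    number_of_fixes: int
    explanation: str


def parse_response(raw: str) -> CorrectedAdvert:
    """Parse a model's JSON reply; raises KeyError or ValueError on bad input."""
    data = json.loads(raw)
    return CorrectedAdvert(
        fixed_xml=str(data["fixed_xml"]),
        number_of_fixes=int(data["number_of_fixes"]),
        explanation=str(data["explanation"]),
    )


# Hypothetical reply, invented for this example.
example = (
    '{"fixed_xml": "<advert><p>To be sold</p></advert>", '
    '"number_of_fixes": 1, '
    '"explanation": "Closed the p tag."}'
)
advert = parse_response(example)
print(advert.number_of_fixes)
```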
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 86.42 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 44.8K IT + 46.4K OT = 91.2K TT | Cost: 0.090$ + 0.278$ = 0.368$ |
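The token and cost figures on the pricing line are simple sums: input tokens (IT) plus output tokens (OT) give total tokens (TT), and input cost plus output cost gives total cost. The sketch below checks this against the figures of the run above and derives an implied price per million tokens. The function names are hypothetical, and the implied rate is only approximate because the page shows rounded values.

```python
def total_tokens_k(input_k: float, output_k: float) -> float:
    """Total tokens in thousands, rounded to one decimal as on the page."""
    return round(input_k + output_k, 1)


def total_cost(input_cost: float, output_cost: float) -> float:
    """Total cost in dollars, rounded to three decimals as on the page."""
    return round(input_cost + output_cost, 3)


def implied_rate_per_million(cost_usd: float, tokens_k: float) -> float:
    """Approximate dollars per million tokens implied by a cost/token pair."""
    return cost_usd / (tokens_k * 1_000) * 1_000_000


# Figures from the pixtral-large-2411 run above.
print(total_tokens_k(44.8, 46.4))   # total tokens (K)
print(total_cost(0.090, 0.278))     # total cost ($)
print(round(implied_rate_per_million(0.090, 44.8), 2))  # approx. input $/M tokens
```

Because the displayed costs are rounded to three decimals, very cheap runs can show components of 0.000$ that still add up to a nonzero total.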
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 90.78 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 31.5K IT + 74.8K OT = 106.3K TT | Cost: 0.009$ + 0.187$ = 0.196$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-large-2411 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 79.08 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 44.8K IT + 69.3K OT = 114.1K TT | Cost: 0.090$ + 0.416$ = 0.505$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | qwen/qwen3-vl-8b-instruct |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 54.90 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 37.2K IT + 27.4K OT = 64.6K TT | Cost: 0.000$ + 0.000$ = 0.006$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-5.1-2025-11-13 |
| Temperature | 1.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 95.68 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 33.6K IT + 36.9K OT = 70.5K TT | Cost: 0.000$ + 0.000$ = 0.000$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | meta-llama/llama-4-maverick |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 67.18 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 29.7K IT + 24.8K OT = 54.5K TT | Cost: 0.000$ + 0.000$ = 0.003$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | qwen/qwen3-vl-8b-thinking |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 1.91 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 21.0K IT + 23.2K OT = 44.2K TT | Cost: 0.000$ + 0.000$ = 0.000$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | anthropic |
| Model | claude-opus-4-1-20250805 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 96.62 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 60.4K IT + 35.8K OT = 96.2K TT | Cost: 0.907$ + 2.685$ = 3.592$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-4.1-mini-2025-04-14 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 90.13 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 33.9K IT + 61.1K OT = 95.0K TT | Cost: 0.000$ + 0.000$ = 0.000$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | genai |
| Model | gemini-2.5-flash-lite-preview-09-2025 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro / macro | Micro precision/recall | True/False Positives | |||||
| 67.71 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Micro Precision | Micro Recall | Instances | TP | FP | FN | |||
| Pricing Date: n/a, n/a. | Tokens: 31.5K IT + 816.8K OT = 848.3K TT | Cost: 0.003$ + 0.327$ = 0.330$ |