A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
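A test run includes the test's metadata tags, the model configuration (provider, model, temperature, and dataclass), the prompt, the resulting scores, and the token usage and cost.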
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
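A run record of this kind can be pictured as a small data structure. The sketch below is an illustration only: the `TestRun` class and its field names are assumptions made for this example, not the benchmark's actual schema. The values are copied from three of the runs listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestRun:
    """Hypothetical record of one benchmark execution (field names are assumed)."""
    provider: str
    model: str
    temperature: float
    fuzzy_score: float  # percentage in the range 0-100


def rank_runs(runs):
    """Return the runs sorted from highest to lowest fuzzy score."""
    return sorted(runs, key=lambda run: run.fuzzy_score, reverse=True)


runs = [
    TestRun("genai", "gemini-2.5-flash", 0.0, 90.78),
    TestRun("mistral", "mistral-medium-2505", 0.0, 84.58),
    TestRun("openai", "gpt-5.1-2025-11-13", 1.0, 95.68),
]

# The first element of the ranking is the best-scoring run.
best = rank_runs(runs)[0]
print(best.model)
```

Keeping the temperature in the record matters for comparison, because the runs on this page do not all use the same setting.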
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
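The prompt asks for a JSON object with three keys, and the run names `CorrectedAdvert` as its dataclass. The benchmark's own definition of that class is not shown here, so the following is a minimal sketch under assumptions: the field types and the `parse_response` helper are hypothetical and chosen only to mirror the keys named in the prompt.

```python
import json
from dataclasses import dataclass


@dataclass
class CorrectedAdvert:
    """Assumed shape: one field per key requested in the prompt."""
    fixed_xml: str
    number_of_fixes: int
    explanation: str


def parse_response(raw: str):
    """Parse a model reply; return None if it is not the requested JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    required = ("fixed_xml", "number_of_fixes", "explanation")
    if not isinstance(data, dict) or any(key not in data for key in required):
        return None
    try:
        fixes = int(data["number_of_fixes"])
    except (TypeError, ValueError):
        return None
    return CorrectedAdvert(
        fixed_xml=str(data["fixed_xml"]),
        number_of_fixes=fixes,
        explanation=str(data["explanation"]),
    )


# Made-up reply used only to exercise the parser.
reply = '{"fixed_xml": "<ad>To be sold</ad>", "number_of_fixes": 1, "explanation": "Closed the ad tag."}'
advert = parse_response(reply)
invalid = parse_response("not json")
```

Returning `None` for malformed replies keeps the scoring step separate from the parsing step; how the benchmark itself handles such replies is not stated on this page.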
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 90.78 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
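The precision, recall, and F1 columns relate to the TP, FP, and FN counts through the standard micro-averaged definitions. The runs on this page report `n/a` for these columns, so the counts below are invented for illustration, and the `micro_metrics` helper is a hypothetical name rather than part of the benchmark.

```python
def micro_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) from pooled counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Invented counts: 8 true positives, 2 false positives, 2 false negatives.
precision, recall, f1 = micro_metrics(tp=8, fp=2, fn=2)
```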
| Pricing Date: n/a, n/a. | Tokens: 31.5K IT + 74.8K OT = 106.3K TT | Cost: 0.009$ + 0.187$ = 0.196$ |
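The token and cost cells each follow an `a + b = c` pattern, which makes them easy to check mechanically. The sketch below assumes that IT, OT, and TT stand for input, output, and total tokens; the helper names and the tolerances are illustrative choices, not part of the benchmark.

```python
import re


def parse_amounts(cell: str):
    """Extract the numeric values from an 'a + b = c' summary cell, in order."""
    return [float(value) for value in re.findall(r"\d+(?:\.\d+)?", cell)]


def is_consistent(cell: str, tolerance: float) -> bool:
    """Check that the first two amounts add up to the third, within a tolerance."""
    first, second, total = parse_amounts(cell)
    return abs(first + second - total) <= tolerance


tokens = "Tokens: 31.5K IT + 74.8K OT = 106.3K TT"
cost = "Cost: 0.009$ + 0.187$ = 0.196$"

tokens_ok = is_consistent(tokens, tolerance=0.11)  # values are rounded to 0.1K
cost_ok = is_consistent(cost, tolerance=0.0011)    # values are rounded to 0.001$
```

A check like this flags rows where the parts and the total disagree, for example a cost cell reading `0.000$ + 0.000$ = 0.003$`.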
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-medium-2505 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 84.58 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 32.3K OT = 72.6K TT | Cost: 0.016$ + 0.065$ = 0.081$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | pixtral-large-2411 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 86.42 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 44.8K IT + 46.4K OT = 91.2K TT | Cost: 0.090$ + 0.278$ = 0.368$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-5-nano-2025-08-07 |
| Temperature | 1.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 94.18 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 33.6K IT + 600.3K OT = 633.9K TT | Cost: 0.002$ + 0.240$ = 0.242$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-5-mini-2025-08-07 |
| Temperature | 1.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 91.12 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 33.7K IT + 176.7K OT = 210.4K TT | Cost: 0.008$ + 0.353$ = 0.362$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-5.1-2025-11-13 |
| Temperature | 1.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 95.68 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 33.6K IT + 36.9K OT = 70.5K TT | Cost: 0.000$ + 0.000$ = 0.000$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | meta-llama/llama-4-maverick |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 67.18 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 29.7K IT + 24.8K OT = 54.5K TT | Cost: 0.000$ + 0.000$ = 0.003$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-4.1-mini-2025-04-14 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 90.13 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 33.9K IT + 61.1K OT = 95.0K TT | Cost: 0.000$ + 0.000$ = 0.000$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | x-ai/grok-4 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 95.09 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 68.1K IT + 452.5K OT = 520.7K TT | Cost: 0.000$ + 0.000$ = 0.143$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | qwen/qwen3-vl-8b-thinking |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 1.91 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 21.0K IT + 23.2K OT = 44.2K TT | Cost: 0.000$ + 0.000$ = 0.000$ |