A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
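A test run includes the benchmark tags, the provider and model, the temperature, the response dataclass, the prompt, the resulting scores, and the token usage and cost.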
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
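As a rough illustration of what one run carries, the record can be sketched as a small Python dataclass. The class name `RunRecord` and all field names are hypothetical and chosen only for this sketch; they are not taken from the benchmark's code. The values are copied from the first run listed on this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical container for one test run (field names are assumptions)."""
    provider: str
    model: str
    temperature: float
    dataclass_name: str
    tags: dict = field(default_factory=dict)
    fuzzy_score: Optional[float] = None  # as reported on the page
    cost_usd: Optional[float] = None     # total cost as reported on the page


# Values copied from the first run on this page.
run = RunRecord(
    provider="x-ai",
    model="grok-4.3",
    temperature=0.0,
    dataclass_name="CorrectedAdvert",
    tags={
        "document-type": ["newspaper-page"],
        "century": [18],
        "language": ["en"],
        "task": ["data-correction"],
    },
    fuzzy_score=98.60,
    cost_usd=0.116,
)
print(run.model, run.fuzzy_score)
```

Keeping the configuration (provider, model, temperature) next to the outcome (score, cost) in one record is what makes two runs directly comparable.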
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | x-ai |
| Model | grok-4.3 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
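The prompt asks for a JSON object with the keys `fixed_xml`, `number_of_fixes`, and `explanation`, and the run names `CorrectedAdvert` as its dataclass. A minimal sketch of how such a response could be parsed is shown here. The field types, the `parse_response` helper, and the sample response string are assumptions made for illustration; the benchmark's actual definition is not shown on this page.

```python
import json
from dataclasses import dataclass


@dataclass
class CorrectedAdvert:
    # Key names come from the prompt; the types are assumed.
    fixed_xml: str
    number_of_fixes: int
    explanation: str


def parse_response(raw: str) -> CorrectedAdvert:
    """Parse a model response; raises KeyError or ValueError if it is malformed."""
    data = json.loads(raw)
    return CorrectedAdvert(
        fixed_xml=str(data["fixed_xml"]),
        number_of_fixes=int(data["number_of_fixes"]),
        explanation=str(data["explanation"]),
    )


# Invented sample response, for illustration only.
sample = (
    '{"fixed_xml": "<ad><item>Tea</item></ad>", '
    '"number_of_fixes": 1, '
    '"explanation": "Closed the item tag."}'
)
advert = parse_response(sample)
print(advert.number_of_fixes)
```

A response that is not valid JSON or lacks one of the three keys raises an exception in this sketch, which a caller could catch and treat as an unusable answer.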
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 98.60 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.7K IT + 26.3K OT = 66.9K TT | Cost: 0.051$ + 0.066$ = 0.116$ |
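The usage line can be read mechanically to check that the parts add up to the totals. The sketch below assumes that IT, OT, and TT stand for input, output, and total tokens, and that `K` and `M` mean thousands and millions; the function name `parse_usage` and the dictionary keys are hypothetical.

```python
import re

SCALE = {"": 1, "K": 1_000, "M": 1_000_000}
TOKENS = re.compile(
    r"Tokens: ([\d.]+)([KM]?) IT \+ ([\d.]+)([KM]?) OT = ([\d.]+)([KM]?) TT"
)
COST = re.compile(r"Cost: ([\d.]+)\$ \+ ([\d.]+)\$ = ([\d.]+)\$")


def parse_usage(line: str) -> dict:
    """Extract token counts and costs from one usage line."""
    t = TOKENS.search(line)
    c = COST.search(line)
    if t is None or c is None:
        raise ValueError("unrecognized usage line")
    g = t.groups()
    # Each count is a (number, unit) pair, e.g. ("40.7", "K").
    tokens = [float(g[i]) * SCALE[g[i + 1]] for i in (0, 2, 4)]
    costs = [float(x) for x in c.groups()]
    return {
        "input_tokens": tokens[0],
        "output_tokens": tokens[1],
        "total_tokens": tokens[2],
        "input_cost": costs[0],
        "output_cost": costs[1],
        "total_cost": costs[2],
    }


line = ("| Pricing Date: n/a, n/a. | Tokens: 40.7K IT + 26.3K OT = 66.9K TT "
        "| Cost: 0.051$ + 0.066$ = 0.116$ |")
usage = parse_usage(line)

# Difference between the sum of the parts and the reported total.
token_gap = usage["input_tokens"] + usage["output_tokens"] - usage["total_tokens"]
cost_gap = usage["input_cost"] + usage["output_cost"] - usage["total_cost"]
print(round(token_gap), round(cost_gap, 3))
```

For this line the parts exceed the reported totals by about 100 tokens and 0.001$, which is the size of discrepancy one would expect if each figure is rounded independently for display.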
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openrouter |
| Model | stepfun/step-3.7-flash-20260528 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 86.28 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 31.7K IT + 1.1M OT = 1.1M TT | Cost: 0.006$ + 1.240$ = 1.247$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openrouter |
| Model | meta-llama/llama-4-scout-17b-16e-instruct |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 56.61 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 29.1K IT + 17.5K OT = 46.6K TT | Cost: 0.002$ + 0.005$ = 0.008$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-large-2512 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 56.95 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 24.3K IT + 21.9K OT = 46.2K TT | Cost: 0.012$ + 0.033$ = 0.045$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-small-2506 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 91.76 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 29.9K OT = 70.2K TT | Cost: 0.004$ + 0.009$ = 0.013$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openrouter |
| Model | qwen/qwen3.7-plus-20260602 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 79.93 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 38.6K IT + 323.0K OT = 361.7K TT | Cost: 0.015$ + 0.517$ = 0.532$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | pixtral-large-2411 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 47.99 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 22.8K IT + 20.1K OT = 42.8K TT | Cost: 0.046$ + 0.120$ = 0.166$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-medium-2508 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 85.19 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 34.3K OT = 74.6K TT | Cost: 0.016$ + 0.069$ = 0.085$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-medium-2505 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 83.61 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 33.3K OT = 73.7K TT | Cost: 0.016$ + 0.067$ = 0.083$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | magistral-medium-2509 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 11.74 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 5.1K IT + 4.1K OT = 9.2K TT | Cost: 0.010$ + 0.021$ = 0.031$ |
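Since all runs on this page share the same task, prompt, and temperature, they can be compared directly on fuzzy score and total cost. The sketch below copies both figures from the runs above, ranks the models by fuzzy score, and picks the cheapest run at or above a threshold. The threshold of 80 and the variable names are arbitrary choices for this illustration, not part of the benchmark.

```python
# (model, fuzzy score, total cost in $), copied from the runs on this page.
runs = [
    ("grok-4.3", 98.60, 0.116),
    ("stepfun/step-3.7-flash-20260528", 86.28, 1.247),
    ("meta-llama/llama-4-scout-17b-16e-instruct", 56.61, 0.008),
    ("mistral-large-2512", 56.95, 0.045),
    ("mistral-small-2506", 91.76, 0.013),
    ("qwen/qwen3.7-plus-20260602", 79.93, 0.532),
    ("pixtral-large-2411", 47.99, 0.166),
    ("mistral-medium-2508", 85.19, 0.085),
    ("mistral-medium-2505", 83.61, 0.083),
    ("magistral-medium-2509", 11.74, 0.031),
]

# Rank by fuzzy score, best first.
ranked = sorted(runs, key=lambda r: r[1], reverse=True)

# Cheapest run whose fuzzy score reaches the (arbitrary) threshold.
THRESHOLD = 80.0
good_enough = [r for r in runs if r[1] >= THRESHOLD]
cheapest = min(good_enough, key=lambda r: r[2])

for model, score, cost in ranked:
    print(f"{score:6.2f}  {cost:6.3f}$  {model}")
print("cheapest at or above threshold:", cheapest[0])
```

On these figures, grok-4.3 ranks first by fuzzy score, while mistral-small-2506 is the cheapest run that clears the threshold.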