A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
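A test run includes the benchmark tags, the provider and model, the temperature, the dataclass name, the prompt, the resulting scores, and the token usage with its cost.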
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
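As a rough illustration of what such a record might hold, here is a minimal sketch in Python. The class name `BenchmarkRun`, its field names, and the `config_key` helper are assumptions made for this example; they are not taken from the benchmark's own code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchmarkRun:
    """Hypothetical record for one test run; all field names are assumptions."""
    provider: str
    model: str
    temperature: float
    dataclass_name: str
    tags: dict = field(default_factory=dict)
    fuzzy_score: Optional[float] = None   # None when no score is available
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None

    def config_key(self) -> tuple:
        """Identify the model configuration so runs can be grouped and compared."""
        return (self.provider, self.model, self.temperature)


# Values copied from the first record listed on this page.
run = BenchmarkRun(
    provider="mistral",
    model="magistral-medium-2509",
    temperature=0.0,
    dataclass_name="CorrectedAdvert",
    tags={
        "document-type": ["newspaper-page"],
        "century": [18],
        "language": ["en"],
        "task": ["data-correction"],
    },
    fuzzy_score=53.74,
)
print(run.config_key())
```

Grouping runs by a key like this is one way to keep repeated runs of the same configuration together when comparing models.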
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | magistral-medium-2509 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 53.74 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 30.7K OT = 71.1K TT | Cost: 0.081$ + 0.154$ = 0.234$ |
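The prompt in this record asks for a JSON reply with the keys `fixed_xml`, `number_of_fixes`, and `explanation`, and the record names `CorrectedAdvert` as its dataclass. The sketch below shows one way such a reply could be parsed into a dataclass of that name. The field types, the `parse_response` helper, and the sample reply are assumptions for illustration, not the benchmark's actual implementation.

```python
import json
from dataclasses import dataclass


@dataclass
class CorrectedAdvert:
    # Field names follow the keys requested in the prompt; the types are assumed.
    fixed_xml: str
    number_of_fixes: int
    explanation: str


def parse_response(raw: str) -> CorrectedAdvert:
    """Parse a JSON reply; raises ValueError or KeyError on malformed input."""
    data = json.loads(raw)
    return CorrectedAdvert(
        fixed_xml=str(data["fixed_xml"]),
        number_of_fixes=int(data["number_of_fixes"]),
        explanation=str(data["explanation"]),
    )


# Made-up reply for illustration only; it does not come from any run on this page.
example = (
    '{"fixed_xml": "<ad><p>To be sold</p></ad>", '
    '"number_of_fixes": 1, '
    '"explanation": "Closed the p tag."}'
)
advert = parse_response(example)
print(advert.number_of_fixes, advert.explanation)
```

A reply that is not valid JSON or lacks one of the three keys raises an exception here, which a harness could catch and report as an invalid result.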
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | genai |
| Model | gemini-2.0-flash-lite |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 50.75 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 35.4K IT + 196.2K OT = 231.6K TT | Cost: 0.003$ + 0.059$ = 0.062$ |
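The token and cost line of each record is a pair of sums. The sketch below reproduces the arithmetic for the gemini-2.0-flash-lite record above, assuming that IT, OT, and TT stand for input, output, and total tokens; the function names are hypothetical. The token counts are the rounded values as displayed, not exact counts.

```python
def format_count(n: int) -> str:
    """Render a token count in the abbreviated K/M style used in the records."""
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}M"
    return f"{n / 1_000:.1f}K"


def summarize_usage(input_tokens: int, output_tokens: int,
                    input_cost: float, output_cost: float) -> dict:
    """Add up tokens and cost for one run; cost is rounded to three decimals."""
    return {
        "total_tokens": input_tokens + output_tokens,
        "total_cost": round(input_cost + output_cost, 3),
    }


# Rounded figures as displayed for the gemini-2.0-flash-lite record.
it, ot = 35_400, 196_200
usage = summarize_usage(it, ot, input_cost=0.003, output_cost=0.059)

print(
    f"Tokens: {format_count(it)} IT + {format_count(ot)} OT = "
    f"{format_count(usage['total_tokens'])} TT | "
    f"Cost: {usage['total_cost']:.3f}$"
)
```

Because the displayed figures are rounded, a total recomputed from them can differ from the displayed total in the last digit, as in the first record, where 40.3K + 30.7K is shown as 71.1K.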
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | genai |
| Model | gemini-2.5-flash-lite |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 63.59 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 31.5K IT + 1.0M OT = 1.0M TT | Cost: 0.003$ + 0.401$ = 0.405$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | anthropic |
| Model | claude-opus-4-20250514 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 96.51 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 60.4K IT + 36.0K OT = 96.5K TT | Cost: 0.907$ + 2.701$ = 3.608$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-4.1-nano-2025-04-14 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 93.35 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 33.8K IT + 25.4K OT = 59.1K TT | Cost: 0.000$ + 0.000$ = 0.000$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-small-2506 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 91.73 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 30.3K OT = 70.6K TT | Cost: 0.004$ + 0.009$ = 0.013$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-4o-mini-2024-07-18 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 91.81 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 33.9K IT + 56.2K OT = 90.0K TT | Cost: 0.000$ + 0.000$ = 0.000$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-medium-2508 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 83.67 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 31.8K OT = 72.1K TT | Cost: 0.016$ + 0.064$ = 0.080$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | anthropic |
| Model | claude-sonnet-4-20250514 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 96.99 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 60.4K IT + 35.2K OT = 95.7K TT | Cost: 0.181$ + 0.528$ = 0.710$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | genai |
| Model | gemini-2.5-pro |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 91.54 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 31.5K IT + 28.6K OT = 60.1K TT | Cost: 0.039$ + 0.286$ = 0.325$ |
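To compare the runs listed on this page side by side, the fuzzy scores and total costs can be collected and sorted. The sketch below copies those figures from the records above and ranks the models by fuzzy score; the tuple layout is an assumption for the example. A cost of 0.000$ is reproduced as shown and may simply reflect missing pricing data, since every pricing date is n/a.

```python
# (model, fuzzy score, total cost in $), copied from the records on this page.
runs = [
    ("magistral-medium-2509", 53.74, 0.234),
    ("gemini-2.0-flash-lite", 50.75, 0.062),
    ("gemini-2.5-flash-lite", 63.59, 0.405),
    ("claude-opus-4-20250514", 96.51, 3.608),
    ("gpt-4.1-nano-2025-04-14", 93.35, 0.000),
    ("mistral-small-2506", 91.73, 0.013),
    ("gpt-4o-mini-2024-07-18", 91.81, 0.000),
    ("mistral-medium-2508", 83.67, 0.080),
    ("claude-sonnet-4-20250514", 96.99, 0.710),
    ("gemini-2.5-pro", 91.54, 0.325),
]

# Highest fuzzy score first.
ranked = sorted(runs, key=lambda r: r[1], reverse=True)

for model, fuzzy, cost in ranked:
    print(f"{model:<28} {fuzzy:6.2f} {cost:7.3f}$")
```

Sorting on the cost column instead (`key=lambda r: r[2]`) gives the cheapest runs first, which is useful when weighing score against price.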