A test run is a single execution of a benchmark test using a defined model configuration.
Each run represents how a particular large language model (LLM) — such as GPT-4, Claude-3, or Gemini — performed on a given task at a specific time, with specific settings.
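Each test run listed here includes the test's tags, the provider, model, temperature, and dataclass, the prompt, the scores, and the token usage and cost.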
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
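To make the idea of a run as "one execution with a fixed configuration" concrete, here is a minimal sketch of a run record. The class name `BenchmarkRun`, its field names, and the `config_key` helper are assumptions made for illustration; they are not taken from the benchmark's own code. The example values come from one of the runs listed on this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BenchmarkRun:
    """One benchmark execution with a fixed model configuration (hypothetical schema)."""
    provider: str
    model: str
    temperature: float
    dataclass_name: str
    fuzzy_score: float  # reported on a 0-100 scale
    tags: dict = field(default_factory=dict)

    def config_key(self) -> tuple:
        # Runs that share this key used the same provider, model, and temperature.
        return (self.provider, self.model, self.temperature)


run = BenchmarkRun(
    provider="mistral",
    model="mistral-medium-2505",
    temperature=0.0,
    dataclass_name="CorrectedAdvert",
    fuzzy_score=83.61,
    tags={"document-type": ["newspaper-page"], "century": [18],
          "language": ["en"], "task": ["data-correction"]},
)
print(run.config_key())
```

Keeping the configuration in an immutable record like this is one way to make two runs comparable: if their keys match, any score difference comes from something other than the listed settings.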
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openrouter |
| Model | qwen/qwen3.7-plus-20260602 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
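The prompt above asks for a JSON object with three keys, and the run names `CorrectedAdvert` as its dataclass. A minimal sketch of how such a reply could be parsed and validated follows. The field types, the `parse_response` helper, and the sample reply are assumptions for illustration; the benchmark's actual `CorrectedAdvert` definition may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrectedAdvert:
    # Field names follow the keys requested in the prompt; the types are assumed.
    fixed_xml: str
    number_of_fixes: int
    explanation: str


def parse_response(raw: str) -> Optional[CorrectedAdvert]:
    """Return a CorrectedAdvert, or None if the reply is not usable JSON."""
    try:
        data = json.loads(raw)
        return CorrectedAdvert(
            fixed_xml=str(data["fixed_xml"]),
            number_of_fixes=int(data["number_of_fixes"]),
            explanation=str(data["explanation"]),
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed JSON, a missing key, or a non-numeric fix count.
        return None


# Invented sample reply, not taken from any run on this page.
reply = ('{"fixed_xml": "<ad><title>To be sold</title></ad>", '
         '"number_of_fixes": 1, "explanation": "Closed the title tag."}')
parsed = parse_response(reply)
print(parsed)
```

Returning `None` instead of raising keeps a single malformed reply from aborting a whole run.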
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 79.93 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 38.6K IT + 323.0K OT = 361.7K TT | Cost: 0.015$ + 0.517$ = 0.532$ |
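The token and cost line is plain addition: reading IT, OT, and TT as input, output, and total tokens, TT = IT + OT, and the total cost is the input cost plus the output cost. The sketch below reproduces that arithmetic and backs out the per-million-token prices implied by the rounded figures. The function names are hypothetical, and the implied prices are derived from rounded values rather than from any published price list.

```python
def total_tokens(input_tokens: float, output_tokens: float) -> float:
    """TT = IT + OT."""
    return input_tokens + output_tokens


def implied_price_per_million(cost_usd: float, tokens: float) -> float:
    """Back out a per-million-token price from a reported cost and token count."""
    return cost_usd / tokens * 1_000_000


def run_cost(input_tokens, output_tokens, input_price, output_price):
    """Return (input cost, output cost, total cost) for per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost, output_cost, input_cost + output_cost


# Rounded figures from the run above: 38.6K IT, 323.0K OT, 0.015$ and 0.517$.
it, ot = 38_600, 323_000
input_price = implied_price_per_million(0.015, it)
output_price = implied_price_per_million(0.517, ot)
in_cost, out_cost, total = run_cost(it, ot, input_price, output_price)
print(f"{total_tokens(it, ot) / 1000:.1f}K tokens, {total:.3f}$")
```

The rounded inputs sum to 361.6K, while the page reports 361.7K; the difference is presumably because the page adds the unrounded counts and rounds afterwards.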
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-medium-2505 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 83.61 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 33.3K OT = 73.7K TT | Cost: 0.016$ + 0.067$ = 0.083$ |
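The page does not say how the Fuzzy Score is computed. As a rough illustration of what a 0-100 fuzzy similarity between a model's corrected XML and a reference can look like, the sketch below uses the ratio from Python's standard `difflib.SequenceMatcher`. This is a stand-in chosen for illustration, not the benchmark's metric, and the `fuzzy_score` name and sample strings are invented.

```python
from difflib import SequenceMatcher


def fuzzy_score(predicted: str, expected: str) -> float:
    """Similarity of two strings on a 0-100 scale (stand-in metric)."""
    return 100.0 * SequenceMatcher(None, predicted, expected).ratio()


# Invented strings: the prediction is missing one full stop.
expected = "<ad><title>To be sold.</title></ad>"
predicted = "<ad><title>To be sold</title></ad>"
print(round(fuzzy_score(predicted, expected), 2))
```

Under a metric of this kind, identical strings score 100 and strings with nothing in common score 0, so a value such as 83.61 would indicate a mostly correct output with some deviations.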
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | mistral-medium-3.5 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 96.09 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 40.3K IT + 31.8K OT = 72.1K TT | Cost: 0.060$ + 0.239$ = 0.299$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | anthropic |
| Model | claude-opus-4-8 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 95.04 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 81.4K IT + 58.0K OT = 139.5K TT | Cost: 0.407$ + 1.451$ = 1.858$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | mistral |
| Model | pixtral-large-2411 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 47.99 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 22.8K IT + 20.1K OT = 42.8K TT | Cost: 0.046$ + 0.120$ = 0.166$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | genai |
| Model | gemini-3.5-flash |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 80.61 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 31.5K IT + 515.5K OT = 547.0K TT | Cost: 0.047$ + 4.639$ = 4.687$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openai |
| Model | gpt-5.5-2026-04-23 |
| Temperature | 1.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 98.48 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 33.6K IT + 62.5K OT = 96.1K TT | Cost: 0.168$ + 1.875$ = 2.043$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | deepseek |
| Model | deepseek-v4-pro |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 97.42 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 39.4K IT + 412.5K OT = 452.0K TT | Cost: 0.006$ + 1.436$ = 1.441$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | deepseek |
| Model | deepseek-v4-flash |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 97.01 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 39.4K IT + 209.8K OT = 249.3K TT | Cost: 0.001$ + 0.059$ = 0.060$ |
{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
| Provider | openrouter |
| Model | qwen/qwen3.5-9b-20260310 |
| Temperature | 0.0 |
| Dataclass | CorrectedAdvert |
| Normalized Score | 100.00 % |
| Test time | unknown seconds |
Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
no valid result
| Fuzzy Score | F1 micro | F1 macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| 18.80 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 35.1K IT + 2.0M OT = 2.1M TT | Cost: 0.004$ + 0.303$ = 0.306$ |
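Since the purpose of test runs is comparison across models and configurations, the sketch below ranks the ten runs listed on this page by Fuzzy Score and picks out those that no other run beats on both score and cost. The scores and total costs are copied from the runs above; the tuple layout and the `pareto_front` helper are assumptions for illustration.

```python
# (model, fuzzy score, total cost in $), copied from the runs on this page.
runs = [
    ("qwen/qwen3.7-plus-20260602", 79.93, 0.532),
    ("mistral-medium-2505", 83.61, 0.083),
    ("mistral-medium-3.5", 96.09, 0.299),
    ("claude-opus-4-8", 95.04, 1.858),
    ("pixtral-large-2411", 47.99, 0.166),
    ("gemini-3.5-flash", 80.61, 4.687),
    ("gpt-5.5-2026-04-23", 98.48, 2.043),
    ("deepseek-v4-pro", 97.42, 1.441),
    ("deepseek-v4-flash", 97.01, 0.060),
    ("qwen/qwen3.5-9b-20260310", 18.80, 0.306),
]


def pareto_front(runs):
    """Models for which no cheaper-or-equal run has an equal or higher score."""
    front, best = [], float("-inf")
    # Walk from cheapest to most expensive; keep a run only if it raises the best score.
    for model, score, cost in sorted(runs, key=lambda r: (r[2], -r[1])):
        if score > best:
            front.append(model)
            best = score
    return front


by_score = sorted(runs, key=lambda r: r[1], reverse=True)
print([model for model, _, _ in by_score[:3]])
print(pareto_front(runs))
```

When reading such a ranking, note that the gpt-5.5-2026-04-23 run used temperature 1.0 while the other runs used 0.0, so the configurations are not identical.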