A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
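A test run includes the benchmark tags, the model configuration (provider, model, temperature, dataclass), the prompt, the resulting scores, and the token usage and cost.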
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
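As a rough illustration of such a comparison, the sketch below averages the normalized scores of the runs listed on this page per model. The `RunRecord` class and the `mean_score_by_model` helper are hypothetical names introduced for this example and are not part of the benchmark's own tooling; only the provider, model, temperature, and score values are copied from the run tables.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one test run; field names are assumptions."""
    provider: str
    model: str
    temperature: float
    normalized_score: float  # percent, as shown in the run tables


def mean_score_by_model(runs):
    """Group runs by model name and average their normalized scores."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.model].append(run.normalized_score)
    return {model: mean(scores) for model, scores in grouped.items()}


# Values copied from the run tables on this page, in page order.
RUNS = [
    RunRecord("openai", "gpt-5.1-2025-11-13", 0.0, 59.0),
    RunRecord("genai", "gemini-2.5-flash", 0.0, 54.0),
    RunRecord("genai", "gemini-2.5-flash", 0.0, 51.0),
    RunRecord("genai", "gemini-2.5-flash", 0.0, 45.0),
    RunRecord("openrouter", "qwen/qwen3-vl-8b-thinking", 0.0, 12.0),
    RunRecord("openrouter", "meta-llama/llama-4-maverick", 0.0, 48.0),
    RunRecord("openrouter", "qwen/qwen3-vl-8b-thinking", 0.0, 7.0),
    RunRecord("openrouter", "qwen/qwen3-vl-8b-thinking", 0.0, 29.0),
    RunRecord("openrouter", "meta-llama/llama-4-maverick", 0.0, 38.0),
    RunRecord("openrouter", "meta-llama/llama-4-maverick", 0.0, 46.0),
]

if __name__ == "__main__":
    ranking = sorted(
        mean_score_by_model(RUNS).items(), key=lambda item: item[1], reverse=True
    )
    for model, score in ranking:
        print(f"{model}: {score:.1f} %")
```

The runs may belong to different benchmark items, so these per-model averages are only a coarse summary and not a substitute for comparing runs on the same test.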
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5.1-2025-11-13 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 59.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.54 | 0.59 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 72.9K IT + 2.0K OT = 75.0K TT | Cost: 0.091$ + 0.020$ = 0.111$ |
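The prompt above asks for a JSON object with four keys, list values, ISO dates, and "null" for missing information. A minimal sketch of how a response could be checked against those rules follows; `validate_response` is a hypothetical helper written for this illustration, and the benchmark's actual validation logic is not shown on this page and may differ.

```python
import json
import re
from datetime import date

# Keys requested by the prompt.
EXPECTED_KEYS = ("letter_title", "sender_persons", "send_date", "receiver_persons")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate_response(text):
    """Return a list of problems in a model response; an empty list means valid."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key in EXPECTED_KEYS:
        if key not in data:
            problems.append(f"missing key: {key}")
            continue
        values = data[key]
        if values == "null":  # the prompt's placeholder for absent information
            continue
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            problems.append(f"{key}: expected a list of strings")
            continue
        if key == "send_date":
            for value in values:
                if value != "null" and not is_iso_date(value):
                    problems.append(f"send_date: not YYYY-MM-DD: {value!r}")
    return problems
```

One detail worth noting: the EXAMPLE in the prompt ends its last entry with a trailing comma, which strict JSON parsers such as Python's `json` module reject, so a response that copies the example's punctuation literally would fail this check.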
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 54.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.52 | 0.54 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 44.4K IT + 3.8K OT = 48.2K TT | Cost: 0.013$ + 0.009$ = 0.023$ |
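The usage line of each run reports rounded token counts and costs, so the displayed parts do not always add up exactly to the displayed total (here 0.013$ + 0.009$ is shown as 0.023$). The sketch below parses such a line and checks that the sums agree within rounding tolerance. It reads IT, OT, and TT as input, output, and total tokens, which is an assumption, and `check_usage` is a hypothetical helper, not part of the benchmark tooling.

```python
import re

# Matches the "Tokens: ... | Cost: ..." part of a usage line as shown on this page.
USAGE_PATTERN = re.compile(
    r"Tokens: (?P<it>[\d.]+)K IT \+ (?P<ot>[\d.]+)K OT = (?P<tt>[\d.]+)K TT"
    r" \| Cost: (?P<ic>[\d.]+)\$ \+ (?P<oc>[\d.]+)\$ = (?P<tc>[\d.]+)\$"
)

# Each displayed value is rounded, so a sum of two rounded parts can differ
# from the rounded total by about one unit in the last displayed digit.
TOKEN_TOLERANCE_K = 0.15
COST_TOLERANCE = 0.0015


def check_usage(line):
    """Parse a usage line and report whether its sums are self-consistent."""
    match = USAGE_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not match the expected usage format")
    v = {name: float(text) for name, text in match.groupdict().items()}
    return {
        "tokens_consistent": abs(v["it"] + v["ot"] - v["tt"]) <= TOKEN_TOLERANCE_K,
        "cost_consistent": abs(v["ic"] + v["oc"] - v["tc"]) <= COST_TOLERANCE,
        # Share of the displayed total cost attributed to the second (OT) term.
        "output_cost_share": v["oc"] / v["tc"] if v["tc"] else 0.0,
    }


if __name__ == "__main__":
    line = (
        "| Pricing Date: n/a, n/a. | Tokens: 44.4K IT + 3.8K OT = 48.2K TT "
        "| Cost: 0.013$ + 0.009$ = 0.023$ |"
    )
    print(check_usage(line))
```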
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 51.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.48 | 0.51 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 27.0K IT + 2.2K OT = 29.2K TT | Cost: 0.008$ + 0.006$ = 0.014$ |
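The score tables report both micro- and macro-averaged F1, and the two can differ noticeably for the same run. The sketch below shows the standard definitions computed from per-field TP/FP/FN counts. The counts used are invented purely for illustration, and the grouping by field is an assumption: the benchmark may average over documents or another unit, and its exact scoring procedure is not described on this page.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(counts):
    """Return (micro F1, macro F1) for a mapping of group name to (TP, FP, FN)."""
    if not counts:
        return 0.0, 0.0
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    total_fn = sum(fn for _, _, fn in counts.values())
    # Micro: pool all counts first, then compute a single F1.
    micro = f1(total_tp, total_fp, total_fn)
    # Macro: compute F1 per group, then take the unweighted mean.
    macro = sum(f1(*group) for group in counts.values()) / len(counts)
    return micro, macro


if __name__ == "__main__":
    # Hypothetical counts, not taken from any run on this page.
    example_counts = {
        "sender_persons": (3, 1, 1),
        "receiver_persons": (1, 2, 2),
    }
    micro, macro = micro_macro_f1(example_counts)
    print(f"micro F1 = {micro:.2f}, macro F1 = {macro:.2f}")
```

Micro averaging weights every extracted item equally, so groups with many items dominate; macro averaging weights every group equally, so a poorly handled small group pulls the score down more.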
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 45.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.52 | 0.45 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 17.4K IT + 1.7K OT = 19.2K TT | Cost: 0.005$ + 0.004$ = 0.010$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | qwen/qwen3-vl-8b-thinking |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 12.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.10 | 0.12 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: 5 months ago, 2025-10-17. | Tokens: 101.5K IT + 19.1K OT = 120.6K TT | Cost: 0.018$ + 0.040$ = 0.058$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | meta-llama/llama-4-maverick |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 48.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.45 | 0.48 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: 5 months ago, 2025-10-17. | Tokens: 128.7K IT + 2.0K OT = 130.7K TT | Cost: 0.019$ + 0.001$ = 0.021$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | qwen/qwen3-vl-8b-thinking |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 7.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.06 | 0.07 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: 5 months ago, 2025-10-17. | Tokens: 160.2K IT + 14.0K OT = 174.2K TT | Cost: 0.029$ + 0.029$ = 0.058$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | qwen/qwen3-vl-8b-thinking |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 29.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.19 | 0.29 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: 5 months ago, 2025-10-17. | Tokens: 58.7K IT + 11.0K OT = 69.7K TT | Cost: 0.011$ + 0.023$ = 0.034$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | meta-llama/llama-4-maverick |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 38.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.47 | 0.38 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: 5 months ago, 2025-10-17. | Tokens: 73.9K IT + 1.5K OT = 75.4K TT | Cost: 0.011$ + 0.001$ = 0.012$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | meta-llama/llama-4-maverick |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 46.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.47 | 0.46 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: 5 months ago, 2025-10-17. | Tokens: 203.7K IT + 3.6K OT = 207.3K TT | Cost: 0.031$ + 0.002$ = 0.033$ |