A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.
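A test run includes the benchmark tags, the provider and model, the temperature, the dataclass, the normalized score, the test time, the prompt, the model output, the scoring metrics, and the token usage and cost.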
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
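As a rough illustration of what such a record might hold, the sketch below models a test run as a small Python dataclass and picks the best-scoring run from a handful of runs. The class name `RunRecord`, its field names, and the `best_run` helper are assumptions made for this example, not the benchmark's actual schema; the values are taken from runs listed further down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one benchmark execution (field names are assumptions)."""
    provider: str
    model: str
    temperature: float
    normalized_score: Optional[float]  # percent; None when the run produced no score


def best_run(runs: list[RunRecord]) -> Optional[RunRecord]:
    """Return the scored run with the highest normalized score, or None if none is scored."""
    scored = [r for r in runs if r.normalized_score is not None]
    return max(scored, key=lambda r: r.normalized_score, default=None)


runs = [
    RunRecord("openrouter", "x-ai/grok-4", 0.0, 11.0),
    RunRecord("mistral", "magistral-small-2509", 0.0, None),
    RunRecord("openai", "gpt-5.1-2025-11-13", 0.0, 60.0),
    RunRecord("genai", "gemini-3-flash-preview", 0.0, 72.0),
]

top = best_run(runs)
print(top.model if top else "no scored runs")
```

Runs without a score are skipped rather than treated as zero, so a failed run does not count as a measured result.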
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | x-ai/grok-4 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 11.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
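The prompt above fixes an output contract: a JSON object with four keys, each mapped to a list, with dates in YYYY-MM-DD form and the string "null" for missing information. A minimal sketch of how a response could be checked against that contract follows. The function name `validate_response`, the error messages, and the reading that a missing value is written as the list `["null"]` are assumptions; the benchmark's own validation logic is not shown in this record.

```python
import json
import re
from datetime import datetime

REQUIRED_KEYS = ("letter_title", "sender_persons", "send_date", "receiver_persons")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value: str) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate_response(raw: str) -> list[str]:
    """Return the problems found in a model response; an empty list means it passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key in REQUIRED_KEYS:
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], list):
            problems.append(f"value of {key} is not a list")

    dates = data.get("send_date")
    if isinstance(dates, list):
        for value in dates:
            ok = value == "null" or (isinstance(value, str) and is_iso_date(value))
            if not ok:
                problems.append(f'send_date entry is not YYYY-MM-DD or "null": {value!r}')
    return problems


good = (
    '{"letter_title": ["Petition for Environmental Protection"], '
    '"send_date": ["1993-03-12"], "sender_persons": ["Lisa Simpson"], '
    '"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"]}'
)
print(validate_response(good))               # passes: empty list
print(validate_response("no valid result"))  # fails: not JSON at all
```

A strict parser such as `json.loads` also rejects a trailing comma after the last entry, as written in the prompt's example, so a response that copies the example literally would fail this check.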
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.09 | 0.11 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 54.6K IT + 122.6K OT = 177.3K TT | Cost: 0.164$ + 1.840$ = 2.003$ |
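The score table reports micro and macro F1 alongside micro precision, micro recall, and TP/FP/FN counts. Assuming the standard definitions (the benchmark's own scoring code is not shown here), these quantities relate as sketched below. The grouping used for the macro average (per key or per document) is not stated in the record, and the per-key counts in the example are invented for illustration; they do not come from any run listed here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_macro_f1(counts: dict[str, tuple[int, int, int]]) -> tuple[float, float]:
    """Micro F1 pools the counts of all groups; macro F1 averages the per-group F1 values."""
    if not counts:
        return 0.0, 0.0
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = precision_recall_f1(tp, fp, fn)[2]
    macro = sum(precision_recall_f1(*c)[2] for c in counts.values()) / len(counts)
    return micro, macro


# Invented (TP, FP, FN) counts per extracted key, for illustration only.
example = {
    "sender_persons": (8, 2, 2),
    "receiver_persons": (1, 1, 3),
}
micro, macro = micro_macro_f1(example)
print(f"micro F1 = {micro:.2f}, macro F1 = {macro:.2f}")  # micro F1 = 0.69, macro F1 = 0.57
```

In this run the normalized score matches the second of the two F1 values expressed as a percentage (0.11 against 11.00 %).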
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | magistral-small-2509 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | None % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| None | None | None | None | None | None | None | None | None |
| Pricing Date: None, None. | Tokens: None IT + None OT = None TT | Cost: None$ + None$ = None$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | mistral |
| Model | magistral-small-2509 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | None % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| None | None | None | None | None | None | None | None | None |
| Pricing Date: None, None. | Tokens: None IT + None OT = None TT | Cost: None$ + None$ = None$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5.1-2025-11-13 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 60.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.55 | 0.60 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 72.9K IT + 2.0K OT = 75.0K TT | Cost: 0.091$ + 0.020$ = 0.111$ |
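Reading IT, OT, and TT as input, output, and total tokens, the usage line adds the two token counts into a total and the two cost components into a total cost. Each figure appears to be rounded for display on its own, so the displayed parts need not sum exactly to the displayed total: here 72.9K + 2.0K gives 74.9K against a displayed 75.0K. The sketch below checks such a line for consistency while tolerating that rounding; the function name `sums_within_rounding` and the tolerance of one unit of the last displayed digit are assumptions.

```python
from decimal import Decimal


def sums_within_rounding(a: str, b: str, total: str, unit: str) -> bool:
    """Check that a + b equals total up to one unit of the last displayed digit."""
    diff = abs(Decimal(a) + Decimal(b) - Decimal(total))
    return diff <= Decimal(unit)


# Figures copied from the usage line above (tokens in thousands, cost in dollars).
print(sums_within_rounding("72.9", "2.0", "75.0", "0.1"))        # True
print(sums_within_rounding("0.091", "0.020", "0.111", "0.001"))  # True
```

`Decimal` is used instead of `float` so that values such as 0.1 are compared exactly as displayed.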
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openrouter |
| Model | x-ai/grok-4 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 8.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.07 | 0.08 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 93.7K IT + 200.8K OT = 294.5K TT | Cost: 0.281$ + 3.012$ = 3.293$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | openai |
| Model | gpt-5.1-2025-11-13 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 47.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.57 | 0.47 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 44.0K IT + 1.5K OT = 45.5K TT | Cost: 0.055$ + 0.015$ = 0.070$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-3-flash-preview |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 72.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.84 | 0.72 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 46.6K IT + 1.2K OT = 47.8K TT | Cost: 0.023$ + 0.003$ = 0.027$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 52.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.52 | 0.52 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 44.4K IT + 3.7K OT = 48.2K TT | Cost: 0.013$ + 0.009$ = 0.023$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash-lite-preview-09-2025 |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 51.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.46 | 0.51 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 27.0K IT + 3.6K OT = 30.6K TT | Cost: 0.003$ + 0.001$ = 0.004$ |
{'document-type': ['letter'], 'writing': ['typed', 'handwritten'], 'century': [20], 'language': ['de'], 'layout': ['prose'], 'entry-type': ['person', 'location'], 'task': ['information-extraction']}
| Provider | genai |
| Model | gemini-2.5-flash |
| Temperature | 0.0 |
| Dataclass | Document |
| Normalized Score | 52.00 % |
| Test time | unknown seconds |
IDENTITY and PURPOSE:
You are presented with a series of images constituting a historical letter. Your task is to extract the
values of the following keys from the letter and return them in a JSON file where the values
corresponding to each key should be stored as a list, even if there is only a single value for a
key:
- letter_title: Title of the letter.
- sender_persons: Name(s) of the person(s) who wrote the letter.
- send_date: The exact or approximate date the letter was written.
- receiver_persons: Name(s) of the person(s) who received the letter.
Take a deep breath and think step by step about how to best accomplish this goal. Map out all the claims
and implications on a virtual whiteboard in your mind. Do not use OCR. Use the ISO format YYYY-MM-DD for
dates. If a piece of information is not included in the letter, set the value for the corresponding key
to "null". Do not return anything except the JSON file.
EXAMPLE:
{
"letter_title": ["Petition for Environmental Protection"],
"send_date": ["1993-03-12"],
"sender_persons": ["Lisa Simpson"],
"receiver_persons": ["Mayor Joe Quimby", "Seymour Skinner"],
}
OUTPUT:
no valid result
| Fuzzy Score | F1 Micro | F1 Macro | Micro Precision | Micro Recall | Instances | TP | FP | FN |
| n/a | 0.47 | 0.52 | n/a | n/a | n/a | n/a | n/a | n/a |
| Pricing Date: n/a, n/a. | Tokens: 27.0K IT + 2.1K OT = 29.1K TT | Cost: 0.008$ + 0.005$ = 0.013$ |
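To compare models across the runs listed above, the normalized scores can be grouped by model and averaged. The sketch below does this with the values from these records, skipping runs whose score is `None`. The helper name `mean_score_by_model` is hypothetical, and plain averaging is only one possible aggregation; the records do not state whether all runs used the same test objects, so the averages are indicative at best.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional

# (model, normalized score in percent) for each run listed above; None means no score.
RUNS: list[tuple[str, Optional[float]]] = [
    ("x-ai/grok-4", 11.0),
    ("magistral-small-2509", None),
    ("magistral-small-2509", None),
    ("gpt-5.1-2025-11-13", 60.0),
    ("x-ai/grok-4", 8.0),
    ("gpt-5.1-2025-11-13", 47.0),
    ("gemini-3-flash-preview", 72.0),
    ("gemini-2.5-flash", 52.0),
    ("gemini-2.5-flash-lite-preview-09-2025", 51.0),
    ("gemini-2.5-flash", 52.0),
]


def mean_score_by_model(runs: list[tuple[str, Optional[float]]]) -> dict[str, Optional[float]]:
    """Average the scored runs of each model; models with no scored run map to None."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for model, score in runs:
        scores = grouped[model]  # registers the model even if every run is unscored
        if score is not None:
            scores.append(score)
    return {model: (mean(scores) if scores else None) for model, scores in grouped.items()}


averages = mean_score_by_model(RUNS)
# Scored models first, highest average first; unscored models last.
ranked = sorted(averages.items(), key=lambda kv: (kv[1] is None, -(kv[1] or 0.0)))
for model, score in ranked:
    label = "n/a" if score is None else f"{score:.1f} %"
    print(f"{model}: {label}")
```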