RISE Humanities Data Benchmark

A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.

A test run includes:

Prompt and role definition – what the model was asked to do and from what perspective (e.g. “as a historian”).
Model configuration – provider, model version, temperature, and other generation parameters.
Results – the model’s actual response and its evaluation (scores such as F1 or accuracy).
Usage and cost data – token counts and calculated API costs.
Metadata – information like the test date, benchmark name, and person who executed it.

Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
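As a rough illustration of the components listed above, a test run could be held in a record like the one below. This is a minimal sketch, not the benchmark's actual schema: the class name `RunRecord`, all field names, and the helper `rank_by_score` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical record for one test run; field names are assumptions."""
    test_id: str
    date: str                       # date of the run, e.g. "2026-05-22"
    provider: str
    model: str
    temperature: float
    prompt: str
    role: Optional[str] = None      # e.g. "as a historian"
    scores: dict = field(default_factory=dict)  # e.g. {"fuzzy": 80.61}
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

    @property
    def total_tokens(self) -> int:
        # Total tokens are the sum of input and output tokens.
        return self.input_tokens + self.output_tokens


def rank_by_score(runs, metric="fuzzy"):
    """Return runs sorted from best to worst on the given metric.

    Runs that lack the metric are placed last.
    """
    return sorted(
        runs,
        key=lambda r: r.scores.get(metric, float("-inf")),
        reverse=True,
    )
```

Keeping the configuration, scores, and usage data in one record is what allows runs from different providers to be sorted or filtered on the same terms.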

Result 21 of 97

Test T1060 at 2026-05-22

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	genai
Model	gemini-3.5-flash

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
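The prompt asks for a JSON object with three keys, and the configuration names a dataclass called `CorrectedAdvert`. The sketch below shows one way such a reply might be validated. The field types and the `parse_response` helper are assumptions for illustration; the benchmark's own dataclass may be defined differently.

```python
import json
from dataclasses import dataclass


@dataclass
class CorrectedAdvert:
    # Field names follow the keys requested in the prompt; types are assumed.
    fixed_xml: str
    number_of_fixes: int
    explanation: str


REQUIRED_KEYS = {"fixed_xml", "number_of_fixes", "explanation"}


def parse_response(raw: str) -> CorrectedAdvert:
    """Parse a model reply; raise ValueError if it is not the requested object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response is not a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return CorrectedAdvert(
        fixed_xml=str(data["fixed_xml"]),
        number_of_fixes=int(data["number_of_fixes"]),
        explanation=str(data["explanation"]),
    )
```

A reply that fails this kind of check has no structured content to display, although the source does not say how the benchmark itself handles such cases.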

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
80.61	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a
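For readers unfamiliar with the scoring columns, the sketch below shows how micro precision, recall, and F1 follow from TP, FP, and FN counts, together with one possible string-similarity measure on a 0 to 100 scale. The precision, recall, and F1 formulas are the standard definitions. The `fuzzy_score` function is only an assumption built on Python's `difflib`; the source does not state how the benchmark computes its fuzzy score.

```python
from difflib import SequenceMatcher


def micro_prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from pooled true/false positive counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def fuzzy_score(predicted: str, reference: str) -> float:
    """Illustrative similarity on a 0-100 scale (not the benchmark's scorer)."""
    if not predicted and not reference:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * SequenceMatcher(None, predicted, reference).ratio()
```

A fuzzy measure suits a correction task because a near-miss in the corrected text still earns partial credit, whereas precision and recall need discrete instances to count.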

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 31.5K IT + 515.5K OT = 547.0K TT

Cost: 0.047$ + 4.639$ = 4.687$
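The token and cost lines each show two parts and a total. Because the displayed figures appear to be rounded, the parts can differ from the total in the last digit (here 0.047 + 4.639 = 4.686 against a displayed 4.687). The sketch below checks such lines within a rounding tolerance. The helper names and the tolerance values are assumptions for illustration.

```python
import re

# Multipliers for the abbreviated counts used on this page.
_UNITS = {"": 1, "K": 1_000, "M": 1_000_000}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '31.5K' or '2.0M' to a number."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized count: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def totals_agree(parts, total, tolerance) -> bool:
    """True if the parts sum to the total within a rounding tolerance."""
    return abs(sum(parts) - total) <= tolerance


# Token line: 31.5K IT + 515.5K OT = 547.0K TT
tokens_ok = totals_agree(
    [parse_count("31.5K"), parse_count("515.5K")],
    parse_count("547.0K"),
    tolerance=100,       # counts are shown to 0.1K
)

# Cost line: 0.047$ + 4.639$ = 4.687$
cost_ok = totals_agree([0.047, 4.639], 4.687, tolerance=0.0015)
```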

Cite: Hindermann, Marti, Alberto, et al. (2025). RISE-UNIBAS/humanities_data_benchmark. DOI: 10.5281/zenodo.16941752

Result 22 of 97

Test T1047 at 2026-04-27

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openai
Model	gpt-5.5-2026-04-23

Temperature	1.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
98.48	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 33.6K IT + 62.5K OT = 96.1K TT

Cost: 0.168$ + 1.875$ = 2.043$

Result 23 of 97

Test T1036 at 2026-04-24

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	deepseek
Model	deepseek-v4-pro

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
97.42	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.4K IT + 412.5K OT = 452.0K TT

Cost: 0.006$ + 1.436$ = 1.441$

Result 24 of 97

Test T1035 at 2026-04-24

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	deepseek
Model	deepseek-v4-flash

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
97.01	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.4K IT + 209.8K OT = 249.3K TT

Cost: 0.001$ + 0.059$ = 0.060$

Result 25 of 97

Test T0993 at 2026-04-22

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.5-9b-20260310

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
18.80	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 35.1K IT + 2.0M OT = 2.1M TT

Cost: 0.004$ + 0.303$ = 0.306$

Result 26 of 97

Test T0902 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.6-plus-04-02

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
93.52	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 38.5K IT + 572.4K OT = 610.8K TT

Cost: 0.013$ + 1.116$ = 1.129$

Result 27 of 97

Test T0980 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.5-flash-20260224

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
69.35	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 2.4M OT = 2.4M TT

Cost: 0.003$ + 0.619$ = 0.622$

Result 28 of 97

Test T0967 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.5-plus-20260216

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
95.08	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 428.4K OT = 467.6K TT

Cost: 0.010$ + 0.668$ = 0.679$

Result 29 of 97

Test T1006 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	google/gemma-4-26b-a4b-it-20260403

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
98.21	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 33.0K IT + 33.7K OT = 66.7K TT

Cost: 0.003$ + 0.012$ = 0.014$

Result 30 of 97

Test T0941 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openrouter
Model	qwen/qwen3.5-35b-a3b-20260224

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
59.95	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 37.2K IT + 1.7M OT = 1.8M TT

Cost: 0.006$ + 2.270$ = 2.277$
