RISE Humanities Data Benchmark

A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.

A test run includes:

Prompt and role definition – what the model was asked to do and from what perspective (e.g. “as a historian”).
Model configuration – provider, model version, temperature, and other generation parameters.
Results – the model’s actual response and its evaluation (scores such as F1 or accuracy).
Usage and cost data – token counts and calculated API costs.
Metadata – information like the test date, benchmark name, and person who executed it.

Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
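The five parts of a test run listed above can be pictured as one record. The following is only a minimal sketch: the class name `RunRecord`, its field names, and the `config_key` helper are assumptions made for illustration and are not taken from the benchmark's code base.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical record mirroring the parts of a test run."""
    # Prompt and role definition
    prompt: str
    role: Optional[str]
    # Model configuration
    provider: str
    model: str
    temperature: float
    # Results: raw response plus evaluation scores
    response: Optional[str]
    scores: dict = field(default_factory=dict)
    # Usage and cost data
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    # Metadata
    test_id: str = ""
    date: str = ""
    benchmark: str = ""
    executed_by: str = ""

    def config_key(self) -> tuple:
        """Group runs that share provider, model, and temperature."""
        return (self.provider, self.model, self.temperature)


run = RunRecord(
    prompt="Fix this xml. Add xml-tags if faulty where it makes sense.",
    role=None,
    provider="anthropic",
    model="claude-opus-4-6",
    temperature=0.0,
    response=None,
    scores={"fuzzy": 97.46},
    test_id="T0638",
    date="2026-03-16",
)
print(run.config_key())
```

Keeping configuration, results, and usage in one record is what makes runs comparable: two records with the same `config_key()` differ only in time and outcome.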

Result 41 of 97

Test T0850 at 2026-03-25

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}
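The tag line above has the shape of a Python dict literal, so it can be read back with the standard library. This is a sketch under that assumption; the `matches` helper is hypothetical and only illustrates filtering runs by tag.

```python
import ast

line = ("{'document-type': ['newspaper-page'], 'century': [18], "
        "'language': ['en'], 'task': ['data-correction']}")

# literal_eval only accepts Python literals, so it is safe for untrusted text
tags = ast.literal_eval(line)


def matches(tags: dict, wanted: dict) -> bool:
    """True if every wanted value appears in the tag list of its key."""
    return all(value in tags.get(key, []) for key, value in wanted.items())


print(matches(tags, {"task": "data-correction", "language": "en"}))
```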

Configuration

Provider	alibaba
Model	qwen3.5-27b

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
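The prompt asks for a JSON object with three named keys. A reply can be checked against that contract roughly as follows. This is a minimal sketch; the function name `parse_response` is an assumption, and how the benchmark itself validates replies is not shown on this page.

```python
import json

# Keys requested by the prompt
REQUIRED_KEYS = {"fixed_xml", "number_of_fixes", "explanation"}


def parse_response(raw: str):
    """Return the parsed dict, or None if the reply does not fit the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # The reply must be a JSON object that contains all three keys
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


reply = '{"fixed_xml": "<ad/>", "number_of_fixes": 1, "explanation": "closed tag"}'
print(parse_response(reply) is not None)
```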

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
95.84	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a
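The precision, recall, and F1 columns are reported as n/a for this run. For reference, the standard micro-averaged definitions from TP, FP, and FN counts are sketched below. The counts in the example are made up for illustration and do not come from the benchmark.

```python
def micro_scores(tp: int, fp: int, fn: int):
    """Standard micro precision, recall, and F1 from pooled counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts, not taken from any run on this page
precision, recall, f1 = micro_scores(tp=8, fp=2, fn=2)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```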

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 413.5K OT = 452.7K TT

Cost: 0.012$ + 0.992$ = 1.004$
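The token and cost lines above are simple sums of input and output figures. The sketch below reproduces those sums and derives an implied price per million tokens. The function name `usage_summary` is hypothetical, and the implied rates are computed from the rounded figures shown here, so they are approximations and not official provider pricing.

```python
def usage_summary(in_tokens_k: float, out_tokens_k: float,
                  in_cost: float, out_cost: float) -> dict:
    """Sum tokens (in thousands) and costs, and derive implied rates."""
    return {
        "total_tokens_k": round(in_tokens_k + out_tokens_k, 1),
        "total_cost": round(in_cost + out_cost, 3),
        # Cost per million tokens: cost / (thousands of tokens / 1000)
        "in_rate_per_m": in_cost / (in_tokens_k / 1000),
        "out_rate_per_m": out_cost / (out_tokens_k / 1000),
    }


# Figures from the run above: 39.2K IT, 413.5K OT, 0.012$ and 0.992$
summary = usage_summary(39.2, 413.5, 0.012, 0.992)
print(summary["total_tokens_k"], summary["total_cost"])
```

In this run, output tokens account for almost all of the cost, which is why runs with very long outputs can cost more than runs on nominally pricier models.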

Cite: Hindermann, Marti, Alberto, et al. (2025). RISE-UNIBAS/humanities_data_benchmark. DOI: 10.5281/zenodo.16941752

Result 42 of 97

Test T0837 at 2026-03-25

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	alibaba
Model	qwen3.5-35b-a3b

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
92.93	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 442.2K OT = 481.4K TT

Cost: 0.010$ + 0.884$ = 0.894$


Result 43 of 97

Test T0824 at 2026-03-24

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	alibaba
Model	qwen3.5-plus

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
95.80	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 39.2K IT + 432.7K OT = 472.0K TT

Cost: 0.016$ + 1.038$ = 1.054$


Result 44 of 97

Test T0728 at 2026-03-23

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	x-ai
Model	grok-4.20-0309-reasoning

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
98.61	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 40.4K IT + 26.7K OT = 67.0K TT

Cost: 0.081$ + 0.160$ = 0.241$


Result 45 of 97

Test T0703 at 2026-03-23

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openai
Model	gpt-5.3-codex

Temperature	1.0
Dataclass	not set (=no auto-parsed result)

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
97.35	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 27.9K IT + 29.7K OT = 57.7K TT

Cost: 0.049$ + 0.416$ = 0.465$


Result 46 of 97

Test T0703 at 2026-03-17

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openai
Model	gpt-5.3-codex

Temperature	1.0
Dataclass	not set (=no auto-parsed result)

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
97.12	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 27.9K IT + 30.6K OT = 58.5K TT

Cost: 0.049$ + 0.428$ = 0.477$


Result 47 of 97

Test T0638 at 2026-03-16

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	anthropic
Model	claude-opus-4-6

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
97.46	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 74.1K IT + 41.3K OT = 115.4K TT

Cost: 0.371$ + 1.032$ = 1.403$


Result 48 of 97

Test T0691 at 2026-03-16

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	genai
Model	gemini-3.1-pro-preview

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
94.28	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 31.5K IT + 31.9K OT = 63.5K TT

Cost: 0.063$ + 0.383$ = 0.446$


Result 49 of 97

Test T0650 at 2026-03-16

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	anthropic
Model	claude-sonnet-4-6

Temperature	0.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
96.55	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 74.1K IT + 46.3K OT = 120.4K TT

Cost: 0.222$ + 0.695$ = 0.917$


Result 50 of 97

Test T0662 at 2026-03-16

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration

Provider	openai
Model	gpt-5.4-2026-03-05

Temperature	1.0
Dataclass	CorrectedAdvert

Normalized Score	100.00 %
Test time	unknown seconds

Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring

Fuzzy Score	F1 Micro	F1 Macro	Micro Precision	Micro Recall	Instances	TP	FP	FN
96.78	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a

Costs / Pricing

Pricing Date: n/a, n/a.

Tokens: 33.6K IT + 26.2K OT = 59.8K TT

Cost: 0.084$ + 0.393$ = 0.477$

