RISE Humanities Data Benchmark, 0.5.0-pre1

News

27. March 2025

The Journal of Open Humanities Data (JOHD) published two papers by Maximilian Hindermann, Sorin Marti, Lea Katharina Kasper, and Arno Bosse on the RISE Humanities Data Benchmark, addressing two questions: which large language models perform best on humanities research tasks, and how can we systematically compare their capabilities?

The data paper “The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks” presents a framework for assessing the performance of large language models on humanities-related tasks. The benchmark suite (available on GitHub) includes text and image datasets, prompts, ground truths, and evaluation scripts, and addresses tasks essential to digital humanities work, including document analysis, transcription, and metadata extraction from historical materials.
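As a hypothetical illustration of what such an evaluation might look like, the sketch below compares model predictions against scholar-defined ground truth and reports a per-task accuracy. All function, field, and document names here are invented for this example and are not taken from the RISE repository.

```python
# Minimal sketch of a benchmark-style evaluation: compare model outputs
# against ground-truth records and report the fraction of exact matches.
# All names are illustrative, not from the RISE codebase.

def score_task(predictions, ground_truth):
    """Return the fraction of documents whose predicted metadata
    exactly matches the ground-truth record."""
    matches = sum(
        1
        for doc_id, expected in ground_truth.items()
        if predictions.get(doc_id) == expected
    )
    return matches / len(ground_truth)

# Example: metadata extraction from two (fictional) historical documents.
ground_truth = {
    "doc_001": {"year": "1848", "author": "A. Example"},
    "doc_002": {"year": "1850", "author": "B. Sample"},
}
predictions = {
    "doc_001": {"year": "1848", "author": "A. Example"},
    "doc_002": {"year": "1851", "author": "B. Sample"},  # wrong year
}

print(score_task(predictions, ground_truth))  # 0.5
```

Real benchmarks in the suite define their own task-specific metrics and prompts; the point of this sketch is only the overall pattern of explicit ground truth plus a documented scoring procedure.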

The discussion paper “From Experiments to Epistemic Practice: The RISE Humanities Data Benchmark” traces how the suite emerged from RISE's consulting practice and reflects on the methodological challenges of applying benchmarking to humanities contexts. It argues that ground truth in humanities benchmarking is not a matter of objective correctness but of explicit, scholar-defined interpretive choices, and that benchmarking should therefore be understood as an epistemic practice rather than a neutral measurement.

Both papers contribute to the JOHD special collection “Benchmarking in Digital Humanities”, which aims to establish benchmarking as common practice in the humanities. The framework promotes evidence-based decision-making about which models to use for specific tasks and provides quantifiable comparisons between different LLMs via an interactive dashboard.

Researchers interested in using the benchmark with their own materials are welcome to get in touch. In their roles at RISE, Maximilian Hindermann, Sorin Marti, and Arno Bosse advise researchers on the use of computational methods and large language models for humanities research projects and are happy to discuss how the framework can be applied to new data and research contexts.

Citations:

Hindermann, M., Marti, S., Kasper, L. K., & Bosse, A. (2026). The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks. Journal of Open Humanities Data, 12(1), 24. https://doi.org/10.5334/johd.481

Hindermann, M., Kasper, L. K., Marti, S., & Bosse, A. (2026). From Experiments to Epistemic Practice: The RISE Humanities Data Benchmark. Journal of Open Humanities Data, 12(1), 38. https://doi.org/10.5334/johd.470

Source: https://rise.unibas.ch/en/news/details/two-new-papers-on-llm-benchmarking-for-humanities-tasks/

08. December 2025

The RISE Humanities Data Benchmark Platform Is Now Online

We are pleased to announce the public launch of the RISE Humanities Data Benchmark, a new research infrastructure designed to support systematic, transparent, and reproducible evaluation of large language models on humanities-oriented tasks.

The platform brings together a growing suite of benchmark datasets derived from historical documents, bibliographic sources, index cards, and other forms of cultural heritage material. Each benchmark includes detailed contextual information, clearly defined ground truth, and openly documented evaluation procedures. Together, these resources provide an evidence-based foundation for assessing how well current models handle the complex, data-rich challenges commonly encountered in humanities research.

In addition to offering full access to all benchmark descriptions, the platform provides:

  • A comprehensive leaderboard, enabling cross-model and cross-provider comparison
  • Interactive visualisations of accuracy, consistency, efficiency, and cost
  • Searchable test-run archives with detailed outputs, scoring metrics, and configurations
  • Guidance and tooling for creating and contributing new benchmarks
  • Documentation to support transparency and reuse

With this launch, our goal is to create a shared, extensible environment that facilitates rigorous evaluation, fosters methodological discussion, and encourages community contributions. We invite researchers, practitioners, and institutions to explore the platform, reuse our benchmark setups, and experiment with their own datasets.

The RISE Humanities Data Benchmark will continue to evolve as new benchmarks, evaluation methods, and model providers are added. We welcome your feedback and look forward to collaborative development in the coming months.