News

17. June 2026

New "Which Model for Me?" tab: a guided model recommender

The benchmark dashboard now includes a "Which Model for Me?" tab that turns our benchmark results into a concrete recommendation for your own project, so you don't have to read every score table yourself.

Instead of asking you to know which benchmark matches your material, it asks about your data and task. You describe what you're working with (document type, handwriting or print, language, century, and the task, such as transcription, extraction or layout analysis), and the tool maps your answers to the closest benchmarks in the suite. You can fine-tune the matched benchmarks, or switch to expert mode and pick them directly.

Three further inputs shape the result:

Scale: from a one-off run to a 100k+ document pipeline. The larger the scale, the more of the score the tool is willing to trade for lower cost or higher throughput.
Optimize for: cost or time. With run data loaded, recommendations are ranked by price-per-run or seconds-per-run within a score band, rather than by raw score alone.
Data sensitivity: a simple choice between non-sensitive and sensitive data. Non-sensitive data may use cloud-hosted models; sensitive data restricts the recommendation to on-premise models (e.g., hosted by sciCORE) that keep your documents on site. Always follow your own institution's data-classification directive.
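
To make the interplay of these three inputs concrete, here is a minimal Python sketch of how such a ranking could work. It is an illustration only, not the dashboard's actual code: the `Candidate` fields, the scale names and the band widths in `SCORE_BAND` are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    score: float            # normalized 0-100 score on the matched benchmarks
    cost_per_run: float     # price per run
    seconds_per_run: float  # time per run
    on_premise: bool        # True if the model keeps documents on site


# Hypothetical band widths: a larger scale tolerates a larger drop
# from the best available score in exchange for cost or speed.
SCORE_BAND = {"one-off": 2.0, "project": 5.0, "pipeline": 10.0}


def recommend(candidates, scale, optimize_for, sensitive):
    """Return candidates ranked best-first; the first entry is the recommendation."""
    # Sensitive data restricts the pool to on-premise models.
    pool = [c for c in candidates if c.on_premise or not sensitive]
    if not pool:
        return []
    # Keep only models whose score lies within the band below the best score.
    best = max(c.score for c in pool)
    band = [c for c in pool if c.score >= best - SCORE_BAND[scale]]

    # Rank by cost or time within the band; ties go to the higher score.
    def resource(c):
        return c.cost_per_run if optimize_for == "cost" else c.seconds_per_run

    return sorted(band, key=lambda c: (resource(c), -c.score))
```

With a narrow band (one-off run) the top-scoring model wins almost regardless of price, while a wide band (large pipeline) lets a cheaper or faster model move to the front. The remaining entries of the returned list play the role of the alternatives.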

The output is a single recommended model with its per-benchmark scores, cost and speed where available, and a clear privacy note, plus a short list of alternatives. Each score is a normalized 0–100 value that puts every benchmark's native metric (F1, character error rate, and so on) on a common scale, so results from very different tasks stay comparable. Scores always reflect the latest non-legacy run for each model, so recommendations track the current generation rather than retired configurations.
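
The two ideas in that paragraph, a common 0–100 scale and the "latest non-legacy run" rule, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the linear mappings in `normalize` and the fields of `Run` are hypothetical, and the authoritative definitions are the ones documented in the repository.

```python
from dataclasses import dataclass
from datetime import date


def normalize(metric: str, value: float) -> float:
    """Map a native metric to a 0-100 scale where higher is better (assumed mapping)."""
    if metric == "f1":
        # F1 lies in [0, 1] and higher is better.
        score = value * 100.0
    elif metric == "cer":
        # Character error rate: lower is better and it can exceed 1.
        score = (1.0 - value) * 100.0
    else:
        raise ValueError(f"unknown metric: {metric}")
    # Clamp so that very poor runs do not produce negative scores.
    return max(0.0, min(100.0, score))


@dataclass
class Run:
    model: str
    run_date: date
    legacy: bool
    metric: str
    value: float


def latest_scores(runs):
    """Return {model: normalized score} using each model's latest non-legacy run."""
    latest = {}
    for run in runs:
        if run.legacy:
            continue  # retired configurations are ignored
        current = latest.get(run.model)
        if current is None or run.run_date > current.run_date:
            latest[run.model] = run
    return {model: normalize(r.metric, r.value) for model, r in latest.items()}
```

Under these assumptions an F1 of 0.75 and a character error rate of 0.25 both land at 75, which is what makes a transcription benchmark and an extraction benchmark comparable side by side.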

As always, the underlying data is open and the methodology is documented in the RISE Humanities Data Benchmark repository.

27. March 2026

The Journal of Open Humanities Data (JOHD) has published two papers by Maximilian Hindermann, Sorin Marti, Lea Katharina Kasper and Arno Bosse on the RISE Humanities Data Benchmark. Both address the same questions: which large language models perform best on humanities research tasks, and how can their capabilities be compared systematically?

The data paper “The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks” presents a framework for assessing the performance of large language models on humanities-related tasks. The benchmark suite (available on GitHub) includes text and image datasets, prompts, ground truths, and evaluation scripts, and addresses tasks essential to digital humanities work, including document analysis, transcription, and metadata extraction from historical materials.

The discussion paper “From Experiments to Epistemic Practice: The RISE Humanities Data Benchmark” traces how the suite emerged from RISE's consulting practice and reflects on the methodological challenges of applying benchmarking to humanities contexts. It argues that ground truth in humanities benchmarking is not a matter of objective correctness but of explicit, scholar-defined interpretive choices, and that benchmarking should therefore be understood as an epistemic practice rather than a neutral measurement.

Both papers contribute to the JOHD special collection “Benchmarking in Digital Humanities”, which aims to establish benchmarking as common practice in the humanities. The framework promotes evidence-based decision-making on which models to use for specific tasks and provides quantifiable comparisons between different LLMs through an interactive dashboard.

Researchers interested in using the benchmark with their own materials are welcome to get in touch. In their roles at RISE, Maximilian Hindermann, Sorin Marti and Arno Bosse advise researchers on the use of computational methods and large language models for humanities research projects and are happy to discuss how the framework can be applied to new data and research contexts.

Citations:

Hindermann, M., Marti, S., Kasper, L. K., & Bosse, A. (2026). The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks. Journal of Open Humanities Data, 12(1), 24. https://doi.org/10.5334/johd.481

Hindermann, M., Kasper, L. K., Marti, S., & Bosse, A. (2026). From Experiments to Epistemic Practice: The RISE Humanities Data Benchmark. Journal of Open Humanities Data, 12(1), 38. https://doi.org/10.5334/johd.470

Source: https://rise.unibas.ch/en/news/details/two-new-papers-on-llm-benchmarking-for-humanities-tasks/

08. December 2025

The RISE Humanities Data Benchmark Platform Is Now Online

We are pleased to announce the public launch of the RISE Humanities Data Benchmark, a new research infrastructure designed to support systematic, transparent, and reproducible evaluation of large language models on humanities-oriented tasks.

The platform brings together a growing suite of benchmark datasets derived from historical documents, bibliographic sources, index cards, and other forms of cultural heritage material. Each benchmark includes detailed contextual information, clearly defined ground truth, and openly documented evaluation procedures. Together, these resources provide an evidence-based foundation for assessing how well current models handle the complex, data-rich challenges commonly encountered in humanities research.

In addition to offering full access to all benchmark descriptions, the platform provides:

A comprehensive leaderboard, enabling cross-model and cross-provider comparison
Interactive visualisations of accuracy, consistency, efficiency, and cost
Searchable test-run archives with detailed outputs, scoring metrics, and configurations
Guidance and tooling for creating and contributing new benchmarks
Documentation to support transparency and reuse

With this launch, our goal is to create a shared, extensible environment that facilitates rigorous evaluation, fosters methodological discussion, and encourages community contributions. We invite researchers, practitioners, and institutions to explore the platform, reuse our benchmark setups, and experiment with their own datasets.

The RISE Humanities Data Benchmark will continue to evolve as new benchmarks, evaluation methods, and model providers are added. We welcome your feedback and look forward to collaborative development in the coming months.