RISE Humanities Data Benchmark, 0.4.0

Contribute

Overview

We welcome contributions to the RISE Humanities Data Benchmark.
This page explains how you can create your own benchmark dataset, test it locally, and submit it for inclusion in the official repository.

This guide walks you through the complete contribution workflow:

  • Try it yourself
    Create a benchmark locally using our CLI helper tools.
  • Complete the Checklist
    Ensure your benchmark meets the requirements for inclusion.
  • Submit your Benchmark
    Prepare a pull request to the official repository.

Whether you want to contribute a new dataset, enrich an existing one, or experiment with the evaluation logic, this guide will help you get started.

 

Try it yourself

Fork the Repository

To contribute, you should first create your own fork:

  1. Visit the project:
    https://github.com/RISE-UNIBAS/humanities_data_benchmark
  2. Click “Fork” (top-right corner).
  3. Clone your fork locally:

git clone https://github.com/<your-username>/humanities_data_benchmark.git
cd humanities_data_benchmark

Forking ensures you can work independently without affecting the main project.

 

Create a Benchmark

The repository includes a CLI helper to generate the entire directory structure for a new benchmark.

From the project root, run:

python scripts/create_benchmark.py
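
As a rough sketch, the generated scaffolding should contain the directories and files this guide refers to; my_benchmark is a placeholder name, and the exact set of generated files may differ between versions:

benchmarks/my_benchmark/
 +--README.md
 +--benchmark.py
 +--images/
 +--texts/
 +--ground_truths/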

 

Add Context Data

Context data are the inputs that will be sent to the LLM. Depending on the benchmark, this may include:

  • .txt, .json and other text-only files (historical texts, metadata records, descriptions, OCR fragments)
  • .jpg, .png or other image types (manuscript pages, document snippets, photos)

Naming convention:

  • The filename without its extension is treated as the context object name
  • All files with the same basename across the context directories (images, texts) are sent together in a single request
  • You must provide a ground truth file for each basename

Example:

benchmarks/<benchmark_name>/images/
 +--page_1.jpg
 +--page_2.jpg
 +--page_3.jpg
benchmarks/<benchmark_name>/texts/
 +--page_1.txt
 +--page_3.json
benchmarks/<benchmark_name>/ground_truths/
 +--page_1.json
 +--page_2.json
 +--page_3.json

This results in three requests: 'page_1' sends an image and a .txt file, 'page_2' sends only an image, and 'page_3' sends an image and a .json file.

 

Add Ground Truth Files

Ground truth files belong in:

ground_truths/page_1.json
ground_truths/page_2.json
ground_truths/page_3.json


Each JSON file should faithfully encode the expected output for the benchmark and follow the schema defined in the README template.

A ground truth must be:

  • Manually checked
  • Unambiguous
  • As complete as possible
  • Internally consistent with your scoring logic
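
As a purely hypothetical illustration (the actual schema is whatever you define in your benchmark's README), a ground truth file for a transcription-style benchmark could be written like this; whether you author the JSON by hand or generate it, make sure the content is manually checked:

import json

# Hypothetical ground truth for the context object "page_1".
# The field names are illustrative only and must match whatever
# your scoring logic expects.
ground_truth = {
    "transcription": "Example of the manually verified text of page 1.",
    "language": "de",
}

# "my_benchmark" is a placeholder benchmark name.
with open("benchmarks/my_benchmark/ground_truths/page_1.json", "w", encoding="utf-8") as f:
    json.dump(ground_truth, f, ensure_ascii=False, indent=2)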

 

Implement Scoring Logic

Each benchmark has a corresponding benchmark class in benchmark.py. Two methods must be implemented for the scoring to work:

Implement the scoring of a single object/request:

In most cases this is not as simple as checking whether the LLM response and the ground truth are equal.

Return at least one metric. Commonly used metrics are fuzzy, f1_score, and cer.

def score_request_answer(self, object_name, response, ground_truth):
    # object_name: basename of the processed files
    # response: the large language model response
    # ground_truth: the corresponding ground truth

    calculated_score = 0
    # implement the scoring for one object here

    return {"fuzzy": calculated_score}
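
As a minimal sketch of a concrete implementation, a transcription-style benchmark might compare two text fields with a fuzzy string match. This assumes the response and the ground truth are dictionaries with a transcription key and that the rapidfuzz package is installed; both are assumptions for the example, not requirements of the framework:

from rapidfuzz import fuzz

def score_request_answer(self, object_name, response, ground_truth):
    # Hypothetical field name; adapt this to your own schema.
    predicted = response.get("transcription", "")
    expected = ground_truth.get("transcription", "")

    # fuzz.ratio returns a similarity between 0 and 100; normalise to 0..1.
    calculated_score = fuzz.ratio(predicted, expected) / 100.0

    return {"fuzzy": calculated_score}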

 

Implement the scoring of the whole test run:

Take the average, or use any other aggregation, to score across all requests of the test run.

def score_benchmark(self, all_scores):
    total_score = 0
    for score in all_scores:
        total_score += score['fuzzy']
    return {"fuzzy": total_score / len(all_scores)}

Examples of existing scoring logic can be found in other benchmarks in the repository.

 

Run Local Tests

A second CLI tool allows you to run tests on your benchmark locally:

python scripts/run_single_test.py --adhoc


This command:

  • executes the prompt(s)
  • collects the model responses
  • applies your scoring logic
  • outputs a score summary

You should ensure the benchmark runs end-to-end without errors.

 

Complete the Checklist

Before submitting a pull request, please make sure your benchmark meets all of the following criteria:

Data Requirements

  • Dataset is not too large
    Recommended: < 50 MB total
  • Ground truths are manually checked and reliable
  • Data is legally usable
    • Must be openly available
    • No copyrighted or sensitive data
    • A clear license is indicated (preferably open license)
  • Context files are properly named and paired with ground_truths

Technical Requirements

  • Benchmark runs locally without errors
  • Scoring metric is clearly defined and documented
  • Directory structure follows the template
  • README.md inside the benchmark folder is fully completed
    (template is created automatically by create_benchmark.py)

Quality Requirements

  • Outputs are deterministic enough for fair evaluation
  • The benchmark fills a clear research gap (new task, domain, or corpus)
  • Instructions do not bias the LLM toward “right answers” via over-specification

 

Submit your benchmark

Once everything is ready, you can submit your contribution.

Step 1 — Push Your Changes

Push your benchmark to your fork:

git add benchmarks/your_benchmark_name
git commit -m "Add new benchmark: your_benchmark_name"
git push origin main

(Or push a feature branch if you prefer.)

 

Step 2 — Open a Pull Request

  • Go to your fork on GitHub.
  • Click “Compare & pull request”.
  • Target:

RISE-UNIBAS/humanities_data_benchmark → main

Add a short description:

  • What your benchmark tests
  • Example ground truth formats
  • Scoring logic summary
  • How you validated the dataset
  • Any remaining issues or questions

 

Step 3 — Internal Review

The maintainers will:

  • run the benchmark locally
  • check the metric
  • confirm licensing
  • validate folder structure
  • potentially request revisions

Once everything is green, your benchmark will be merged into the main repository.