We welcome contributions to the RISE Humanities Data Benchmark.
This page explains how you can create your own benchmark dataset, test it locally, and submit it for inclusion in the official repository.
This guide walks you through the complete contribution workflow. Whether you want to contribute a new dataset, enrich an existing one, or experiment with the evaluation logic, it will help you get started.
Fork the Repository
To contribute, first create your own fork on GitHub and clone it:
git clone https://github.com/<your-username>/humanities_data_benchmark.git
cd humanities_data_benchmark
Forking ensures you can work independently without affecting the main project.
Create a Benchmark
The repository includes a CLI helper to generate the entire directory structure for a new benchmark.
From the project root, run:
python scripts/create_benchmark.py
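The exact files the helper creates are defined by the script itself; as a rough, non-authoritative sketch based on the directories and files mentioned later in this guide, expect a layout along these lines:
benchmarks/<benchmark_name>/
+--images/
+--texts/
+--ground_truths/
+--benchmark.py
+--README.md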
Add Context Data
Context data are the inputs that will be sent to the LLM. Depending on the benchmark, this may include images (for example, scanned pages) as well as accompanying text or JSON files.
Naming convention: files that belong to the same request must share the same basename (for example, page_1) across the images/, texts/, and ground_truths/ directories.
Example:
benchmarks/<benchmark_name>/images/
+--page_1.jpg
+--page_2.jpg
+--page_3.jpg
benchmarks/<benchmark_name>/texts/
+--page_1.txt
+--page_3.json
benchmarks/<benchmark_name>/ground_truths/
+--page_1.json
+--page_2.json
+--page_3.json
This results in three requests: 'page_1' will send an image and a .txt file, 'page_2' will only send an image, and 'page_3' will send an image and a .json file.
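To make this pairing concrete, the illustrative sketch below (not the framework's actual code; the benchmark path is a placeholder) groups context files by basename in the same way the requests above are formed:
from collections import defaultdict
from pathlib import Path

# Illustrative only: group context files by their basename ("page_1", ...).
# The real pairing is handled by the benchmark framework.
benchmark_dir = Path("benchmarks/<benchmark_name>")  # placeholder

requests = defaultdict(list)
for subdir in ("images", "texts"):
    for file in sorted((benchmark_dir / subdir).glob("*")):
        requests[file.stem].append(file.name)

for object_name, files in sorted(requests.items()):
    print(object_name, "->", files)
# page_1 -> ['page_1.jpg', 'page_1.txt']
# page_2 -> ['page_2.jpg']
# page_3 -> ['page_3.jpg', 'page_3.json']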
Add Ground Truth Files
Ground truth files belong in:
ground_truths/page_1.json
ground_truths/page_2.json
ground_truths/page_3.json
Each JSON file should faithfully encode the expected output for the benchmark and follow the schema defined in the README template.
At a minimum, a ground truth must be valid JSON and consistent with that schema.
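As an optional local sanity check (an assumption on my part, not part of the repository's tooling), you can confirm that every ground truth file parses as valid JSON:
import json
from pathlib import Path

# Hypothetical check: replace <benchmark_name> with your benchmark's directory.
for path in sorted(Path("benchmarks/<benchmark_name>/ground_truths").glob("*.json")):
    with open(path, encoding="utf-8") as handle:
        json.load(handle)  # raises json.JSONDecodeError if the file is invalid
    print(f"{path.name}: OK")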
Implement Scoring Logic
Each benchmark has a corresponding benchmark class in benchmark.py. Two methods must be implemented for scoring to work:
Implement the scoring of a single object/request:
Implement the scoring for a single request. Most of the time it is not as simple as checking whether the LLM response and the ground truth are equal.
Return at least one metric; commonly used metrics are fuzzy, f1_score, and cer.
def score_request_answer(self, object_name, response, ground_truth):
    # object_name: basename of the processed files
    # response: the large language model's response
    # ground_truth: the corresponding ground truth
    calculated_score = 0
    # implement the scoring for one object here
    return {"fuzzy": calculated_score}
Implement the scoring of the whole test run:
Take the mean, or apply any other aggregation, to score across all requests in the test run.
def score_benchmark(self, all_scores):
    total_score = 0
    for score in all_scores:
        total_score += score['fuzzy']
    return {"fuzzy": total_score / len(all_scores)}
Examples of existing scoring logic can be found in other benchmarks in the repository.
Run Local Tests
A second CLI tool allows you to run tests on your benchmark locally:
python scripts/run_single_test.py --adhoc
This runs your benchmark end-to-end so you can verify that it completes without errors before submitting.
Before submitting a pull request, please make sure your benchmark meets all of the following criteria:
Data Requirements
Technical Requirements
Quality Requirements
Once everything is ready, you can submit your contribution.
Step 1 — Push Your Changes
Push your benchmark to your fork:
git add benchmarks/your_benchmark_name
git commit -m "Add new benchmark: your_benchmark_name"
git push origin main
(Or push a feature branch if you prefer.)
Step 2 — Open a Pull Request
Open a pull request from your fork against RISE-UNIBAS/humanities_data_benchmark → main.
Step 3 — Internal Review
The maintainers will review your submission. Once everything is green, your benchmark will be merged into the main repository.