We welcome contributions to the RISE Humanities Data Benchmark.
This page explains how you can create your own benchmark dataset, test it locally, and submit it for inclusion in the official repository.
This guide walks you through the complete contribution workflow. Whether you want to contribute a new dataset, enrich an existing one, or experiment with the evaluation logic, it will help you get started.
Fork the Repository
To contribute, first create your own fork on GitHub and clone it:
git clone https://github.com/<your-username>/humanities_data_benchmark.git
cd humanities_data_benchmark
Forking ensures you can work independently without affecting the main project.
Create a Benchmark
The repository includes a CLI helper to generate the entire directory structure for a new benchmark.
From the project root, run:
python scripts/create_benchmark.py
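The exact files the helper creates are defined by the script itself; as a rough, non-authoritative sketch based on the directories and files mentioned later in this guide, expect a layout along these lines:
benchmarks/<benchmark_name>/
+--images/
+--texts/
+--ground_truths/
+--benchmark.py
+--README.md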
Add Context Data
Context data are the inputs that will be sent to the LLM. Depending on the benchmark, this may include images (for example, scanned pages) as well as accompanying text or JSON files.
Naming convention: files that belong to the same request must share the same basename (for example, page_1) across the images/, texts/, and ground_truths/ directories.
Example:
benchmarks/<benchmark_name>/images/
+--page_1.jpg
+--page_2.jpg
+--page_3.jpg
benchmarks/<benchmark_name>/texts/
+--page_1.txt
+--page_3.json
benchmarks/<benchmark_name>/ground_truths/
+--page_1.json
+--page_2.json
+--page_3.json
This results in three requests: 'page_1' will send an image and a .txt file, 'page_2' will only send an image, and 'page_3' will send an image and a .json file.
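To make this pairing concrete, the illustrative sketch below (not the framework's actual code; the benchmark path is a placeholder) groups context files by basename in the same way the requests above are formed:
from collections import defaultdict
from pathlib import Path

# Illustrative only: group context files by their basename ("page_1", ...).
# The real pairing is handled by the benchmark framework.
benchmark_dir = Path("benchmarks/<benchmark_name>")  # placeholder

requests = defaultdict(list)
for subdir in ("images", "texts"):
    for file in sorted((benchmark_dir / subdir).glob("*")):
        requests[file.stem].append(file.name)

for object_name, files in sorted(requests.items()):
    print(object_name, "->", files)
# page_1 -> ['page_1.jpg', 'page_1.txt']
# page_2 -> ['page_2.jpg']
# page_3 -> ['page_3.jpg', 'page_3.json']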
Add Ground Truth Files
Ground truth files belong in:
ground_truths/page_1.json
ground_truths/page_2.json
ground_truths/page_3.json
Each JSON file should faithfully encode the expected output for the benchmark and follow the schema defined in the README template.
At a minimum, a ground truth must be valid JSON and consistent with that schema.
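As an optional local sanity check (an assumption on my part, not part of the repository's tooling), you can confirm that every ground truth file parses as valid JSON:
import json
from pathlib import Path

# Hypothetical check: replace <benchmark_name> with your benchmark's directory.
for path in sorted(Path("benchmarks/<benchmark_name>/ground_truths").glob("*.json")):
    with open(path, encoding="utf-8") as handle:
        json.load(handle)  # raises json.JSONDecodeError if the file is invalid
    print(f"{path.name}: OK")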
Implement Scoring Logic
Each benchmark has a corresponding benchmark class in benchmark.py. Two methods must be implemented for scoring to work:
Implement the scoring of a single object/request:
Implement the scoring for a single request. Most of the time it is not as simple as checking whether the LLM response and the ground truth are equal.
Return at least one metric; commonly used metrics are fuzzy, f1_score, and cer.
def score_request_answer(self, object_name, response, ground_truth):
    # object_name: basename of the processed files
    # response: the large language model's response
    # ground_truth: the corresponding ground truth
    calculated_score = 0
    # implement the scoring for one object here
    return {"fuzzy": calculated_score}
Implement the scoring of the whole test run:
Take the mean, or apply any other aggregation, to score across all requests in the test run.
def score_benchmark(self, all_scores):
    total_score = 0
    for score in all_scores:
        total_score += score['fuzzy']
    return {"fuzzy": total_score / len(all_scores)}
Examples of existing scoring logic can be found in other benchmarks in the repository.
Run Local Tests
A second CLI tool allows you to run tests on your benchmark locally:
python scripts/run_single_test.py --adhoc
This runs your benchmark end-to-end so you can verify that it completes without errors before submitting.
Before submitting a pull request, please make sure your benchmark meets all of the following criteria:
Data Requirements
Technical Requirements
Quality Requirements
Once everything is ready, you can submit your contribution.
Step 1 — Push Your Changes
Push your benchmark to your fork:
git add benchmarks/your_benchmark_name
git commit -m "Add new benchmark: your_benchmark_name"
git push origin main
(Or push a feature branch if you prefer.)
Step 2 — Open a Pull Request
Open a pull request from your fork against RISE-UNIBAS/humanities_data_benchmark → main.
Step 3 — Internal Review
The maintainers will review your submission. Once everything is green, your benchmark will be merged into the main repository.