RISE Humanities Data Benchmark, 0.5.1

Search Test Runs

 

A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.

A test run includes:

  • Prompt and role definition – what the model was asked to do and from what perspective (e.g. “as a historian”).
  • Model configuration – provider, model version, temperature, and other generation parameters.
  • Results – the model’s actual response and its evaluation (scores such as F1 or accuracy).
  • Usage and cost data – token counts and calculated API costs.
  • Metadata – information like the test date, benchmark name, and person who executed it.

Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
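
The components listed above can be pictured as one record per run. The following is a minimal sketch in Python, not the benchmark's actual schema: the class name `RunRecord`, all field names, and the placeholder values in the usage lines are hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical record mirroring the listed components of a test run."""
    # Metadata
    test_id: str
    date: str
    benchmark: str
    # Model configuration
    provider: str
    model: str
    temperature: float
    # Prompt and role definition
    prompt: str
    role: Optional[str] = None
    # Results: raw response plus evaluation scores (e.g. F1, accuracy)
    response: Optional[str] = None
    scores: dict = field(default_factory=dict)
    # Usage and cost data
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0

    @property
    def total_tokens(self) -> int:
        # Total usage is the sum of input and output tokens.
        return self.input_tokens + self.output_tokens


# Placeholder values for illustration only; not taken from a real run.
run = RunRecord(
    test_id="T0000", date="2026-01-01", benchmark="example_benchmark",
    provider="example-provider", model="example-model", temperature=0.0,
    prompt="Fix this xml.", input_tokens=1200, output_tokens=300,
)
print(run.total_tokens)
```

Keeping configuration, results, and usage in one record is what makes two runs directly comparable: any field that differs between them is visible in the same place.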

Search Results

Your search for Benchmark 'book_advert_xml__true' with Search Hidden 'False' returned 104 results, showing page 1 of 11.
Result 1 of 104

Test T1047 at 2026-04-27

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openai
Model: gpt-5.5-2026-04-23
Temperature: 1.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
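
The prompt asks for a JSON object with three keys, so a response can be checked mechanically before it is scored. The sketch below is one way to do that and is not the benchmark's own validation code: the function name `check_response` and the element names `advert` and `title` in the sample data are hypothetical, and the XML check tests well-formedness only.

```python
import json
import xml.etree.ElementTree as ET

# Keys named in the prompt.
REQUIRED_KEYS = {"fixed_xml", "number_of_fixes", "explanation"}


def check_response(raw: str) -> list[str]:
    """Return a list of problems found in a model response (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(data))]
    if "fixed_xml" in data:
        try:
            # Parsing succeeds only if the XML is well-formed.
            ET.fromstring(data["fixed_xml"])
        except (ET.ParseError, TypeError) as exc:
            problems.append(f"fixed_xml is not well-formed: {exc}")
    return problems


# Hypothetical sample responses; the element names are made up.
good = json.dumps({
    "fixed_xml": "<advert><title>A New Book</title></advert>",
    "number_of_fixes": 1,
    "explanation": "Closed the title tag.",
})
bad = json.dumps({"fixed_xml": "<advert><title>A New Book</advert>"})

print(check_response(good))
print(check_response(bad))
```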

Results

no valid result

Scoring
Fuzzy Score: 98.48
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 33.6K IT + 62.5K OT = 96.1K TT
Cost: 0.168$ / 1.875$ / 2.043$
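
The token and cost figures of test T1047 can be cross-checked with a few lines of arithmetic. This sketch rests on two assumptions that the page does not state: that IT, OT, and TT stand for input, output, and total tokens, and that the three cost figures are input cost, output cost, and total cost in that order. Because the displayed values are rounded, the checks allow a small tolerance.

```python
# Figures copied from the cost line of test T1047.
input_tokens = 33.6e3    # "33.6K IT"
output_tokens = 62.5e3   # "62.5K OT"
total_tokens = 96.1e3    # "96.1K TT"
costs = (0.168, 1.875, 2.043)  # assumed order: input, output, total

input_cost, output_cost, total_cost = costs

# Totals should match the sum of the parts up to display rounding.
tokens_ok = abs((input_tokens + output_tokens) - total_tokens) < 100
cost_ok = abs((input_cost + output_cost) - total_cost) < 0.002

# Implied price per million tokens under the assumptions above.
input_price = input_cost / input_tokens * 1_000_000
output_price = output_cost / output_tokens * 1_000_000

print(tokens_ok, cost_ok)
print(round(input_price, 2), round(output_price, 2))
```

Under these assumptions the run works out to roughly 5 per million input tokens and 30 per million output tokens, in the currency shown on the page.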
Result 2 of 104

Test T1035 at 2026-04-24

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: deepseek
Model: deepseek-v4-flash
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 97.01
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 39.4K IT + 209.8K OT = 249.3K TT
Cost: 0.001$ / 0.059$ / 0.060$
Result 3 of 104

Test T1036 at 2026-04-24

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: deepseek
Model: deepseek-v4-pro
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 97.42
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 39.4K IT + 412.5K OT = 452.0K TT
Cost: 0.006$ / 1.436$ / 1.441$
Result 4 of 104

Test T0993 at 2026-04-22

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-9b-20260310
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 18.80
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 35.1K IT + 2.0M OT = 2.1M TT
Cost: 0.004$ / 0.303$ / 0.306$
Result 5 of 104

Test T0967 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-plus-20260216
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 95.08
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 39.2K IT + 428.4K OT = 467.6K TT
Cost: 0.010$ / 0.668$ / 0.679$
Result 6 of 104

Test T0941 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-35b-a3b-20260224
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 59.95
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 37.2K IT + 1.7M OT = 1.8M TT
Cost: 0.006$ / 2.270$ / 2.277$
Result 7 of 104

Test T0954 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-397b-a17b-20260216
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 93.84
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 32.2K IT + 410.9K OT = 443.0K TT
Cost: 0.013$ / 0.961$ / 0.974$
Result 8 of 104

Test T1006 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: google/gemma-4-26b-a4b-it-20260403
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 98.21
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 33.0K IT + 33.7K OT = 66.7K TT
Cost: 0.003$ / 0.012$ / 0.014$
Result 9 of 104

Test T1032 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: anthropic
Model: claude-opus-4-7
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 97.62
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 101.4K IT + 50.8K OT = 152.2K TT
Cost: 0.507$ / 1.270$ / 1.777$
Result 10 of 104

Test T0928 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-27b-20260224
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 95.90
F1 micro / macro: n/a / n/a
Micro precision / recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a / n/a
Tokens: 39.2K IT + 1.3M OT = 1.4M TT
Cost: 0.008$ / 2.066$ / 2.074$
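
To compare the ten runs on this page side by side, their fuzzy scores and cost figures can be collected into one list and sorted. This is a small illustrative sketch, not a feature of the benchmark site. It assumes that the last of the three cost figures in each result is the total cost of the run, which the page does not state.

```python
# (test id, model, fuzzy score, last cost figure) copied from the ten results on this page.
# Treating the last cost figure as the total cost is an assumption.
runs = [
    ("T1047", "gpt-5.5-2026-04-23", 98.48, 2.043),
    ("T1035", "deepseek-v4-flash", 97.01, 0.060),
    ("T1036", "deepseek-v4-pro", 97.42, 1.441),
    ("T0993", "qwen/qwen3.5-9b-20260310", 18.80, 0.306),
    ("T0967", "qwen/qwen3.5-plus-20260216", 95.08, 0.679),
    ("T0941", "qwen/qwen3.5-35b-a3b-20260224", 59.95, 2.277),
    ("T0954", "qwen/qwen3.5-397b-a17b-20260216", 93.84, 0.974),
    ("T1006", "google/gemma-4-26b-a4b-it-20260403", 98.21, 0.014),
    ("T1032", "claude-opus-4-7", 97.62, 1.777),
    ("T0928", "qwen/qwen3.5-27b-20260224", 95.90, 2.074),
]

# Rank by fuzzy score, highest first, and find the run with the lowest cost figure.
by_score = sorted(runs, key=lambda r: r[2], reverse=True)
cheapest = min(runs, key=lambda r: r[3])

for test_id, model, fuzzy, cost in by_score:
    print(f"{test_id}  {fuzzy:6.2f}  {cost:6.3f}$  {model}")
print("lowest cost figure:", cheapest[0], cheapest[1])
```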