RISE Humanities Data Benchmark, 0.5.2-pre1

Search Test Runs

A test run is a single execution of a benchmark test using a defined model configuration.
Each run records how a particular large language model (LLM), such as GPT-4, Claude-3, or Gemini, performed on a given task at a specific time and with specific settings.

A test run includes:

  • Prompt and role definition – what the model was asked to do and from what perspective (e.g. “as a historian”).
  • Model configuration – provider, model version, temperature, and other generation parameters.
  • Results – the model’s actual response and its evaluation (scores such as F1 or accuracy).
  • Usage and cost data – token counts and calculated API costs.
  • Metadata – information like the test date, benchmark name, and person who executed it.

Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
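
The components listed above can be pictured as one record per run. The following is only a minimal sketch: the class and field names (`RunRecord`, `ModelConfig`, `Usage`) are hypothetical and are not taken from the benchmark's code base; the example values come from the first result in the listing on this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelConfig:
    """Provider, model version and generation parameters (names assumed)."""
    provider: str
    model: str
    temperature: float


@dataclass
class Usage:
    """Token counts and the calculated API cost, if known."""
    input_tokens: int
    output_tokens: int
    cost_usd: Optional[float] = None

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


@dataclass
class RunRecord:
    """One execution of a benchmark test (hypothetical layout)."""
    test_id: str
    date: str
    benchmark: str
    prompt: str
    config: ModelConfig
    role: Optional[str] = None          # e.g. "as a historian"
    response: Optional[str] = None      # None when no valid result exists
    scores: dict = field(default_factory=dict)
    usage: Optional[Usage] = None
    tags: dict = field(default_factory=dict)


run = RunRecord(
    test_id="T0954",
    date="2026-04-21",
    benchmark="book_advert_xml__true",
    prompt="Fix this xml. Add xml-tags if faulty where it makes sense.",
    config=ModelConfig("openrouter", "qwen/qwen3.5-397b-a17b-20260216", 0.0),
    scores={"fuzzy": 93.84},
    tags={"century": [18], "language": ["en"], "task": ["data-correction"]},
)
print(run.test_id, run.config.model, run.scores["fuzzy"])
```

Keeping configuration, usage and scores in separate nested records makes it straightforward to group or compare runs by any one of them.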

Search Results

Your search for Benchmark 'book_advert_xml__true' with Search Hidden 'False' returned 120 results, showing page 3 of 12.
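
The paging figures above are consistent with 10 results per page, a value inferred from 120 results spread over 12 pages rather than stated by the site. A small sketch of that arithmetic, using a hypothetical helper `page_bounds`:

```python
import math


def page_bounds(total_results: int, page: int, per_page: int):
    """Return (page_count, first_result, last_result) for a 1-based page."""
    page_count = math.ceil(total_results / per_page)
    if not 1 <= page <= page_count:
        raise ValueError(f"page must be between 1 and {page_count}")
    first = (page - 1) * per_page + 1
    last = min(page * per_page, total_results)
    return page_count, first, last


# per_page=10 is an assumption derived from 120 results / 12 pages.
print(page_bounds(120, 3, 10))  # (12, 21, 30)
```

Page 3 therefore covers results 21 to 30, which matches the entries listed below.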
Result 21 of 120

Test T0954 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-397b-a17b-20260216
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.
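
A response to this prompt could be checked along the following lines. This is a sketch under stated assumptions: the fields of `CorrectedAdvert` are assumed to mirror the three JSON keys named in the prompt, `parse_response` is a hypothetical helper, and the well-formedness check on `fixed_xml` is an illustrative choice, not a documented step of the benchmark.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrectedAdvert:
    # Field names assumed from the keys requested in the prompt.
    fixed_xml: str
    number_of_fixes: int
    explanation: str


def parse_response(raw: str) -> Optional[CorrectedAdvert]:
    """Return a CorrectedAdvert, or None if the response is unusable."""
    try:
        data = json.loads(raw)
        advert = CorrectedAdvert(
            fixed_xml=str(data["fixed_xml"]),
            number_of_fixes=int(data["number_of_fixes"]),
            explanation=str(data["explanation"]),
        )
        # Assumed extra check: the corrected XML must be well-formed.
        ET.fromstring(advert.fixed_xml)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, ET.ParseError):
        return None
    return advert


good = '{"fixed_xml": "<ad><title>Books</title></ad>", "number_of_fixes": 1, "explanation": "closed the title tag"}'
print(parse_response(good))
print(parse_response("not json"))
```

Returning `None` instead of raising keeps a single malformed response from aborting a whole batch of runs.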

Results

no valid result

Scoring
Fuzzy Score: 93.84
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 32.2K IT + 410.9K OT = 443.0K TT
Cost: 0.013$ input, 0.961$ output, 0.974$ total
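
The cost figures in each result read as input cost, output cost, and their sum. A minimal sketch of how such figures can be derived from token counts follows; the per-million-token rates are hypothetical placeholders, not the pricing table used by the benchmark, and `run_cost` is an illustrative helper.

```python
def run_cost(input_tokens: int, output_tokens: int,
             usd_per_m_input: float, usd_per_m_output: float):
    """Return (input_cost, output_cost, total_cost) in USD."""
    input_cost = input_tokens / 1_000_000 * usd_per_m_input
    output_cost = output_tokens / 1_000_000 * usd_per_m_output
    return input_cost, output_cost, input_cost + output_cost


# Hypothetical rates: 5 USD per million input tokens, 25 USD per million output tokens.
inp, out, total = run_cost(100_000, 50_000, 5.0, 25.0)
print(f"{inp:.3f}$ + {out:.3f}$ = {total:.3f}$")  # 0.500$ + 1.250$ = 1.750$
```

Because the listing shows rounded token counts and costs, recomputing a total from the displayed values can differ from the displayed total in the last digit.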
Result 22 of 120

Test T0941 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-35b-a3b-20260224
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 59.95
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 37.2K IT + 1.7M OT = 1.8M TT
Cost: 0.006$ input, 2.270$ output, 2.277$ total
Result 23 of 120

Test T1019 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: google/gemma-4-31b-it-20260402
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 96.32
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 33.3K IT + 33.9K OT = 67.3K TT
Cost: 0.004$ input, 0.013$ output, 0.017$ total
Result 24 of 120

Test T0928 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-27b-20260224
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 95.90
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 39.2K IT + 1.3M OT = 1.4M TT
Cost: 0.008$ input, 2.066$ output, 2.074$ total
Result 25 of 120

Test T0967 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-plus-20260216
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 95.08
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 39.2K IT + 428.4K OT = 467.6K TT
Cost: 0.010$ input, 0.668$ output, 0.679$ total
Result 26 of 120

Test T1032 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: anthropic
Model: claude-opus-4-7
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 97.62
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 101.4K IT + 50.8K OT = 152.2K TT
Cost: 0.507$ input, 1.270$ output, 1.777$ total
Result 27 of 120

Test T1006 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: google/gemma-4-26b-a4b-it-20260403
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 98.21
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 33.0K IT + 33.7K OT = 66.7K TT
Cost: 0.003$ input, 0.012$ output, 0.014$ total
Result 28 of 120

Test T0902 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.6-plus-04-02
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 93.52
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 38.5K IT + 572.4K OT = 610.8K TT
Cost: 0.013$ input, 1.116$ output, 1.129$ total
Result 29 of 120

Test T0980 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-flash-20260224
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 69.35
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 39.2K IT + 2.4M OT = 2.4M TT
Cost: 0.003$ input, 0.619$ output, 0.622$ total
Result 30 of 120

Test T0915 at 2026-04-21

{'document-type': ['newspaper-page'], 'century': [18], 'language': ['en'], 'task': ['data-correction']}

Configuration
Provider: openrouter
Model: qwen/qwen3.5-122b-a10b-20260224
Temperature: 0.0
Dataclass: CorrectedAdvert
Normalized Score: 100.00 %
Test time: unknown
Prompt

Fix this xml. Add xml-tags if faulty where it makes sense.
Format your response as JSON. Use the keys 'fixed_xml', 'number_of_fixes', 'explanation'.

Results

no valid result

Scoring
Fuzzy Score: 92.01
F1 micro / macro: n/a / n/a
Micro Precision / Recall: n/a / n/a
True/False Positives: Instances n/a, TP n/a, FP n/a, FN n/a
Costs / Pricing
Pricing Date: n/a, n/a
Tokens: 38.9K IT + 1.6M OT = 1.7M TT
Cost: 0.010$ input, 3.369$ output, 3.379$ total