This benchmark has been run 111 times. It is scored with a fuzzy metric.
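The page does not say how the fuzzy metric is computed. As a rough illustration only, the sketch below shows one common way such a score can be built: normalize both strings, compare them with Python's standard `difflib.SequenceMatcher`, and average the similarities on a 0 to 100 scale. The function names `fuzzy_score` and `benchmark_score`, the normalization steps, and the choice of `SequenceMatcher` are all assumptions, not the benchmark's actual implementation.

```python
from difflib import SequenceMatcher


def fuzzy_score(predicted: str, expected: str) -> float:
    """Similarity in [0, 1] between two strings (hypothetical helper).

    Assumption: case and whitespace differences should not be penalized,
    so both strings are lowercased and runs of whitespace are collapsed.
    """
    a = " ".join(predicted.lower().split())
    b = " ".join(expected.lower().split())
    if not a and not b:
        # Two empty answers are treated as a perfect match.
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def benchmark_score(pairs) -> float:
    """Average fuzzy similarity over (predicted, expected) pairs, scaled to 0-100.

    Assumption: the overall score is a plain mean; the real benchmark
    may weight fields or documents differently.
    """
    scores = [fuzzy_score(p, e) for p, e in pairs]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0
```

Under a scheme like this, a near-miss transcription still earns partial credit, whereas an exact-match metric would score it as zero.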
Tested providers: x-ai, openrouter, genai, openai, anthropic, deepseek, mistral, alibaba, scicore
Tested models: deepseek-v4-pro, claude-haiku-4-5-20251001, google/gemma-4-26b-a4b-it, qwen3.5-397b-a17b, qwen/qwen3.5-122b-a10b, gemini-2.0-flash, claude-opus-4-6, mistral-large-2411, qwen3.5-35b-a3b, claude-sonnet-4-5-20250929, mistral-small-2506, claude-3-7-sonnet-20250219, grok-4.20-0309-reasoning, gpt-5-nano, x-ai/grok-4, qwen3-235b-fp8, mistral-medium-2505, gemini-2.0-flash-lite, qwen/qwen3.5-flash-02-23, claude-opus-4-20250514, gemini-2.5-flash-lite, o3, meta-llama/llama-4-maverick, gpt-4o-mini, qwen/qwen3-vl-8b-instruct, gemini-2.5-flash-preview-09-2025, ministral-14b-2512, mistral-medium-2508, gemini-2.5-flash-lite-preview-09-2025, qwen3.5-122b-a10b, pixtral-12b, claude-opus-4-5-20251101, magistral-medium-2509, claude-opus-4-1-20250805, gpt-5.5-2026-04-23, gpt-5.4-2026-03-05, gemini-3.1-pro-preview, gpt-5, deepseek-chat, gemini-2.5-pro, gpt-5.2-2025-12-11, qwen3.5-27b, qwen/qwen3.5-27b, claude-sonnet-4-20250514, gemini-3.1-flash-lite-preview, qwen/qwen3-vl-8b-thinking, gpt-5-mini, claude-opus-4-7, gpt-4.1, magistral-small-2509, qwen3.5-flash-2026-02-23, gemini-2.5-flash, qwen/qwen3.5-35b-a3b, qwen/qwen3.5-397b-a17b, deepseek-reasoner, gpt-4.1-mini, qwen/qwen3-vl-30b-a3b-instruct, qwen3.5-plus-2026-02-15, qwen/qwen3.5-9b, claude-sonnet-4-6, deepseek-v4-flash, qwen/qwen3.6-plus, gpt-4.1-nano, gpt-4o, gemini-3-flash-preview, pixtral-large-2411, mistral-large-2512, ministral-8b-2512, gpt-5.3-codex, GLM-4.5V-FP8, google/gemma-4-31b-it, gemini-3-pro-preview, claude-3-opus-20240229, qwen/qwen3.5-plus-02-15, gpt-5.1-2025-11-13

| Score | Date | Provider | Model |
|---|---|---|---|
| 98.48 | 3 weeks ago | openai | gpt-5.5-2026-04-23 |
| 97.42 | 3 weeks ago | deepseek | deepseek-v4-pro |
| 97.01 | 3 weeks ago | deepseek | deepseek-v4-flash |
| 18.80 | 4 weeks ago | openrouter | qwen/qwen3.5-9b |
| 92.01 | 1 month ago | openrouter | qwen/qwen3.5-122b-a10b |

| Role | Contributors |
|---|---|
| Domain expert | Ina Serif |
| Data curator | Sorin Marti |