Our benchmarking framework currently consists of 11 datasets and 2 test sets. There are 811 test configurations. Each of these configurations sends a number of tasks to an LLM, then saves and scores the answers. Configurations have been run 1339 times, which translates to 60896 LLM requests. The datasets are rather small and consist of a total of 597 input files. For each of these files, there is a ground truth, which is used to score the answers of the 65 models from 10 providers.
The first test was run on 2025-03-01 and the latest one on 2026-03-24. This page was last updated on 2026-03-25. All the requests together took 1633792 to complete, and we spent about $453.48 on them.
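To put the totals above into perspective, a few per-unit averages can be derived from them. The following is a minimal sketch that only uses the figures quoted in the text. The variable names are illustrative and not part of the framework, and the averages say nothing about how requests are actually distributed across configurations.

```python
from datetime import date

# Figures copied from the text above.
runs = 1339                      # number of configuration runs
requests = 60896                 # total LLM requests
cost_usd = 453.48                # total spend in US dollars
first_test = date(2025, 3, 1)
latest_test = date(2026, 3, 24)

# Simple averages; real runs may differ widely from these means.
requests_per_run = requests / runs
cost_per_request = cost_usd / requests
days_covered = (latest_test - first_test).days

print(f"average requests per run: {requests_per_run:.2f}")
print(f"average cost per request: ${cost_per_request:.5f}")
print(f"days between first and latest test: {days_covered}")
```

On these numbers, a run sends roughly 45 requests on average, and a single request costs well under one cent.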
Measuring energy/CO2 impact is difficult and the following numbers are rough estimates. Our tests used between 34846.5 and 698757.6 Wh. This produced between 3.485 and 559.006 kg of CO2. To show the vast range of these assumed values: that amount of CO2 corresponds to a car journey of between 14.02 and 2249.07 kilometers.
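The assumptions behind these ranges can be roughly reconstructed from the published numbers themselves. The sketch below divides the CO2 bounds by the energy bounds and by the distance bounds, with the CO2 figures read as kilograms. The resulting factors are inferred from the arithmetic and are not values stated by the authors. The variable names are illustrative.

```python
# Lower and upper bounds copied from the text above.
energy_wh = (34846.5, 698757.6)   # energy used, in Wh
co2 = (3.485, 559.006)            # CO2 produced, read here as kg (assumption)
car_km = (14.02, 2249.07)         # equivalent car journey, in km

# Convert Wh to kWh so the factor comes out per kWh.
energy_kwh = tuple(wh / 1000 for wh in energy_wh)

# Implied grid emission factor at each bound (per kWh).
co2_per_kwh = tuple(c / e for c, e in zip(co2, energy_kwh))

# Implied car emission factor at each bound (per km).
co2_per_km = tuple(c / k for c, k in zip(co2, car_km))

print("implied CO2 per kWh (low, high):", [round(x, 3) for x in co2_per_kwh])
print("implied CO2 per km  (low, high):", [round(x, 3) for x in co2_per_km])
```

The per-kilometer factor comes out nearly identical at both bounds, which suggests the car comparison is derived from the CO2 estimate rather than from the energy figure directly. The per-kWh factor differs by about a factor of eight between the bounds, which, together with the twentyfold spread in the energy estimate, accounts for the wide CO2 range.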