A test run is a single execution of a benchmark test using a defined model configuration.
Each run represents how a particular large language model (LLM) — such as GPT-4, Claude-3, or Gemini — performed on a given task at a specific time, with specific settings.
A test run includes:
Together, test runs make it possible to compare models, providers, and configurations across benchmarks in a transparent and reproducible way.
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}
{'document-type': ['manuscript'], 'writing': ['handwritten'], 'century': [15], 'language': ['de'], 'layout': ['prose'], 'task': ['transcription']}