This page catalogs the evaluation benchmarks available through LLM Evaluation Jobs, organized by category. Use it to discover which benchmarks you can run, identify their task IDs, and check whether a benchmark requires additional credentials.
Some benchmarks require additional credentials. An organization or team admin must add these credentials as team-scoped secrets before any team member can use the benchmarks in an evaluation job:
- If a benchmark has Yes in the OpenAI Scorer column, the benchmark uses OpenAI models for scoring. An organization or team admin must add an OpenAI API key as a team secret. When you configure an evaluation job with a benchmark that has this requirement, set the Scorer API key field to the secret.
- If a benchmark has a link in the Gated HF Dataset column, the benchmark requires access to a gated Hugging Face dataset. An organization or team admin must request access to the dataset in Hugging Face. The admin then creates a Hugging Face user access token and configures a team secret with that token. When you configure a benchmark with this requirement, set the Hugging Face Token field to the secret.
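The two rules above can be sketched as a small lookup. This is an illustration only, not part of any LLM Evaluation Jobs API: the `Benchmark` class, the `CATALOG` dictionary, and the `required_fields` function are hypothetical names, and the entries are a few rows copied by hand from the tables on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """One catalog row (hypothetical structure, for illustration only)."""
    task_id: str
    openai_scorer: bool = False     # "Yes" in the OpenAI Scorer column
    gated_hf_dataset: bool = False  # entry in the Gated HF Dataset column


# A few rows copied from the tables on this page.
CATALOG = {
    b.task_id: b
    for b in [
        Benchmark("boolq"),
        Benchmark("simpleqa", openai_scorer=True),
        Benchmark("hle", gated_hf_dataset=True),
        Benchmark("mask", openai_scorer=True, gated_hf_dataset=True),
    ]
}


def required_fields(task_ids):
    """Return the job fields that need a team secret for the given task IDs."""
    fields = set()
    for task_id in task_ids:
        bench = CATALOG[task_id]
        if bench.openai_scorer:
            fields.add("Scorer API key")
        if bench.gated_hf_dataset:
            fields.add("Hugging Face Token")
    return sorted(fields)


print(required_fields(["boolq", "simpleqa"]))  # ['Scorer API key']
print(required_fields(["mask"]))  # ['Hugging Face Token', 'Scorer API key']
```

A check like this can help confirm, before you configure a job, which secrets an admin needs to have in place for the benchmarks you plan to run.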
Knowledge
Evaluate factual knowledge across domains such as science, language, and general reasoning.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| BoolQ | boolq | | | Boolean yes/no questions from natural language queries |
| GPQA Diamond | gpqa_diamond | | | Graduate-level science questions (highest quality subset) |
| HLE | hle | | Yes | Humanity's Last Exam: expert-level questions across many academic subjects |
| Lingoly | lingoly | | Yes | Linguistics olympiad problems |
| Lingoly Too | lingoly_too | | Yes | Extended linguistics challenge problems |
| MMIU | mmiu | | | Multimodal Multi-image Understanding benchmark |
| MMLU (0-shot) | mmlu_0_shot | | | Massive Multitask Language Understanding without examples |
| MMLU (5-shot) | mmlu_5_shot | | | Massive Multitask Language Understanding with 5 examples |
| MMLU-Pro | mmlu_pro | | | More challenging version of MMLU |
| ONET M6 | onet_m6 | | | Occupational knowledge benchmark |
| PAWS | paws | | | Paraphrase Adversaries from Word Scrambling: paraphrase identification |
| SevenLLM MCQ (English) | sevenllm_mcq_en | | | Multiple choice questions in English |
| SevenLLM MCQ (Chinese) | sevenllm_mcq_zh | | | Multiple choice questions in Chinese |
| SevenLLM QA (English) | sevenllm_qa_en | | | Question answering in English |
| SevenLLM QA (Chinese) | sevenllm_qa_zh | | | Question answering in Chinese |
| SimpleQA | simpleqa | Yes | | Straightforward factual question answering |
| SimpleQA Verified | simpleqa_verified | | | Verified subset of SimpleQA with validated answers |
| WorldSense | worldsense | | | Evaluates understanding of world knowledge and common sense |
Reasoning
Evaluate logical thinking, problem-solving, and common-sense reasoning capabilities.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE AQUA-RAT | agie_aqua_rat | | | Algebraic question answering with rationales |
| AGIE LogiQA (English) | agie_logiqa_en | | | Logical reasoning questions in English |
| AGIE LSAT Analytical Reasoning | agie_lsat_ar | | | LSAT analytical reasoning (logic games) problems |
| AGIE LSAT Logical Reasoning | agie_lsat_lr | | | LSAT logical reasoning questions |
| ARC Challenge | arc_challenge | | | Challenging science questions requiring reasoning (AI2 Reasoning Challenge) |
| ARC Easy | arc_easy | | | Easier set of science questions from the ARC dataset |
| BBH | bbh | | | BIG-Bench Hard: challenging tasks from BIG-Bench |
| CoCoNot | coconot | | | Counterfactual commonsense reasoning benchmark |
| CommonsenseQA | commonsense_qa | | | Commonsense reasoning questions |
| HellaSwag | hellaswag | | | Commonsense natural language inference |
| MUSR | musr | | | Multistep Soft Reasoning: multi-step reasoning over natural language narratives |
| PIQA | piqa | | | Physical commonsense reasoning |
| WinoGrande | winogrande | | | Commonsense reasoning via pronoun resolution |
Math
Evaluate mathematical problem-solving across difficulty levels, from grade school to competition-level problems.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE Math | agie_math | | | Advanced mathematical reasoning from AGIE benchmark suite |
| AGIE SAT Math | agie_sat_math | | | SAT mathematics questions |
| AIME 2024 | aime2024 | | | American Invitational Mathematics Examination problems from 2024 |
| AIME 2025 | aime2025 | | | American Invitational Mathematics Examination problems from 2025 |
| GSM8K | gsm8k | | | Grade School Math 8K: multi-step math word problems |
| InfiniteBench Math Calc | infinite_bench_math_calc | | | Mathematical calculations in long contexts |
| InfiniteBench Math Find | infinite_bench_math_find | | | Finding mathematical patterns in long contexts |
| MATH | math | | | Competition-level mathematics problems |
| MGSM | mgsm | | | Multilingual Grade School Math |
Code
Evaluate programming and software development capabilities such as debugging, code execution prediction, and function calling.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| BFCL | bfcl | | | Berkeley Function Calling Leaderboard: tests function calling and tool use capabilities |
| InfiniteBench Code Debug | infinite_bench_code_debug | | | Long-context code debugging tasks |
| InfiniteBench Code Run | infinite_bench_code_run | | | Long-context code execution prediction |
Reading
Evaluate reading comprehension and information extraction from complex texts.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE LSAT Reading Comprehension | agie_lsat_rc | | | LSAT reading comprehension passages and questions |
| AGIE SAT English | agie_sat_en | | | SAT reading and writing questions with passages |
| AGIE SAT English (No Passage) | agie_sat_en_without_passage | | | SAT English questions without accompanying passages |
| DROP | drop | | | Discrete Reasoning Over Paragraphs: reading comprehension requiring numerical reasoning |
| RACE-H | race_h | | | Reading comprehension from English exams (high difficulty) |
| SQuAD | squad | | | Stanford Question Answering Dataset: extractive question answering on Wikipedia articles |
Long context
Evaluate the ability to process and reason over extended contexts, including retrieval and pattern recognition.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| InfiniteBench KV Retrieval | infinite_bench_kv_retrieval | | | Key-value retrieval in long contexts |
| InfiniteBench LongBook (English) | infinite_bench_longbook_choice_eng | | | Multiple choice questions on long books |
| InfiniteBench LongDialogue QA (English) | infinite_bench_longdialogue_qa_eng | | | Question answering over long dialogues |
| InfiniteBench Number String | infinite_bench_number_string | | | Number pattern recognition in long sequences |
| InfiniteBench Passkey | infinite_bench_passkey | | | Retrieval of information from long context |
| NIAH | niah | | | Needle in a Haystack: long-context retrieval test |
Safety
Evaluate alignment, bias detection, harmful content resistance, and truthfulness.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AgentHarm | agentharm | Yes | | Tests model resistance to harmful agent behavior and misuse scenarios |
| AgentHarm Benign | agentharm_benign | Yes | | Benign baseline for AgentHarm to measure false positive rates |
| Agentic Misalignment | agentic_misalignment | | | Evaluates potential misalignment in agentic behavior |
| AHB | ahb | | | Agent Harmful Behavior: tests resistance to harmful agentic actions |
| AIRBench | air_bench | | | Tests adversarial instruction resistance |
| BBEH | bbeh | | | BIG-Bench Extra Hard: harder successor to the BBH reasoning tasks |
| BBEH Mini | bbeh_mini | | | Smaller version of BBEH benchmark |
| BBQ | bbq | | | Bias Benchmark for Question Answering |
| BOLD | bold | | | Bias in Open-Ended Language Generation Dataset |
| CYSE3 Visual Prompt Injection | cyse3_visual_prompt_injection | | | Tests resistance to visual prompt injection attacks |
| Make Me Pay | make_me_pay | | | Tests resistance to financial scam and fraud scenarios |
| MASK | mask | Yes | Yes | Tests model’s handling of sensitive information |
| Personality BFI | personality_BFI | | | Big Five personality trait assessment |
| Personality TRAIT | personality_TRAIT | | Yes | Comprehensive personality trait evaluation |
| SOSBench | sosbench | Yes | | Safety and oversight stress test |
| StereoSet | stereoset | | | Measures stereotypical biases in language models |
| StrongREJECT | strong_reject | | | Tests model’s ability to reject harmful requests |
| Sycophancy | sycophancy | | | Evaluates tendency toward sycophantic behavior |
| TruthfulQA | truthfulqa | | | Tests model truthfulness and resistance to falsehoods |
| UCCB | uccb | | | Unsafe Content Classification Benchmark |
| WMDP Bio | wmdp_bio | | | Tests hazardous knowledge in biology |
| WMDP Chem | wmdp_chem | | | Tests hazardous knowledge in chemistry |
| WMDP Cyber | wmdp_cyber | | | Tests hazardous knowledge in cybersecurity |
| XSTest | xstest | Yes | | Exaggerated safety test for over-refusal detection |
Domain-specific
Evaluate specialized knowledge in medicine, chemistry, law, biology, and other professional fields.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| ChemBench | chembench | | | Chemistry knowledge and problem-solving benchmark |
| HealthBench | healthbench | Yes | | Healthcare and medical knowledge evaluation |
| HealthBench Consensus | healthbench_consensus | Yes | | Healthcare questions with expert consensus |
| HealthBench Hard | healthbench_hard | Yes | | Challenging healthcare scenarios |
| LabBench Cloning Scenarios | lab_bench_cloning_scenarios | | | Laboratory experiment planning and cloning |
| LabBench DBQA | lab_bench_dbqa | | | Database question answering for lab scenarios |
| LabBench FigQA | lab_bench_figqa | | | Figure interpretation in scientific contexts |
| LabBench LitQA | lab_bench_litqa | | | Literature-based question answering for research |
| LabBench ProtocolQA | lab_bench_protocolqa | | | Laboratory protocol understanding |
| LabBench SeqQA | lab_bench_seqqa | | | Biological sequence analysis questions |
| LabBench SuppQA | lab_bench_suppqa | | | Supplementary material interpretation |
| LabBench TableQA | lab_bench_tableqa | | | Table interpretation in scientific papers |
| MedQA | medqa | | | Medical licensing exam questions |
| PubMedQA | pubmedqa | | | Biomedical question answering from research abstracts |
| SEC-QA v1 | sec_qa_v1 | | | SEC filing question answering |
| SEC-QA v1 (5-shot) | sec_qa_v1_5_shot | | | SEC-QA with 5 examples |
| SEC-QA v2 | sec_qa_v2 | | | Updated SEC filing benchmark |
| SEC-QA v2 (5-shot) | sec_qa_v2_5_shot | | | SEC-QA v2 with 5 examples |
Multimodal
Evaluate vision and language understanding combining visual and textual inputs.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| DocVQA | docvqa | | | Document Visual Question Answering: questions about document images |
| MathVista | mathvista | | | Mathematical reasoning with visual contexts combining vision and math |
| MMMU Multiple Choice | mmmu_multiple_choice | | | Multimodal understanding with multiple choice format |
| MMMU Open | mmmu_open | | | Multimodal understanding with open-ended responses |
| V*Star Bench Attribute Recognition | vstar_bench_attribute_recognition | | | Visual attribute recognition tasks |
| V*Star Bench Spatial Relationship | vstar_bench_spatial_relationship_reasoning | | | Spatial reasoning with visual inputs |
Instruction following
Evaluate adherence to specific instructions and formatting requirements.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| IFEval | ifeval | | | Tests precise instruction-following capabilities |
System
Basic system validation and pre-flight checks.
| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| Pre-Flight | pre_flight | | | Basic system check and validation test |
Next steps