H2O Eval Studio Test Suite Library

This is the H2O Eval Studio test suite library for LLM, RAG, and agent evaluation.

Test suites can be used to evaluate question answering, privacy, fairness, security, summarization, and classification. In addition, test suites can be combined, sampled, perturbed, and customized for specific evaluation needs.
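As a rough illustration of what combining and sampling mean here, the sketch below merges two small suites and draws a reproducible sample from the result. The dictionary shape (`name`, `test_cases`, `prompt`, `expected`) and the helper names `combine` and `sample` are assumptions made for this example only; they are not the H2O Eval Studio JSON schema or API.

```python
import random


def combine(name, *suites):
    """Concatenate the test cases of several suites into a new suite."""
    # Hypothetical suite shape: {"name": str, "test_cases": list of dicts}.
    cases = [case for suite in suites for case in suite["test_cases"]]
    return {"name": name, "test_cases": cases}


def sample(suite, k, seed=0):
    """Return a new suite with up to k test cases drawn without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = min(k, len(suite["test_cases"]))
    return {
        "name": f"{suite['name']} (sample {k})",
        "test_cases": rng.sample(suite["test_cases"], k),
    }


# Two toy suites standing in for real ones (illustrative data only).
qa = {
    "name": "qa",
    "test_cases": [{"prompt": f"Q{i}", "expected": f"A{i}"} for i in range(5)],
}
privacy = {
    "name": "privacy",
    "test_cases": [{"prompt": f"P{i}", "expected": "refuse"} for i in range(3)],
}

merged = combine("qa+privacy", qa, privacy)
subset = sample(merged, k=4, seed=42)
print(len(merged["test_cases"]), len(subset["test_cases"]))
```

Seeding the sampler is a deliberate choice: an evaluation run on a sampled suite can then be repeated on exactly the same test cases.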

Test suites are provided normalized to the H2O Eval Studio JSON format. The following table lists each suite, what it evaluates, its purposes, and its number of tests and test cases:

| Test Suite (268) | Evaluates | Purposes | Tests (464) | Test Cases (1,121,176) |
|---|---|---|---|---|
| advglue (721) | LLM | Q&A | 1 | 721 |
| Alan Greenspan Globalization (5) | LLM, RAG | Q&A | 1 | 5 |
| Analogical Similarity (323) | LLM | Q&A | 1 | 323 |
| Annual Report Singtel (160) | RAG | Q&A | 1 | 160 |
| Annual Report Singtel (multi choice) (74) | RAG | Q&A | 1 | 74 |
| Annual Report Singtel (question type) (153) | RAG | Q&A | 1 | 153 |
| ARC-Easy (2590) | LLM | Q&A | 1 | 2590 |
| ARC-Easy (5197) | LLM | Q&A | 1 | 5197 |
| Bank Teller (6) | LLM | Q&A | 1 | 6 |
| Banking Act (148) | RAG | Q&A | 1 | 148 |
| Banking Act (multi choice) (69) | RAG | Q&A | 1 | 69 |
| BBQ-lite on age - Ambiguous Questions (1840) | LLM | Q&A | 1 | 1840 |
| BBQ-lite on age - Disambiguated Questions (1840) | LLM | Q&A | 1 | 1840 |
| BBQ-lite on disability-status - Ambiguous Questions (778) | LLM | Q&A | 1 | 778 |
| BBQ-lite on disability-status - Disambiguated Questions (778) | LLM | Q&A | 1 | 778 |
| BBQ-lite on gender - Ambiguous Questions (2836) | LLM | Q&A | 1 | 2836 |
| BBQ-lite on gender - Disambiguated Questions (2836) | LLM | Q&A | 1 | 2836 |
| BBQ-lite on nationality - Ambiguous Questions (1540) | LLM | Q&A | 1 | 1540 |
| BBQ-lite on nationality - Disambiguated Questions (1540) | LLM | Q&A | 1 | 1540 |
| BBQ-lite on physical-appearance - Ambiguous Questions (788) | LLM | Q&A | 1 | 788 |
| BBQ-lite on physical-appearance - Disambiguated Questions (788) | LLM | Q&A | 1 | 788 |
| BBQ-lite on race-ethnicity - Ambiguous Questions (3440) | LLM | Q&A | 1 | 3440 |
| BBQ-lite on race-ethnicity - Disambiguated Questions (3440) | LLM | Q&A | 1 | 3440 |
| BBQ-lite on race-x-gender - Ambiguous Questions (7980) | LLM | Q&A | 1 | 7980 |
| BBQ-lite on race-x-gender - Disambiguated Questions (7980) | LLM | Q&A | 1 | 7980 |
| BBQ-lite on race-x-ses - Ambiguous Questions (5580) | LLM | Q&A | 1 | 5580 |
| BBQ-lite on race-x-ses - Disambiguated Questions (5580) | LLM | Q&A | 1 | 5580 |
| BBQ-lite on religion - Ambiguous Questions (600) | LLM | Q&A | 1 | 600 |
| BBQ-lite on religion - Disambiguated Questions (600) | LLM | Q&A | 1 | 600 |
| BBQ-lite on ses - Ambiguous Questions (3432) | LLM | Q&A | 1 | 3432 |
| BBQ-lite on ses - Disambiguated Questions (3432) | LLM | Q&A | 1 | 3432 |
| BBQ-lite on sexual-orientation - Ambiguous Questions (432) | LLM | Q&A | 1 | 432 |
| BBQ-lite on sexual-orientation - Disambiguated Questions (432) | LLM | Q&A | 1 | 432 |
| Ben Bernanke Economic Outlook (5) | RAG, LLM | Q&A | 1 | 5 |
| Broker Agreement (59) | RAG | Q&A | 1 | 59 |
| Broker Agreement (multi choice) (19) | RAG | Q&A | 1 | 19 |
| Broker Agreement (question type) (58) | RAG | Q&A | 1 | 58 |
| cause_and_effect (102) | LLM | Q&A | 1 | 102 |
| CBA AnnualReport 2023 (33) | RAG | Q&A | 1 | 33 |
| CBA AnnualReport 2023 (multi choice) (106) | RAG | Q&A | 1 | 106 |
| ChallengingToxicityPrompts (1000) | LLM | Q&A | 1 | 1000 |
| ChallengingToxicityPrompts (2392) | LLM | Q&A | 1 | 2392 |
| ChallengingToxicityPrompts (1000) | LLM | Q&A | 1 | 1000 |
| Chinese Linguistics & Cognition Challenge (CLCC) (760) | LLM | Q&A | 1 | 760 |
| Chinese Version - Bias Benchmark for QA (1392) | LLM | Q&A | 1 | 1392 |
| Chinese Version - Bias Benchmark for QA (980) | LLM | Q&A | 1 | 980 |
| Chinese Version - Bias Benchmark for QA (586) | LLM | Q&A | 1 | 586 |
| Chinese Version - Bias Benchmark for QA (11988) | LLM | Q&A | 1 | 11988 |
| Chinese Version - Bias Benchmark for QA (1534) | LLM | Q&A | 1 | 1534 |
| Chinese Version - Bias Benchmark for QA (13528) | LLM | Q&A | 1 | 13528 |
| Chinese Version - Bias Benchmark for QA (2176) | LLM | Q&A | 1 | 2176 |
| Chinese Version - Bias Benchmark for QA (8700) | LLM | Q&A | 1 | 8700 |
| Chinese Version - Bias Benchmark for QA (2176) | LLM | Q&A | 1 | 2176 |
| Chinese Version - Bias Benchmark for QA (1534) | LLM | Q&A | 1 | 1534 |
| Chinese Version - Bias Benchmark for QA (8700) | LLM | Q&A | 1 | 8700 |
| Chinese Version - Bias Benchmark for QA (7400) | LLM | Q&A | 1 | 7400 |
| Chinese Version - Bias Benchmark for QA (3960) | LLM | Q&A | 1 | 3960 |
| Chinese Version - Bias Benchmark for QA (13528) | LLM | Q&A | 1 | 13528 |
| Chinese Version - Bias Benchmark for QA (3960) | LLM | Q&A | 1 | 3960 |
| Chinese Version - Bias Benchmark for QA (1856) | LLM | Q&A | 1 | 1856 |
| Chinese Version - Bias Benchmark for QA (560) | LLM | Q&A | 1 | 560 |
| Chinese Version - Bias Benchmark for QA (1856) | LLM | Q&A | 1 | 1856 |
| Chinese Version - Bias Benchmark for QA (1588) | LLM | Q&A | 1 | 1588 |
| Chinese Version - Bias Benchmark for QA (586) | LLM | Q&A | 1 | 586 |
| Chinese Version - Bias Benchmark for QA (13528) | LLM | Q&A | 1 | 13528 |
| Chinese Version - Bias Benchmark for QA (11988) | LLM | Q&A | 1 | 11988 |
| Chinese Version - Bias Benchmark for QA (1588) | LLM | Q&A | 1 | 1588 |
| Chinese Version - Bias Benchmark for QA (7400) | LLM | Q&A | 1 | 7400 |
| Chinese Version - Bias Benchmark for QA (980) | LLM | Q&A | 1 | 980 |
| Chinese Version - Bias Benchmark for QA (1392) | LLM | Q&A | 1 | 1392 |
| Chinese Version - Bias Benchmark for QA (13528) | LLM | Q&A | 1 | 13528 |
| Chinese Version - Bias Benchmark for QA (560) | LLM | Q&A | 1 | 560 |
| Constitution of the Republic of Singapore (29) | RAG | Q&A | 1 | 29 |
| Constitution of the Republic of Singapore (multi choice) (54) | RAG | Q&A | 1 | 54 |
| Contextual Parametric Knowledge Conflicts (17528) | LLM | Q&A | 1 | 17528 |
| Cyber Security Policy (131) | RAG | Q&A | 1 | 131 |
| Cyber Security Policy (multi choice) (80) | RAG | Q&A | 1 | 80 |
| Cyber Security Policy (question type) (125) | RAG | Q&A | 1 | 125 |
| CyberSecEval Prompt Injection (251) | LLM | Q&A | 1 | 251 |
| Defense Management (136) | RAG | Q&A | 1 | 136 |
| Defense Management (multi choice) (74) | RAG | Q&A | 1 | 74 |
| Defense Management (question type) (137) | RAG | Q&A | 1 | 137 |
| Digital Health Guidelines (141) | RAG | Q&A | 1 | 141 |
| Digital Health Guidelines (multi choice) (64) | RAG | Q&A | 1 | 64 |
| Digital Health Guidelines (question type) (127) | RAG | Q&A | 1 | 127 |
| Employment Contract (67) | RAG | Q&A | 1 | 67 |
| Employment Contract (multi choice) (40) | RAG | Q&A | 1 | 40 |
| Employment Contract (question type) (60) | RAG | Q&A | 1 | 60 |
| enronemail (166418) | LLM | Q&A | 1 | 166418 |
| Ethics-Commonsense-Hard (1000) | LLM | Q&A | 1 | 1000 |
| Ethics-Commonsense-Hard (1000) | LLM | Q&A | 1 | 1000 |
| Ethics-Commonsense-Hard (1000) | LLM | Q&A | 1 | 1000 |
| Ethics-Commonsense-Hard (1000) | LLM | Q&A | 1 | 1000 |
| EU AI Act (26) | RAG | Q&A | 1 | 26 |
| EU AI Act (multi choice) (48) | RAG | Q&A | 1 | 48 |
| EU AI Act (question type) (144) | RAG | Q&A | 1 | 144 |
| Facts about Asia pacific in True and False in 4 languages (Chinese, Malay, Tamil and English) (22) | LLM | Q&A | 1 | 22 |
| Facts about Singapore in True and False (50) | LLM | Q&A | 1 | 50 |
| FBIAgentGPT (20) | LLM, RAG | Q&A | 1 | 20 |
| Financial Records Management (138) | RAG | Q&A | 1 | 138 |
| Financial Records Management (multi choice) (89) | RAG | Q&A | 1 | 89 |
| Financial Statements Alphabet Tesla (77) | RAG | Q&A | 1 | 77 |
| Financial Statements Alphabet Tesla (multi choice) (79) | RAG | Q&A | 1 | 79 |
| Food in Singapore (100) | LLM | Q&A | 1 | 100 |
| Frank Summarization (small) (7) | LLM | summarization | 1 | 7 |
| Frank Summarization (small) (499) | LLM | summarization | 1 | 499 |
| GAIA (203) | LLM, RAG, agent | Q&A | 39 | 203 |
| GAIA (tasks w/o documents) (127) | RAG, LLM, agent | Q&A | 1 | 127 |
| Gender Occupational Bias (13) | LLM | Q&A | 1 | 13 |
| Gender Occupational Bias (13) | LLM | Q&A | 1 | 13 |
| gre_reading_comprehension (32) | LLM | Q&A | 1 | 32 |
| GSM8K (8792) | LLM | Q&A | 1 | 8792 |
| H2O.ai Eval GPT (60) | LLM | Q&A | 1 | 60 |
| h2oGPTe Benchmark (2023-11-15) (122) | RAG, LLM | Q&A | 34 | 122 |
| h2oGPTe Benchmark (2024-08-26) (154) | RAG | Q&A | 63 | 154 |
| h2oGPTe Benchmark (2024-10-01) (155) | RAG | Q&A | 64 | 155 |
| Health Service Standards (34) | RAG | Q&A | 1 | 34 |
| Health Service Standards (multi choice) (71) | RAG | Q&A | 1 | 71 |
| Health Service Standards (question type) (82) | RAG | Q&A | 1 | 82 |
| HellaSwag (49947) | LLM | Q&A | 1 | 49947 |
| Home Affairs (109) | RAG | Q&A | 1 | 109 |
| Home Affairs (multi choice) (61) | RAG | Q&A | 1 | 61 |
| Home Affairs (question type) (115) | RAG | Q&A | 1 | 115 |
| HR Policy (149) | RAG | Q&A | 1 | 149 |
| HR Policy (multi choice) (86) | RAG | Q&A | 1 | 86 |
| HR Policy (question type) (122) | RAG | Q&A | 1 | 122 |
| HR Policy Procedures (multi choice) (118) | RAG | Q&A | 1 | 118 |
| HR Policy Procedures (question type) (148) | RAG | Q&A | 1 | 148 |
| HSBC Annual Report (126) | RAG | Q&A | 1 | 126 |
| HSBC Annual Report (multi choice) (84) | RAG | Q&A | 1 | 84 |
| HSBC Annual Report (question type) (121) | RAG | Q&A | 1 | 121 |
| Iconic Places in Singapore (16) | LLM | Q&A | 1 | 16 |
| IMF Financial Statements (130) | RAG | Q&A | 1 | 130 |
| IMF Financial Statements (multi choice) (68) | RAG | Q&A | 1 | 68 |
| Immigration in Singapore (128) | RAG | Q&A | 1 | 128 |
| Immigration in Singapore (multi choice) (67) | RAG | Q&A | 1 | 67 |
| Immigration in Singapore (question type) (122) | RAG | Q&A | 1 | 122 |
| In-the-wild Jailbreak Dataset (22) | LLM | Q&A | 1 | 22 |
| Inappropriate Image Prompts (I2P) (4703) | LLM | Q&A | 1 | 4703 |
| Information Security (136) | RAG | Q&A | 1 | 136 |
| Information Security (multi choice) (74) | RAG | Q&A | 1 | 74 |
| Information Security (question type) (125) | RAG | Q&A | 1 | 125 |
| Information Security Policy (156) | RAG | Q&A | 1 | 156 |
| Information Security Policy (multi choice) (67) | RAG | Q&A | 1 | 67 |
| Information Security Policy (question type) (146) | RAG | Q&A | 1 | 146 |
| Inherent Risk Assessment (122) | RAG | Q&A | 1 | 122 |
| Inherent Risk Assessment (multi choice) (59) | RAG | Q&A | 1 | 59 |
| Inherent Risk Assessment (question type) (116) | RAG | Q&A | 1 | 116 |
| IRS Document 1 (41) | RAG | Q&A | 1 | 41 |
| IRS Document 1 (multi choice) (62) | RAG | Q&A | 1 | 62 |
| IRS Document 1 (question type) (146) | RAG | Q&A | 1 | 146 |
| IRS Document 2 (31) | RAG | Q&A | 1 | 31 |
| IRS Document 2 (multi choice) (47) | RAG | Q&A | 1 | 47 |
| IRS Document 2 (question type) (127) | RAG | Q&A | 1 | 127 |
| IRS Strategic Operating Plan (134) | RAG | Q&A | 1 | 134 |
| IRS Strategic Operating Plan (multi choice) (45) | RAG | Q&A | 1 | 45 |
| IRS Strategic Operating Plan (question type) (128) | RAG | Q&A | 1 | 128 |
| Kaggle: LLM Science Exam competition (200) | RAG, LLM | Q&A | 1 | 200 |
| Kaggle: LLM Science Exam competition (small) (25) | RAG, LLM | Q&A | 1 | 25 |
| List of Occupations (7270) | LLM | Q&A | 1 | 7270 |
| m3exam-indonesian (1580) | LLM | Q&A | 1 | 1580 |
| m3exam-thai (2126) | LLM | Q&A | 1 | 2126 |
| m3exam-vietnamese (1745) | LLM | Q&A | 1 | 1745 |
| Maternal Child Healthcare (156) | RAG | Q&A | 1 | 156 |
| Maternal Child Healthcare (multi choice) (112) | RAG | Q&A | 1 | 112 |
| Maternal Child Healthcare (question type) (147) | RAG | Q&A | 1 | 147 |
| MedMCQA (182822) | LLM | Q&A | 1 | 182822 |
| MedQA (US) (10178) | LLM | Q&A | 1 | 10178 |
| MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives (720) | LLM | Q&A | 1 | 720 |
| MLC Baseline - Chemical, Biological, Radiological, Nuclear, and Explosives (1570) | LLM | Q&A | 1 | 1570 |
| MLC Baseline - Child Abuse and Exploitation (720) | LLM | Q&A | 1 | 720 |
| MLC Baseline - Child Abuse and Exploitation (1330) | LLM | Q&A | 1 | 1330 |
| MLC Baseline - Hate (26530) | LLM | Q&A | 1 | 26530 |
| MLC Baseline - Hate (720) | LLM | Q&A | 1 | 720 |
| MLC Baseline - Non-Violent Crimes (1810) | LLM | Q&A | 1 | 1810 |
| MLC Baseline - Non-Violent Crimes (720) | LLM | Q&A | 1 | 720 |
| MLC Baseline - Sex-Related Crimes (1330) | LLM | Q&A | 1 | 1330 |
| MLC Baseline - Sex-Related Crimes (720) | LLM | Q&A | 1 | 720 |
| MLC Baseline - Suicide & Self-Harm (720) | LLM | Q&A | 1 | 720 |
| MLC Baseline - Suicide & Self-Harm (1090) | LLM | Q&A | 1 | 1090 |
| MLC Baseline - Violent Crimes (720) | LLM | Q&A | 1 | 720 |
| MLC Baseline - Violent Crimes (4390) | LLM | Q&A | 1 | 4390 |
| MMLU (17487) | LLM | Q&A | 1 | 17487 |
| MMLU (Arabic) (12928) | LLM | Q&A | 1 | 12928 |
| MMLU (Arabic) 50 prompts (50) | LLM | Q&A | 1 | 50 |
| MMLU Anatomy (135) | LLM | Q&A | 1 | 135 |
| MMLU Clinical Knowledge (265) | LLM | Q&A | 1 | 265 |
| MMLU College Biology (144) | LLM | Q&A | 1 | 144 |
| MMLU College Medicine (173) | LLM | Q&A | 1 | 173 |
| MMLU Medical Genetics (100) | LLM | Q&A | 1 | 100 |
| MMLU Professional Medicine (272) | LLM | Q&A | 1 | 272 |
| NIST AI 600-1 (528) | RAG, LLM | Q&A | 1 | 528 |
| NIST AI 600-1 (small) (30) | LLM, RAG | Q&A | 1 | 30 |
| OIG ARRA (5) | LLM, RAG | Q&A | 1 | 5 |
| Personally Identifiable Information leakage (8) | LLM | privacy, security | 1 | 8 |
| Places in Singapore (50) | LLM | Q&A | 1 | 50 |
| Policy Document (141) | RAG | Q&A | 1 | 141 |
| Policy Document (multi choice) (110) | RAG | Q&A | 1 | 110 |
| Product Disclosure Statement (130) | RAG | Q&A | 1 | 130 |
| Product Disclosure Statement (multi choice) (71) | RAG | Q&A | 1 | 71 |
| PubMedQA (1000) | LLM | Q&A | 1 | 1000 |
| Question type Financial Statements Alphabet Tesla (365) | RAG | Q&A | 1 | 365 |
| RealtimeQA (50) | LLM | Q&A | 1 | 50 |
| RealToxicityPrompts (198884) | LLM | Q&A | 1 | 198884 |
| Red Teaming (Giskard AI) (9) | LLM, RAG | security | 1 | 9 |
| Risk Management Guidelines (140) | RAG | Q&A | 1 | 140 |
| Risk Management Guidelines (multi choice) (91) | RAG | Q&A | 1 | 91 |
| Risk Management Guidelines (question type) (123) | RAG | Q&A | 1 | 123 |
| Risk Management Policy (37) | RAG | Q&A | 1 | 37 |
| risk management policy (multi choice) (62) | RAG | Q&A | 1 | 62 |
| risk management policy (question type) (96) | RAG | Q&A | 1 | 96 |
| Risk Management Techniques Tool (133) | RAG | Q&A | 1 | 133 |
| Risk Management Techniques Tool (multi choice) (93) | RAG | Q&A | 1 | 93 |
| Risk Management Techniques Tool (question type) (120) | RAG | Q&A | 1 | 120 |
| SA Home Affairs (148) | RAG | Q&A | 1 | 148 |
| SA Home Affairs (multi choice) (74) | RAG | Q&A | 1 | 74 |
| SA Home Affairs (question type) (140) | RAG | Q&A | 1 | 140 |
| Safety Benchmark (Singapore Context) (59) | LLM | Q&A | 1 | 59 |
| Samsum Summarization (46) | LLM | summarization | 1 | 46 |
| Sensitive data leakage (8) | LLM, RAG | privacy | 1 | 8 |
| sg-legal-glossary (425) | LLM | Q&A | 1 | 425 |
| sg-university-tutorial-questions-legal (32) | LLM | Q&A | 1 | 32 |
| Singapore Cyber Landscape (110) | RAG | Q&A | 1 | 110 |
| Singapore Cyber Landscape (multi choice) (64) | RAG | Q&A | 1 | 64 |
| Singapore Cyber Landscape (question type) (104) | RAG | Q&A | 1 | 104 |
| Singapore Labour Force (120) | RAG | Q&A | 1 | 120 |
| Singapore Labour Force (multi choice) (61) | RAG | Q&A | 1 | 61 |
| Singapore Labour Force (question type) (113) | RAG | Q&A | 1 | 113 |
| Singapore Political History (21) | LLM | Q&A | 1 | 21 |
| Singapore Transport System (12) | LLM | Q&A | 1 | 12 |
| Singapore Transport System (27) | LLM | Q&A | 1 | 27 |
| squad-shifts-tnf (48201) | LLM | Q&A | 1 | 48201 |
| SR 11-7 (171) | RAG, LLM | Q&A | 1 | 171 |
| SR 11-7 (small) (7) | RAG, LLM | Q&A | 1 | 7 |
| Stanford Healthcare Regulations (122) | RAG | Q&A | 1 | 122 |
| Stanford Healthcare Regulations (multi choice) (67) | RAG | Q&A | 1 | 67 |
| Stanford Healthcare Regulations (question type) (114) | RAG | Q&A | 1 | 114 |
| Summeval Summarization (100) | LLM | summarization | 1 | 100 |
| tamil-news-classification (3631) | LLM | Q&A | 1 | 3631 |
| tamil-thirukural (266) | LLM | Q&A | 1 | 266 |
| tanglish-tweets-SA (1163) | LLM | Q&A | 1 | 1163 |
| Technical Report (26) | RAG | Q&A | 1 | 26 |
| Technical Report (multi choice) (50) | RAG | Q&A | 1 | 50 |
| Technical Report (question type) (82) | RAG | Q&A | 1 | 82 |
| Telcom Customer Service Information (141) | RAG | Q&A | 1 | 141 |
| Telcom Customer Service Information (multi choice) (69) | RAG | Q&A | 1 | 69 |
| Telcom Customer Service Information (question type) (133) | RAG | Q&A | 1 | 133 |
| Telecom Infrastructure Planning (154) | RAG | Q&A | 1 | 154 |
| Telecom Infrastructure Planning (multi choice) (98) | RAG | Q&A | 1 | 98 |
| Telecom Infrastructure Planning (question type) (153) | RAG | Q&A | 1 | 153 |
| Telecommunication Regulations (117) | RAG | Q&A | 1 | 117 |
| Telecommunication Regulations (multi choice) (74) | RAG | Q&A | 1 | 74 |
| Telecommunication Regulations (question type) (114) | RAG | Q&A | 1 | 114 |
| Telecommunications regulation strategy policy (120) | RAG | Q&A | 1 | 120 |
| Telecommunications regulation strategy policy (multi choice) (75) | RAG | Q&A | 1 | 75 |
| Telecommunications regulation strategy policy (question type) (116) | RAG | Q&A | 1 | 116 |
| truthfulqa (817) | LLM | Q&A | 1 | 817 |
| TruthfulQA (MCQ Version) (483) | LLM | Q&A | 1 | 483 |
| uciadult (32561) | LLM | Q&A | 1 | 32561 |
| uciadult (32561) | LLM | Q&A | 1 | 32561 |
| UPC Agreement (153) | RAG | Q&A | 1 | 153 |
| UPC Agreement (multi choice) (66) | RAG | Q&A | 1 | 66 |
| UPC Agreement (question type) (141) | RAG | Q&A | 1 | 141 |
| US Veterans Affairs (127) | RAG | Q&A | 1 | 127 |
| US Veterans Affairs (multi choice) (29) | RAG | Q&A | 1 | 29 |
| US Veterans Affairs (question type) (121) | RAG | Q&A | 1 | 121 |
| winobias-variation1 (396) | LLM | Q&A | 1 | 396 |
| Winogrande (41665) | LLM | Q&A | 1 | 41665 |

Test Suites