Authors
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda
Publication date
2022/11/16
Journal
arXiv preprint arXiv:2211.09110
Description
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e., use cases) and metrics (i.e., desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g., question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g., reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw …
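As a quick sanity check of the coverage figure in the abstract (an editorial illustration, assuming the 87.5% refers to the full grid of 7 metrics crossed with 16 core scenarios), the arithmetic works out to:

\[
7 \text{ metrics} \times 16 \text{ scenarios} = 112 \text{ (metric, scenario) pairs}, \qquad 0.875 \times 112 = 98 \text{ pairs measured.}
\]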
Scholar articles
P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, et al. Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110, 2022.
R. Bommasani, P. Liang, T. Lee. Holistic Evaluation of Language Models. Annals of the New York Academy of Sciences, 2023.