Challenges and Recommendations for AI Benchmarks
A recent academic review suggests that AI benchmarks may be deeply flawed, potentially leading organizations to base critical decisions on misleading data. General-purpose benchmarks are widely used to compare model capabilities, but the study indicates that the trust placed in them may be misplaced.
Challenges in AI Benchmarks
The comprehensive study, titled "Measuring What Matters: Construct Validity in Large Language Model Benchmarks," found that many benchmarks suffer from weaknesses in one or more areas, undermining the scientific credibility of claims about model performance. This is a particular challenge for technology leaders who depend on these benchmarks for investment and strategic decisions.
The study shows that key concepts in AI evaluation are often "poorly defined or weakly applied," resulting in scientific claims with inadequate support, misdirected research, and policy applications that lack a solid evidentiary basis.
Where AI Benchmarks Fail Organizations
The review identified systemic flaws at every stage of benchmark design and results reporting. Vague or contested concepts are a major obstacle: 47.8% of the definitions examined were found to be unclear or disputed. For example, score differences on a "harmlessness" benchmark may reflect differing definitions of harm rather than a real difference in model safety.
The study also highlighted a lack of statistical rigor: only a small fraction of benchmarks report uncertainty estimates or use statistical tests when comparing model results, making it hard to tell whether small score differences between models are meaningful.
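To make that concrete, here is a minimal sketch (not from the study) of a paired bootstrap over per-item scores, one common way to attach an uncertainty estimate to a score difference. The data is invented for illustration:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Resample benchmark items with replacement to get a confidence
    interval on the score difference between two models.

    scores_a, scores_b: per-item scores (e.g. 0/1 correctness) for two
    models evaluated on the same benchmark items.
    """
    assert len(scores_a) == len(scores_b), "scores must be paired per item"
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    point = sum(a - b for a, b in zip(scores_a, scores_b)) / n
    return point, (diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)])

# Invented example: a 2-point gap on 200 items whose confidence
# interval straddles zero, i.e. the "better" model may not be better.
both_right, only_a, only_b, both_wrong = 100, 20, 16, 64
model_a = [1] * both_right + [1] * only_a + [0] * only_b + [0] * both_wrong  # 60.0%
model_b = [1] * both_right + [0] * only_a + [1] * only_b + [0] * both_wrong  # 58.0%
point, (low, high) = paired_bootstrap(model_a, model_b)
print(f"diff = {point:+.3f}, 95% CI = [{low:+.3f}, {high:+.3f}]")
```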
Impacts on Organizational Decisions
The study notes that organizations rely heavily on aggregate model scores when making decisions, yet the review makes clear that these scores may not reflect how a model performs in the real world. High scores on some benchmarks, for instance, may reflect a model's ability to memorize test items rather than its capacity for complex reasoning.
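One crude heuristic for spotting possible memorization, sketched below rather than taken from the study, is to flag evaluation items that overlap verbatim with training text; `corpus_chunks` is a hypothetical collection of training-text snippets:

```python
def looks_memorized(item_text: str, corpus_chunks: list[str], n: int = 13) -> bool:
    """Crude contamination check: flag an evaluation item if any of its
    n-word sequences appears verbatim in the training text. A long exact
    match suggests a high score may come from memorization, not reasoning.
    """
    words = item_text.split()
    ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return any(gram in chunk for chunk in corpus_chunks for gram in ngrams)
```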
The study also warned against non-representative datasets: 27% of the benchmarks reviewed relied on "convenience samples," which can mask model weaknesses in real-world situations.
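Stratified sampling is one simple alternative to a convenience sample. This sketch, with hypothetical scenario labels and data, draws evenly from each scenario so the easiest-to-collect category cannot dominate the evaluation set:

```python
import random
from collections import defaultdict

def stratified_sample(items, scenario_of, per_scenario, seed=0):
    """Draw an equal number of items from every scenario, instead of
    letting whatever data was easiest to gather set the distribution."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for item in items:
        buckets[scenario_of(item)].append(item)
    sample = []
    for bucket in buckets.values():
        sample.extend(rng.sample(bucket, min(per_scenario, len(bucket))))
    return sample

# Hypothetical usage: balance customer-support prompts across scenarios
# rather than, say, 90% password resets and 10% everything else.
prompts = [("refund", "I was charged twice..."), ("refund", "What is the return window?"),
           ("compliance", "Can you share my data?"), ("outage", "The site is down")]
eval_set = stratified_sample(prompts, scenario_of=lambda p: p[0], per_scenario=1)
```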
Toward Internal Evaluation and Custom Benchmarks
The study recommends that organizations not rely solely on general benchmarks but instead develop their own, tailored to the nature of their business. This means precisely defining the phenomenon to be measured, building a dataset that represents real-world challenges and scenarios, conducting error analysis to understand why failures occur, and justifying the validity of the benchmarks used.
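As a rough illustration of the last two steps, the following sketch scores a model per scenario and logs every failure for later error analysis; `model_fn` and the item format are hypothetical stand-ins, not part of the study:

```python
from collections import Counter

def run_eval(model_fn, items):
    """Score a model per scenario and keep every failure for error
    analysis, instead of collapsing results into one headline number.

    items: (scenario, prompt, expected) triples; model_fn is a
    hypothetical callable wrapping the model under evaluation.
    """
    totals, correct, failures = Counter(), Counter(), []
    for scenario, prompt, expected in items:
        totals[scenario] += 1
        answer = model_fn(prompt)
        if answer.strip() == expected:
            correct[scenario] += 1
        else:
            failures.append((scenario, prompt, expected, answer))
    per_scenario = {s: correct[s] / totals[s] for s in totals}
    return per_scenario, failures
```

Per-scenario scores surface weak spots that a single aggregate score would hide, and the failure log is the raw material for the error analysis the study calls for.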
The study emphasizes that progress in AI evaluation depends on collaboration among governments, academia, and industry, with open dialogue and shared standards to build trust in intelligent systems.
Conclusion
The study argues that the only reliable path to progress in AI is to stop leaning on general-purpose benchmarks and start "measuring what matters" to our own organizations. It calls for developing internal benchmarks specific to each organization, ensuring they truly reflect practical value in the real world.