Introducing ArtifactsBench: Enhancing AI Creative Models

Tencent has introduced a new benchmark called ArtifactsBench, aimed at closing a gap in how creative AI models are tested: it evaluates not just whether AI-generated code works, but how good the resulting user experience is.

Challenges in Testing Creative AI Models

AI models often produce code that functions correctly but may lack the aesthetics and interactivity required for modern user experiences. This raises an important question: how can machines be taught to have good taste?

Current benchmarks validate code by checking that it runs, but they overlook the aesthetic and interactive qualities that make a user experience stand out. This is where ArtifactsBench comes in, acting as an automated art critic for AI-generated code.

How Does ArtifactsBench Work?

ArtifactsBench assigns the AI creative tasks drawn from more than 1,800 diverse challenges, ranging from building web applications and visual designs to creating small interactive games. After the AI generates the code, the benchmark builds and runs it in a sandboxed environment.

A series of screenshots is captured over time to record the application's behavior, such as state changes after button clicks or the progress of animations. All of this evidence is then handed to a multimodal model that serves as the judge.
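The capture loop described above can be sketched as follows. This is a minimal illustration, not ArtifactsBench's actual harness: the function name `evaluate_artifact`, the timestamps, and the stubbed-out screenshot step are all assumptions, and a real harness would render the page in a headless browser (e.g. via Playwright) instead of recording placeholder paths.

```python
import tempfile
import pathlib

def evaluate_artifact(html_source: str, screenshot_times=(0.0, 1.0, 3.0)):
    """Sketch of the screenshot-based evaluation loop.

    Writes the generated code into an isolated temporary directory, then
    (in a real harness) would load it in a headless browser and capture a
    screenshot at each timestamp. Rendering is stubbed out here.
    """
    workdir = pathlib.Path(tempfile.mkdtemp(prefix="artifact_"))
    page = workdir / "index.html"
    page.write_text(html_source)

    evidence = []
    for t in screenshot_times:
        # Real harness: drive a headless browser and capture the page at time t.
        evidence.append({"time": t, "screenshot": f"{page}@{t}s"})
    return evidence

evidence = evaluate_artifact("<html><body><button>Click</button></body></html>")
print(len(evidence))  # one evidence record per capture time
```

The time-series of screenshots is what lets the judge see dynamic behavior (animations, state after clicks) rather than a single static frame.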

The Role of the Multimodal Judge in Evaluation

The multimodal judge model evaluates the results against a detailed, per-task checklist covering ten criteria, including functionality, user experience, and aesthetic quality. This keeps the evaluation fair, consistent, and comprehensive.
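Checklist-based scoring of this kind can be sketched in a few lines. The criterion names and scores below are illustrative placeholders, not the benchmark's real rubric, and the simple unweighted average is an assumption about how per-criterion scores are aggregated.

```python
# Hypothetical per-task checklist: criterion -> judge's score (0-10 scale).
# These ten names are illustrative; the real rubric may differ per task.
checklist = {
    "functionality": 9,
    "robustness": 7,
    "visual_design": 8,
    "layout": 8,
    "interactivity": 6,
    "responsiveness": 7,
    "animation_quality": 5,
    "accessibility": 6,
    "code_quality": 8,
    "user_experience": 7,
}

def task_score(scores: dict) -> float:
    """Aggregate per-criterion scores into one task-level score (mean)."""
    return sum(scores.values()) / len(scores)

print(round(task_score(checklist), 1))  # → 7.1
```

Scoring each criterion separately, rather than asking the judge for one overall number, is what makes the evaluation consistent across tasks and easy to audit.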

Results indicate that the automated judge has good taste: its rankings agree with human preference platforms 94.4% of the time, a significant improvement over previous benchmarks, whose best agreement was 69.4%.
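An agreement figure like this is typically a pairwise ranking consistency: out of all pairs of models, the fraction that the judge and the humans order the same way. A minimal sketch, with toy rankings rather than the benchmark's real data:

```python
from itertools import combinations

def pairwise_agreement(rank_a: dict, rank_b: dict) -> float:
    """Fraction of model pairs ordered the same way by both rankings.

    rank_a, rank_b map model name -> rank position (1 = best).
    """
    pairs = list(combinations(rank_a, 2))
    agree = sum(
        1 for m, n in pairs
        # Same sign of rank difference means both rankings order the pair alike.
        if (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n]) > 0
    )
    return agree / len(pairs)

# Toy data: four hypothetical models ranked by humans and by the judge.
human = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
judge = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
print(round(pairwise_agreement(human, judge) * 100, 1))  # → 83.3
```

Here the judge swaps only one pair (model_b vs model_c), so 5 of 6 pairs agree; ArtifactsBench's reported 94.4% corresponds to near-identical orderings.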

Exciting Discoveries with ArtifactsBench

Tencent’s tests on more than 30 leading AI models produced a surprising result: although models specialized in coding might seem best suited to these tasks, generalist models outperformed them. The generalist Qwen-2.5-Instruct, for example, surpassed its coding- and vision-specialized counterparts.

Researchers believe this is because creating a good visual application requires multiple skills, such as sound reasoning, precise instruction following, and an inherent aesthetic sense.

Conclusion

ArtifactsBench represents a significant step towards enhancing AI’s ability to produce work that is not only functional but also appealing to users. By evaluating aesthetic and interactive qualities, the benchmark contributes to notable advances at the intersection of creativity and technology. Tencent hopes it will become a reliable foundation for assessing creativity in AI models and for driving future progress in this field.