OpenAI open-sources HealthBench: doctors from 60 countries create 5,000 real conversations

Crypto 7x24

2025-05-13 08:17:03


According to online reports, OpenAI has open-sourced HealthBench, an evaluation set designed specifically for large medical models. Unlike earlier test sets, all 5,000 of its core test conversations were created by 262 doctors across 26 specialties in 60 countries and regions, which makes the benchmark considerably more difficult, realistic, and varied. It also tests models through multi-turn dialogue rather than simple question answering or multiple-choice questions. According to the test data, the performance of large models in health care has improved significantly: scores rose from 16% for GPT-3.5 Turbo to 32% for GPT-4o and then to 60% for o3. Progress among small models is even more striking: GPT-4.1 nano not only outperforms GPT-4o but also costs 25 times less.

Disclaimer: The views in this article are those of the original creator and do not represent the views or position of Hawk Insight. The content is for reference, communication, and learning only and does not constitute investment advice. If there are any copyright concerns, please contact us for removal.
