A sweeping new global exam exposes how far even the most advanced AI systems remain from matching deep human expertise, reshaping how researchers measure machine intelligence and its limits.
As artificial intelligence systems rapidly outgrow traditional academic benchmarks, researchers have unveiled an ambitious new test designed to probe the true limits of machine intelligence.
When advanced artificial intelligence systems began scoring near-perfect marks on established academic tests, researchers recognized a growing concern. The exams that once posed serious challenges were no longer difficult enough to meaningfully evaluate cutting-edge AI. Well-known benchmarks such as the Massive Multitask Language Understanding (MMLU) exam, previously viewed as rigorous, have become less effective at distinguishing true progress in AI capability.
In response, an international group of nearly 1,000 researchers, including a professor from Texas A&M University, developed a far more demanding assessment. Their goal was to design an exam so comprehensive and grounded in specialized human expertise that today’s AI systems would struggle to pass it.
The result is “Humanity’s Last Exam” (HLE), a 2,500-question test that covers mathematics, the humanities, natural sciences, ancient languages, and highly specialized academic fields. The project is described in a paper published in Nature, and additional details are available at lastexam.ai.
One of the contributors is Dr. Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M. He helped write and refine questions for the assessment.
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human‑level understanding,” Nguyen said. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context, and specialized expertise.”
The point wasn’t to stump humans. It was to reveal, precisely and systematically, what AI cannot do, at least not yet.
A global effort to measure AI’s limits
Specialists from around the world drafted and reviewed the HLE questions. Each item was required to have one clear, verifiable answer and to resist being solved through quick online searches. The material reflects advanced scholarship, ranging from translating ancient Palmyrene inscriptions to identifying tiny anatomical structures in birds and examining the detailed sound patterns of Biblical Hebrew.
Before being included, every question was tested on leading AI systems. If a model produced the correct answer, that question was eliminated. This process ensured the final exam would remain just beyond the reach of current AI performance.
The results show how difficult the assessment is. Early testing found that even top models struggled. GPT-4o scored 2.7%. Claude 3.5 Sonnet achieved 4.1%. OpenAI’s o1 model reached 8%. More recent systems, including Gemini 3.1 Pro and Claude Opus 4.6, have improved to roughly 40-50% accuracy, but they still do not demonstrate full mastery.
Why a new benchmark matters
According to Nguyen, the fact that AI has surpassed older benchmarks carries real-world consequences. He contributed 73 of the exam’s 2,500 public questions, the second-highest total of any author, and wrote more questions in math and computer science than any other contributor.
“Without accurate assessment tools, policymakers, developers, and users risk misinterpreting what AI systems can actually do,” he said. “Benchmarks provide the foundation for measuring progress and identifying risks.”
As explained in the team’s paper, high scores on human-designed exams do not automatically indicate genuine intelligence. Such tests measure performance on tasks originally created for people, not machines. Strong results may reflect pattern matching rather than deep understanding.
Not a threat, a tool
Despite its apocalyptic name, Humanity’s Last Exam isn’t meant to suggest the end of human relevance. Instead, it highlights how much knowledge remains uniquely human and how far AI systems still have to go.
“This isn’t a race against AI,” Nguyen said. “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”
A future-proof exam
HLE is intended to serve as a long‑term, transparent benchmark for evaluating advanced AI systems. As part of that mission, the team has made some of the exam publicly available, while keeping most of the test questions hidden so AI models can’t memorize the answers.
“For now, Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence,” Nguyen said, “and despite rapid technological advances, it remains wide.”
Research on a grand scale
Nguyen noted that the massive project reflects the importance of interdisciplinary, international research efforts.
“What made this project extraordinary was the scale,” he said. “Experts from nearly every discipline contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today’s AI systems — and, perhaps ironically, it took humans working together to find them.”
Reference: “A benchmark of expert-level academic questions to assess AI capabilities” by Center for AI Safety, Scale AI and HLE Contributors Consortium, 28 January 2026, Nature.
DOI: 10.1038/s41586-025-09962-4
Funding: The Center for AI Safety and Scale AI consortia