This is a Plain English Papers summary of a research paper called AI Math vs. Reality: Category Theory Exposes the Gap. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.
The Gap Between AI Math Claims and Reality
Recent claims by AI companies tout impressive mathematical capabilities. OpenAI's o3-mini reportedly solved 80% of problems on the American Invitational Mathematics Examination 2024, while xAI's Grok-3 claims similar excellence in math and physics. These assertions clash dramatically with academic research suggesting AI systems can solve only 2% of mathematical research problems.
This stark disparity motivated mathematician Răzvan Diaconescu to conduct a focused experiment testing how leading AI systems perform on category theory problems. The goal was twofold: understand how AI might assist working mathematicians and identify improvement opportunities for AI developers.
The experiment, detailed in this research paper, reveals the significant gap between marketing claims and actual mathematical capabilities of current AI systems. This gap connects to broader discussions about AI's inherent mathematical difficulties that limit even the most advanced models.
Methodology: Selecting a Suitable Mathematical Problem
The researcher designed a methodical approach with three components:
- Choice of mathematical area: Category theory was selected as ideal because it is well established, with standardized concepts and an extensive literature that AI systems should be able to access easily.
- Selection of an appropriate problem: The researcher sought a problem that was straightforward yet not trivial, and clear to formulate but not commonly found in textbooks. This led to a problem involving inclusion systems - a concept with a sufficient literature presence but not mainstream enough to be a standard exercise.
- Analysis framework: The evaluation examined three aspects of AI performance:
  - Data gathering capabilities
  - Mathematical language quality
  - Reasoning quality
Some might argue that selecting a less mainstream topic deliberately hinders AI systems. However, this reflects the reality of mathematical research, where professionals tackle problems that haven't been extensively studied rather than standard exercises.
This experimental design connects to broader research on bridging mathematics and AI creativity, exploring the limits of machine capabilities in mathematical domains.
The Problem and Its Solution: Inclusion Systems and Pullbacks
Understanding Inclusion Systems
Inclusion systems were introduced in theoretical computer science but have since found applications in both computer science and logic, particularly model theory. The concept provides an abstract axiomatic approach to substructures and quotient structures across various categories.
Definition: A pair of categories (ℐ,ℰ) is an inclusion system for a category 𝒞 if:
- ℐ and ℰ are broad subcategories of 𝒞
- ℐ is a partial order
- Every arrow f in 𝒞 can be factored uniquely as f = e_f;i_f, where e_f ∈ ℰ and i_f ∈ ℐ
Illustration of how an arrow factors through abstract surjections and abstract inclusions in an inclusion system.
This definition provides a framework for understanding substructures across various mathematical domains, from algebra to topology to logical systems.
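The factorization axiom can be written out as a single condition; a sketch in LaTeX, using the diagrammatic composition order f;g ("f, then g"):

```latex
% Unique (E, I)-factorization: every arrow of C splits as an
% abstract surjection followed by an abstract inclusion.
\forall f \in \mathrm{Arr}(\mathcal{C})\;
  \exists!\, (e_f, i_f) \in \mathcal{E} \times \mathcal{I} :
  \quad f = e_f \mathbin{;} i_f
```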
The Problem of Unique Pullbacks
The selected problem asked:
Prove that in any category with an inclusion system, each cospan (sink) of abstract inclusions that has a pullback has a unique pullback consisting of abstract inclusions.
The proof follows these key steps:
- Consider a cospan of abstract inclusions (i₁,i₂) and a pullback cone (f₁,f₂)
A cospan of abstract inclusions (i₁,i₂) with a pullback cone (f₁,f₂).
- Factor one of the arrows (f₁) through the inclusion system as f₁=e₁;i'₁
Factoring f₁ into an abstract surjection e₁ followed by an abstract inclusion i'₁.
- Use the Diagonal Fill-in property to obtain a unique arrow h such that e₁;h=f₂ and i'₁;i₁=h;i₂
Application of the Diagonal Fill-in property to obtain the connecting arrow h.
- Prove that h is an abstract inclusion and that (i'₁,h) forms a pullback cone of abstract inclusions
The resulting pullback cone (i'₁,h) consisting entirely of abstract inclusions.
- Establish the uniqueness of this pullback cone
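The steps above can be condensed into their key equations. A sketch in LaTeX (diagrammatic composition order; the statement of Diagonal Fill-in given in the comments is the standard one for inclusion systems and should be checked against the paper):

```latex
% Cospan of inclusions i_1 : A \to C, i_2 : B \to C;
% pullback cone f_1 : P \to A, f_2 : P \to B, so
f_1 \mathbin{;} i_1 = f_2 \mathbin{;} i_2 .
% Factor f_1 through the inclusion system:
f_1 = e_1 \mathbin{;} i'_1 , \qquad e_1 \in \mathcal{E},\; i'_1 \in \mathcal{I}.
% The square e_1 ; (i'_1 ; i_1) = f_2 ; i_2 has e_1 \in \mathcal{E} and
% i_2 \in \mathcal{I}, so Diagonal Fill-in gives a unique h with
e_1 \mathbin{;} h = f_2 , \qquad h \mathbin{;} i_2 = i'_1 \mathbin{;} i_1 .
% It remains to show h \in \mathcal{I}, that (i'_1, h) is a pullback
% cone, and that this cone is unique (using that \mathcal{I} is a
% partial order).
```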
This problem represents the kind of mathematical reasoning that connects to formal mathematical reasoning as a frontier for AI, requiring precise understanding of definitions and properties.
Analyzing AI Performance: How o3-mini and Grok-3 Approach Category Theory
Data Gathering and Definition Understanding
Surprisingly, both AI systems struggled with gathering relevant data about inclusion systems, despite substantial literature on the topic:
o3-mini achieved partial success but presented the definition clumsily. It relied on an undefined "factorization system" concept without clarifying which version it meant - problematic since several different definitions exist in the literature. Without references, the definition remained incomplete.
Grok-3 failed catastrophically, producing a completely incorrect definition with crucial parts missing. It mistook a derived property (stability under pullbacks) for part of the definition, and even that property was formulated incorrectly.
This fundamental struggle with mathematical definitions points to a significant weakness in current AI systems. The researcher hypothesizes that AI models struggle to extract precise mathematical concepts from narrative text, even when that text uses explicit mathematical language.
These definition issues illustrate challenges in AI's mathematical reasoning capabilities that might not be apparent from performance on more standardized problems.
Mathematical Language and Notation
Both AI systems displayed concerning imprecisions in mathematical language:
- They used vague expressions like "essentially uniquely," "typically," and "in many formulations one can prove" - language that mathematicians find unacceptable, or at least annoying.
- o3-mini exhibited a more serious failure regarding mathematical notation, particularly arrow composition. In category theory, two notation styles exist:
  - The set-theoretic style, writing g∘f for "f, then g" (read right to left)
  - The diagrammatic style, writing f;g for "f, then g" (read left to right)
o3-mini confusingly mixed these notations, using the same symbol (∘) with both ordering conventions within a single equation, creating nonsensical mathematical formulas - considered one of the gravest failures in mathematical communication.
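A minimal sketch of why mixing the conventions matters, using plain Python functions as stand-ins for arrows (the function names here are illustrative, not from the paper):

```python
# Two notational conventions for composing arrows. Mixing them in one
# formula, as o3-mini did with the same symbol, silently reverses the
# order in which the arrows are applied.

def compose_settheoretic(g, f):
    """g ∘ f : apply f first, then g (read right to left)."""
    return lambda x: g(f(x))

def compose_diagrammatic(f, g):
    """f ; g : apply f first, then g (read left to right)."""
    return lambda x: g(f(x))

double = lambda x: 2 * x
inc = lambda x: x + 1

# Read correctly, both conventions denote the same arrow:
assert compose_settheoretic(inc, double)(3) == 7   # inc ∘ double
assert compose_diagrammatic(double, inc)(3) == 7   # double ; inc

# Reading "double then inc" but writing it with ∘-ordering yields a
# different arrow - the source of the nonsensical mixed formulas:
assert compose_settheoretic(double, inc)(3) == 8   # double ∘ inc
```

The point is that neither convention is wrong on its own; the error is using one symbol with both orderings inside a single equation.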
This confusion highlights the challenges AI systems face with the precise, context-sensitive nature of mathematical notation, which remains a significant barrier to their effective use in mathematical research.
Mathematical Reasoning Quality
The reasoning capabilities of the two systems differed substantially:
o3-mini produced valid reasoning for approximately the first half of the problem (the existence part), following a similar approach to the researcher's solution but with some differences. However, it reduced the critical uniqueness portion to mere hints without necessary details.
Grok-3's reasoning was fundamentally compromised by its incorrect understanding of inclusion systems. It provided only trivial reasoning and constructed a "twisted argument" for uniqueness based on misunderstandings of the core concept.
Even o3-mini's partial success reveals limitations in AI's ability to fully construct and communicate mathematical proofs at the level required for research mathematics.
Conclusions and Recommendations: The Real State of AI for Mathematical Research
The experiment revealed significant limitations in current AI systems' mathematical capabilities:
- Definition retrieval failures: Both systems struggled to accurately extract and present mathematical definitions, with Grok-3's failure being particularly severe.
- Inappropriate mathematical language: Both systems used imprecise language and notation that would be unacceptable in mathematical communication.
- Limited reasoning abilities: While o3-mini showed some legitimate reasoning capability, both systems fell short of complete, rigorous mathematical proofs.
For AI developers, the researcher recommends building meta-level checking mechanisms for mathematical coherence, notation consistency, and precision of language.
For mathematicians, the advice is clear: do not rely on AI-provided definitions or results without verification against reliable literature. AI systems cannot yet replace human understanding, particularly in concept creation - a fundamental aspect of advanced mathematics.
For those seeking AI mathematical assistance, the researcher suggests combining "soft" AI with formal theorem provers such as Vampire, SPASS, or E, which can at least verify results expressible in first-order logic.
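As an illustration of what "expressible in first-order logic" means here, the unique-factorization axiom might be encoded over a one-sorted theory of arrows, with predicates E and I for the two arrow classes (a sketch only; a real encoding for these provers would be written in TPTP syntax and would also have to axiomatize the partiality of composition):

```latex
% Existence of an (E, I)-factorization for every arrow:
\forall f\; \exists e\, \exists i\;
  \big( E(e) \wedge I(i) \wedge f = e \mathbin{;} i \big)
% Uniqueness: equal composites force equal factors.
\forall e\, i\, e'\, i'\;
  \big( E(e) \wedge I(i) \wedge E(e') \wedge I(i')
        \wedge e \mathbin{;} i = e' \mathbin{;} i' \big)
  \rightarrow \big( e = e' \wedge i = i' \big)
```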
These limitations connect to broader discussions about AI's inherent mathematical challenges, suggesting that despite impressive marketing claims, AI systems remain far from achieving human-level mathematical research capabilities.