Mathematics in the Library of Babel

Mathematics isn't only about saying true things. It's about asking the right questions, being confused, stumbling about, getting distracted, being wrong, recognizing when you're wrong, being stuck. Mostly being stuck. It's about clinging to a giant edifice and feeling it out until you understand some tiny piece of it. It's about finding meaning in and intuition for the texture of an object which, at first, can only be apprehended by bashing your skull into it until it imprints on your forehead. Then trying to convey some of that insight to someone else, and watching as they find their own way to it.

I started trying to get LLMs to do math in July 2020, through the game "AI Dungeon," one of the earliest applications powered by GPT-3. I first got GPT-3 to produce a correct proof (of Fermat's Little Theorem) in April 2022. At the time I did not think they would become useful for math research in the near term.

An attempt at getting GPT-3 to do arithmetic using the game "AI Dungeon" in 2020. It refused.

This changed when the first reasoning models were released:^1 on February 1, 2025, I wrote that the model o3-mini-high “clearly has passed the threshold of genuine usefulness” for research, while still making many, many mistakes. Since then, the models have improved, and ChatGPT 5.2 Pro (released in December 2025) can regularly provide reasonable proofs of lemmas that I would characterize as “involved but routine for experts,” though it still makes many errors. And I have been using Codex, OpenAI's coding/computer use agent, for scientific computing tasks I would not have considered attempting a few months ago.^2

In public comments, I've tried to credit successes while pushing back against hype. I've talked a lot about "slop" papers on arXiv. I have worried that we are polluting the scientific commons with incorrect mathematics whose errors are enormously difficult to detect. I've tried to focus on the present. In this essay I'll talk about the future.

The future

Since early last year I've been telling my colleagues that I expected models to be able to autonomously produce research mathematics comparable to some of the output of the best human mathematicians by 2040, but it seemed unlikely that this would occur by 2030. I expected major successes before then, including perhaps proofs of minor but interesting open conjectures^3 in the next year and more important open conjectures not too long afterwards. I thought there were likely serious obstructions to autonomous mathematics at the level of the best human mathematicians that remained to be overcome, and that benchmark results were not telling the whole story.

I think I have been underrating the pace of model improvements. In March 2025 I made a bet with Tamay Besiroglu, cofounder of RL environment company Mechanize, that AI tools would not be able to autonomously produce papers I judge to be at a level comparable to that of the best few papers published in 2025, at comparable cost to human experts, by 2030. I gave him 3:1 odds at the time; I now expect to lose this bet.

Much of what I'll say here is not factually very different from what I've written before. I've slowly updated my timelines over the past year, but if one wants to speculate about the long-term future of math research, a difference of a few years is not so important. My trigger for writing this post is that, despite all of the above, I think I was not correctly calibrated as to the capabilities of existing models, let alone near-future models. This was more apparent in the mood of my comments than their content, which was largely cautious.

I recently made an incorrect prediction about existing model capabilities. A number of world-class mathematicians initiated a wonderful project they dubbed "First Proof" (a baking pun), aimed at measuring the usefulness of current models for certain math research tasks. The first round, which was essentially a trial balloon, consisted of 10 lemmas taken from the authors' own unpublished work. Most of these are not research projects on their own, but they are natural examples of non-trivial tasks that appear in the daily life of a research mathematician.

At the paper's release I weakly expected that existing models would come up with proofs of 2-3 of these questions (weakly because they are all rather far from my work, and it took some time for me to evaluate their difficulty), but thought that 4-5 solutions was plausible. While there is no official grading for this round, it appears at the time of this writing that somewhere between 6 and 8 have been solved, if one combines all attempts (and an enormous amount of garbage has been produced). Some solutions are arguably semi-autonomous rather than autonomous. Is this meaningfully different from what I had expected? Though I already knew that existing models could sometimes usefully prove lemmas that show up in the day-to-day work of a mathematician, I found it notable. This experiment suggests that, with sufficient scaffolding, they can often prove such lemmas.

Are there serious remaining obstructions to doing more?

While one can look for reasons to discount this error of mine,^4 and I'll discuss some below, I've instead tried to take this as an opportunity to re-examine my views on AI progress and give a less cautious view of where I stand on the obstructions to producing high-quality autonomous math research.

Like many mathematicians, I find much discussion around AI-for-math to be filled with hype or outright quackery, and much of my commentary has focused on criticizing exactly this. So I hope you will take me seriously when I say that it's not all hype.

First Proof, FrontierMath, etc.

Despite not yet being a benchmark, the First Proof project is by far the best measure of model usefulness for science and math research available today, and I very much hope that frontier labs continue to take future rounds seriously. I will briefly explain the project and some related benchmarks, and then discuss what they tell us about the prospect for future automation and the role of human mathematicians.

The most informative benchmark for mathematics up until now has been FrontierMath. FrontierMath consists of problems with numerical answers (not proofs) which can thus be easily graded by a machine. It is divided into two parts: Tiers 1-3, on which the best models, as of this writing, have about 40% pass@1 accuracy, and Tier 4, consisting of problems intended to be harder, on which the best models have about 32% pass@1 accuracy.
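For readers unfamiliar with the metric: pass@1 is the fraction of problems solved in a single sampled attempt, and the standard unbiased pass@k estimator (given n sampled attempts per problem, of which c are correct) is 1 - C(n-c, k)/C(n, k). A minimal sketch in Python; the numbers below are purely illustrative, not actual FrontierMath data:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts drawn (without replacement) from n sampled attempts is
    correct, given that c of the n attempts were correct."""
    if n - c < k:
        return 1.0  # too few incorrect attempts to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 10 attempts per problem, 4 of them correct.
print(pass_at_k(10, 4, 1))  # 0.4 -- reduces to c/n when k = 1
print(pass_at_k(10, 4, 5))  # chance at least one of 5 samples is correct
```

For k = 1 this is just the success rate of a single sample, which is what the FrontierMath figures above report.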

By contrast the first batch of First Proof problems consists of 10 lemmas taken from the authors' own unpublished work; this was intended as a calibration stage, and the solutions will not be officially graded. Instead of a number, answers require proof. Producing such a problem list, let alone grading solutions, is extremely labor-intensive.

As the authors note in their paper announcing the project, proving such lemmas is a fairly small portion of the mathematical research process. In my own work I find that discovering the statements of crucial lemmas is often much more difficult than proving them---my sense is that this varies somewhat from field to field. And so an important disclaimer is that a model that can reliably solve 100% of First Proof-style problems would be immensely useful but not necessarily capable of doing high-quality autonomous research.

What constitutes a solution? The authors write that "we consider that an AI model has answered one of our questions if it can produce in an autonomous way a proof that conforms to the levels of rigor and scholarship prevailing in the mathematics literature. In particular, the AI should not rely on human input for any mathematical idea or content, or to help it isolate the core of the problem."

It is by necessity a messy project. The problems contained in the initial batch cover a wide range of fields and difficulty levels; the problems, comments, and official solutions can be found here. Of the 10 problems, two of them (problems 9 and 10) could be solved by publicly available models "out of the box." The authors were aware of this; the lemmas were not chosen adversarially. In fact problem 10 had an answer that was more or less directly available in the literature, and problem 9 was (in my view) a pretty minor variant of previous work of one of the authors. A sketch of a solution to Problem 1 was available online, with many details omitted.

Note that my own work is not particularly close to any of the problems; the ones whose statements I could understand without looking up any words were 2, 4, 6, 7, 8, 9. I invested a substantial amount of time trying to understand the other questions, and to get some sense of their context; I'd welcome any commentary from experts who think I have not accurately characterized any of the problems or their difficulty.

In the week between the release of the problems and that of their solutions, there was a great deal of work aimed at generating solution attempts, with varying degrees of success and autonomy. Most impressively, Tom de Groot produced a proof of Problem 9 that I believe to be correctly formalized in Lean. This is the gold standard of LLM-aided mathematics; as I understand it he manually orchestrated a number of LLM agents to achieve this. But the median solution was basically nonsense, to my reading.

And shortly before the official solutions were released, Jakub Pachocki at OpenAI announced their attempts, saying that their solutions to problems 2, 4, 5, 6, 9, 10, generated by an internal model with "limited human supervision," have a "high chance of being correct, and some further ones look promising." Models were asked "to expand upon some proofs, per expert feedback." It was rather easy to see that the solutions to 3 and 7 were not credible; I and others quickly realized that the solution to 2 was wrong despite the claim it was likely correct. And Pachocki later clarified that some of the solutions were produced at best semi-autonomously, writing:

Some of the prompts included guidance to iterate on its previous work; e.g. the rollout that produced the solution to #6 was prompted with the problem statement followed by: "Trying [sic] using a BSS barrier type argument. You will have to think hard about the setup and the inductive framework to push it through." This guidance is based on the previous attempts by the model to solve the problem (which previously converged on this approach) and did not originate from the person who prompted the model; however, in a properly controlled experiment, we would avoid such manual prompting.

Here BSS refers to closely related work of Batson-Spielman-Srivastava (two of whom are authors of First Proof). Clearly this does not meet the standard of autonomy set out by the project authors. And unless OpenAI releases full transcripts, it will be hard to understand precisely how much human expertise went into producing these solutions.

For someone looking to dismiss the results, this is the start of a recognizable pattern: hype, then criticism from experts, then a slow walk-back to something still impressive but more reasonable. And yet...

What can existing models achieve?

And yet do we really think we couldn't get an autonomous system to say to itself, "Try using a BSS barrier type argument. You will have to think hard about the setup and the inductive framework to push it through"? This is surely a huge hint, but do we really think this is the obstruction that will prevent some substantial portion of human mathematical labor from being automatable, given that the models can in fact recognize the relevance of this earlier work to the problem? I'm skeptical, and in fact I have seen another proposed solution to problem 6, by Shengtong Zhang and Wilson Lin at Cursor, that seems more clearly autonomous, and that experts seem to vouch for.

Google DeepMind has also released credible solutions to problems 2, 5, 8, 9, and 10, with full transcripts, generated by their scaffolds Aletheia and Deep Think. One of their solutions to problem 7 is terse, though it too is arguably correct---interestingly it was initially classified by mathematicians at DeepMind as incorrect. While DeepMind's solutions were released publicly after the deadline, they were sent to the First Proof team prior to it. Based on released transcripts, these solutions are clearly autonomous; problems 2 and 7 were not solved by OpenAI, and OpenAI did not claim correctness for their solution to problem 8, though its status is unclear. So it seems likely that somewhere between 6 and 8 problems were solved correctly if one combines all attempts.

Some of the solutions (notably, to problems 9 and 10) were, in some variant form, available online. The models still had to recognize this pre-existing work and adapt the arguments therein. How much of the labor of mathematics research has this shape? Quite a bit. A model reliably able to do this would be unbelievably useful.

While it is difficult to be certain exactly how OpenAI's solutions were produced, the comparable performance by Aletheia---a scaffold of publicly available models---made it clear to me that I was underrating the capability that can already be achieved via inference scaling. Christian Szegedy has predicted that models will be mathematically "superhuman in almost all respects" in 6 months to a year. I find that precise timeline hard to believe for most aspects of mathematical research, but I suspect that he won't be off by much when it comes to proving some class of involved statements that would previously have required an expert. This is a narrow conception of mathematics indeed, but it is true that producing such proofs is a large part of math research.

A few days ago, I wrote that "it's clear this technology is at least as impactful as...the invention of the computer, for math research. That's a lot! Thus far it is not totally clear to me it will be more impactful, though I think it's very plausible." It's tempting to read this as reassurance: the computer changed mathematics, but it only automated a small part of mathematics. But let me rephrase: it's very likely that this technology is bigger than the computer.

Where did existing models struggle?

These results, though impressive, have a number of limitations.

  1. As far as I know, there was only one credible formalized solution to any problem, impressively and manually orchestrated by Tom de Groot. This is despite lots of big claims from autoformalization companies about "mathematical superintelligence"; that none of them released solution attempts communicates some valuable information about the status of model capabilities in formal mathematics as compared to informal mathematics. To be fair, many of the problem statements likely cannot be easily formalized in a reasonable amount of time.
  2. The models were quite successful at answering two of the questions with close precedents in the literature, namely 9 and 10. But problem 1 also had a sketch in the literature; it seems that no model was able to fill in the details. I find this somewhat puzzling.
  3. The models (and the humans supervising them) generated an enormous amount of garbage, including some incorrect solutions claiming to be formalized in Lean. Even the best models/scaffolds seem not to be able to reliably detect when they are producing nonsense.
  4. Many correct solutions are very poorly written, and their correctness is exceedingly difficult to check because of this. Comparing the official solutions to the model solutions, there is, in most cases, a remarkable difference. In the former the main ideas, obstructions to be overcome, etc., are usually identifiable; in the latter they are often completely unclear. And in the course of writing their solutions, the human authors often develop useful new objects, terminology, etc. to capture what they're doing, while the models usually just plow ahead. There are exceptions to this though; some of the solutions are quite interesting as written.
  5. As far as I know no team outside of a big lab significantly outperformed GPT 5.2 Pro "out-of-the-box," even with somewhat involved scaffolds. For example, see here for a nice attempt by Althofer and Wolz, whose scaffold seems to have exactly matched the performance of GPT 5.2 Pro (credible solutions to 9 and 10, and a plausible attempt at problem 5 that does not meet the standards expected from academic mathematics). Puzzlingly, they claim that their incorrect solution to Problem 2 was checked by Aristotle, Harmonic's autoformalization agent. (I don't mean to pick on this attempt; it just seems to me to be a representative example of solutions produced with scaffolded publicly available models.) I find it surprising that eliciting the performance of, say, DeepMind's Aletheia scaffold, is apparently difficult---perhaps the inference costs necessary to achieve Aletheia's performance are very large?

What are we measuring?

In some aspects of mathematics, the models have been superhuman for some time. Every frontier model knows more mathematics than any human who has ever lived; every model can solve exam questions faster and more accurately than any human. One might reasonably object: Google, or the collection of all math textbooks, knows more math than anyone who has ever lived. Mathematica or a TI-84 calculator is better at certain computations than any human. Sure, I agree.

The mystery is this: a human with these capabilities would, almost certainly, be proving amazing theorems constantly. Why haven't we seen this from the models yet? What are they missing? The answer is much more obvious for a calculator than it is for an LLM.

For a long time, my answer has been: benchmark questions are not measuring the same thing in models and in humans, because reasoning ability is less correlated with benchmark performance in models than it is in humans.

When you generate a benchmark, or test a model privately, you try to generate some hard questions and then give them to the model. What you actually generate is a question that you think is hard---that is, one that is hard for you or people you know. What makes a question hard is often that there is some missing prerequisite knowledge, or that one has not been exposed to similar problems; then reasoning ability is necessary to fill in this missing context. But this means that model performance can improve by increasing knowledge, without increasing reasoning abilities. There is some evidence^5 for this view. In short, because the models have an incredibly vast knowledge-base, they often do not have to generalize very far to solve a problem, whereas a human might have to perform an immense amount of reasoning to do so. It seemed to me plausible that benchmark performance was explained by "IMO Gold"-level reasoning plus increasing knowledge.

While this may be true, maybe it doesn't matter; maybe shallow generalization is enough. If you can generalize a little bit, for a long time, maybe you can do whatever a human mathematician can do.

One of the ways I like to test the models is to give them a hard problem, and then see how long it takes me to cajole/guide/bully them into giving me a correct solution. For a lemma from one of my papers, it is typically quite difficult or impossible to get a complete proof without any hints. In one case I devoted, as an experiment, 8 hours (admittedly some of which I spent away from the keyboard in frustration) trying to get GPT 5 Pro to produce a relatively simple counterexample to some statement without hints. The models do much better if one gives them a hint. Frontier models can often execute arguments I would consider "routine" if one explains the general idea in a sentence or two. It's easy to take this as evidence for usefulness, but against automatability. This is wrong. Instead of saying "it takes 8 hours of human labor" or "you have to give away the main idea," we should say "all it takes is 8 hours of labor, or the one-sentence main idea."

That's the barrier to automation. Eight hours of back and forth with a model is really not so much to try to automate, and the various scaffolds in development (e.g. Aletheia from DeepMind) demonstrate this to some extent.

GPT-3 proving Fermat's Little Theorem in 2022, with some help.
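For reference, and as a nod to the formalization thread running through this essay: the statement GPT-3 was proving has long been formalized, and (if I am reading Mathlib correctly) Lean 4's Mathlib derives it in one line from `ZMod.pow_card_sub_one_eq_one`. A sketch, with the theorem name `fermat_little` my own:

```lean
import Mathlib

-- Fermat's Little Theorem: for p prime and a not divisible by p,
-- a^(p-1) ≡ 1 (mod p). Stated in ZMod p, where "not divisible by p"
-- becomes a ≠ 0.
theorem fermat_little (p : ℕ) [Fact p.Prime] (a : ZMod p) (ha : a ≠ 0) :
    a ^ (p - 1) = 1 :=
  ZMod.pow_card_sub_one_eq_one ha
```

The gap between a natural-language proof of this statement and a checked formal one is precisely the gap discussed below in the context of Tom de Groot's formalized First Proof solution.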

What I find so exciting about First Proof, by comparison to FrontierMath or my own experiments, is that despite its messiness, it is an excellent proxy for part of the labor of math research. Success on FrontierMath would likely indicate significant usefulness but not necessarily a sea change to the practice of math research. It is impossible to argue that success on First Proof, even if achieved semi-autonomously or by shallow generalization from the literature, does not indicate genuine usefulness.

The near future: obstacles to automation^6

Truth-seeking

In the near term, we're in trouble. Models are able to produce both correct, interesting mathematics, as well as incorrect mathematics that is exceedingly labor-intensive to detect. Academic mathematics is simply not prepared to handle this.

Humans also produce errors. But so far the system has survived (though it was already under enormous stress prior to the advent of LLMs) due to the fact that human mathematicians are by and large truth-seeking. That it was not clear to OpenAI which of their solutions to First Proof are correct suggests that the same is not true for the models generating them (and many other models/scaffolds also released incorrect solutions---for some, it seems that all the solutions they released were incorrect).

To be clear, I don't think this issue is insurmountable, but surmounting it will be costly in the short-term. For example, we can use AI reviewers to cut down on the most egregious garbage; at present this will be highly unreliable, and likely will result in some correct work getting rejected. We can gate-keep in various ways: require collateral for review, only accept papers from trusted colleagues, etc. In my view these sorts of solutions are counter to the open spirit of mathematics.

In addition to being a serious threat to the profession, lack of truth-seeking is also a serious issue for usefulness. Checking a proof is extremely difficult, as again the discourse around First Proof seems to demonstrate. A model that produces correct proofs of difficult results 50% of the time, and proofs with subtle errors the other 50%, is likely useful, but using it is at best unpleasant.

I do not view this as a serious obstruction to the automation of theorem-proving. At worst, as autoformalization tools improve, we will be able to verify some subset of outputs (though this comes with its own challenges, and again formalization of statements is labor-intensive to verify). At best, models themselves will improve at verification and reliability will increase. There are already signs of this. o3 would almost never admit to not being able to prove something; with GPT 5.2, it happens regularly.

Creativity

Can an LLM invent the notion of a scheme, or of a perfectoid space, or whatever your favorite mathematical object is? (Could I? Could you? Obviously this is a high bar, and not necessary for usefulness.) Can it come up with a new technique? Execute an argument that isn't "routine for the right expert"? Make an interesting new definition? Ask the right question?

Frontier models are not yet performing these tasks at a level comparable to their ability to answer well-posed questions that are accessible to known techniques. I think there is no indication that their doing so is impossible, and some evidence that it is possible. Namely, some of these tasks occur in miniature when one proves a challenging result; one decomposes it into subquestions, tries to understand the objects that appear therein, etc. One can read model chain of thought and see this happening.

This evidence is admittedly weak. There's some question in my mind as to how much of this is "shallow generalization," as discussed above, but it seems plausible the seeds of creativity are there.

There is little sign of models autonomously "theory-building."^7 It would be interesting to try to measure this capability (how?). If models do indeed start producing huge quantities of correct, interesting mathematics, then finding the "correct proof," theory-building, and consolidation will become ever more important, to make sense of it all.

I am skeptical that there is any mystical aspect of mathematics research intrinsically inaccessible to models, but it is true that human mathematics research relies on discovering analogies and philosophies, and performing other non-rigorous tasks where model performance is as yet unclear.

Long-horizon tasks

High quality math research takes a long time, sometimes years or decades. It is rare (at least in my experience) for a high quality project to take less than a few months. There exist very few tasks of this length (possibly none!) that models can, at present, achieve reliably. METR estimates that GPT 5.2 (high) can complete software engineering tasks that take an expert 6.5 hours, with 50% reliability. That timespan is, apparently, doubling every 4 months, and there's evidence these numbers can be substantially improved with appropriate scaffolding.
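To get a feel for what the METR trend implies if (a big if) it simply continues: a horizon of H hours is reached after 4·log2(H/6.5) months. A back-of-the-envelope sketch in Python; the "few-month project" figure is my own illustrative assumption, and this is an extrapolation, not a prediction:

```python
import math

def months_until_horizon(target_hours: float,
                         current_hours: float = 6.5,
                         doubling_months: float = 4.0) -> float:
    """Months until the 50%-reliability task horizon reaches target_hours,
    assuming the current doubling trend continues unchanged."""
    return doubling_months * math.log2(target_hours / current_hours)

# A "few-month" research project: ~13 weeks of expert work at ~40 h/week.
project_hours = 13 * 40  # 520 hours
print(round(months_until_horizon(project_hours), 1))  # 25.3 -- ~2 years out
```

Of course, nothing guarantees the trend holds, and 50% reliability on a months-long project is a very different thing from dependable research output.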

For a long time I have told myself that the automation of some kinds of high-quality math research would be foreshadowed by the automation of some end-to-end software engineering. I expect this latter will happen fairly soon.

Learning

Mathematics is not static. For example, here are some selected notions of "space" that have appeared in algebraic and arithmetic geometry: schemes (1960), rigid spaces (1962), stacks (1969), Berkovich spaces (1990), adic spaces (1996), diamonds (2017), condensed sets (2021), analytic stacks (2023), Gestalten (2025), ... (Of course I have omitted many.)

In the current paradigm of model training, as I understand it, models are trained on existing text corpora and synthetic data generated from it; their expertise on new objects is much weaker. When a human encounters a new object, we play around with it, learn about it, and develop expertise.

Recent model scaffolds can do this to a very limited extent, by leaving notes to themselves. Is learning the judicious use of markdown files?

Cost

Above I have focused on raw capability, but there is another relevant dimension here, namely cost.

GPT 5.2 Pro costs $168 per 1M tokens of output in the API; a single query is limited to 128,000 output tokens (including reasoning), and so can cost up to ~$21. Presumably the internal costs to OpenAI are some constant fraction of this. One way performance improvement could occur is by increasing capabilities given fixed effort, circumventing each of the problems discussed above directly. Another way is to increase effort, as DeepMind's Aletheia scaffold demonstrates (I think we don't know anything about the model/scaffold used by OpenAI, so it is unclear to what extent its improvement comes from inference scaling vs. better capabilities at fixed inference).
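The per-query figure follows directly from the per-token price; a trivial check (prices as quoted above, which will no doubt change):

```python
# Price and limit as quoted in the text; both change over time.
PRICE_PER_MILLION_OUTPUT_TOKENS = 168.00  # USD
MAX_OUTPUT_TOKENS_PER_QUERY = 128_000     # includes reasoning tokens

max_cost = (MAX_OUTPUT_TOKENS_PER_QUERY
            * PRICE_PER_MILLION_OUTPUT_TOKENS / 1_000_000)
print(f"${max_cost:.2f}")  # $21.50 per maximal single query
```

A scaffold like Aletheia, which presumably issues many such queries per problem, multiplies this accordingly, which is why per-token price declines translate directly into capability at fixed budget.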

It is not clear what the limits to performance for such scaffolds are, given enormous time and compute. Practically, they are limited by how much we are willing to spend. Were the per-token cost of existing models to decline substantially, we would already see substantial improvement at fixed cost.


I have been considering all of these obstacles to automation for some time. I think it likely that if they were, collectively, insurmountable, they would obstruct capabilities that we now see in existing models.

Since the middle of last year I have been predicting that models will soon begin to autonomously resolve "mildly interesting" open conjectures, say by late 2026. In my view this prediction has not yet come to pass, despite a number of leading indicators. Whether it does will be a good test of the seriousness of whatever obstructions to automation remain.

The library

Suppose the obstacles discussed above are overcome (as seems to me increasingly likely). Suppose we live in a world where we can, for a modest cost, automatically produce a paper comparable to the best few papers published this year. Suppose automated researchers running in a datacenter are churning out many such papers per day.

What will human mathematics research look like?

Let us take this to an absurd extreme. Suppose we had a library filled with proofs of every theorem of ZFC, as well as excellent guides that could, given a question, take us to the answer and explain it. What would a mathematician do in such a library?

If you ask the question this way, the answer becomes clear: they would be unbelievably excited, and immediately get to work. They would immediately start asking questions: how does one prove the Riemann hypothesis? The Hodge conjecture? Their own pet obsession (in my case, the Grothendieck-Katz p-curvature conjecture)? Then they would work until they understood the answer. The job would not be done, not even close.

I do not mean to suggest, even, that humans necessarily have an intrinsic edge in asking mathematical questions that are interesting to humans; that is certainly the case now (and I suspect it will be for some time), but I see no principled reason it should be true. I just mean that this is why we got into mathematics: we want to understand. That's the goal.

Stepping back, of course we will not have access to such a library. We will, in the near term, have access to powerful tools that can prove many results that would previously have required an expert. The difficulty of the results that can be proven autonomously will increase, though perhaps not at the same speed in every area. Eventually they will, most likely soon, be proving results that experts have tried and failed to prove.

This will change the practice of mathematics dramatically, and the shape that change will take seems quite sensitive to the shape of our new collaborators' intelligence. I'll speculate on how we might try to adapt to this change in the medium-term in a future essay. But the following seems to me to be less dependent on the precise path of technological development:

Will models have absolute advantage over humans in all aspects of mathematics research? (That is, not just proving theorems, but the messy intangibles of taste, creativity, theory-building, philosophy, etc.) Presumably the answer is "yes" on a large enough time-scale, but some of these skills seem challenging both to measure and to train. What, besides curiosity, will humans have to add? It's quite plausible that there will be a long tail of mathematics research skills where humans retain advantage for some time. Like most other professions, there will likely be, in the medium term, many bottlenecks that cannot be automated.

Will models have comparative advantage over humans in all aspects of mathematics research? There exist mathematicians who are much more talented than I am. And yet I am still producing results they have not produced, and that are, in my view, interesting. In part this is because the attention of such mathematicians is a limited resource; in part, I think, it is because different mathematicians have wildly different interests, instincts, approaches to the subject, etc. I see no reason this situation should not persist once we have access to incredibly powerful AI mathematicians. Presumably the attention of such will be a much less limited resource. But mathematics is very, very large. And as we explore it, we will open up more and more undiscovered surface area.

Of course it is possible that, if we want to know something, we will start by asking a computer and it will either tell us the answer, or get stuck. And that we will have a strong sense that, if it can't find the answer, it's very unlikely we'll be able to do so ourselves. When has that ever stopped us? What is the life of a mathematician but being stuck on a hard problem?

Acknowledgments

I am very grateful for comments from Mohammed Abouzaid, Kevin Barreto, Greg Burnham, Frank Calegari, Tony Feng, Peli Grietzer, Stephanie Koh, Mark Sellke, Ravi Vakil, and Kevin Weil.

Footnotes

^1 Perhaps it should have changed when it was observed that "think step by step" improved performance.

^2 I will mostly focus on performance of OpenAI models, since I have found these to be the best for my use cases. But Gemini, Claude, and the various scaffolds that use those models, are not so far behind.

^3 I find solutions to Erdos problems and so on to be an exciting leading indicator, but here I am referring to things a step or two above this.

^4 Peli Grietzer has generously pointed out to me that my expectations were plausibly on the mark for publicly available models and scaffolds. This might be right, but on the other hand there's nothing (except cost) stopping anyone from building a scaffold comparable to, say, DeepMind's Aletheia. I find it puzzling that, to my knowledge, no scaffold by a team outside a big lab significantly outperformed ChatGPT 5.2 Pro out of the box, which was capable of producing credible solutions to problems 9 and 10, and arguably to problem 5.

^5 This is a summary of Deep Think's performance on the FrontierMath benchmark in October 2025. Problems on FrontierMath are rated on a scale of 1-5 along the axes of required Background, Precision, and Creativity. We see here that Deep Think's performance correlates negatively (if not strongly) with the required background rating, but appears to correlate little with required creativity. According to Greg Burnham at Epoch AI, this pattern is common to the performance of all models.

^6 This essay is framed around the potential for autonomous high-quality math research. I am convinced that the obstacles to automating the job of a mathematician are no different from the obstacles to automating other professions that take place in front of a computer; our work is neither easier nor harder to automate. The obstructions I list in this section are not special to mathematics.

One sometimes sees the argument that mathematics is uniquely vulnerable to RLVR (Reinforcement Learning with Verifiable Rewards). I think there is little evidence for this claim; it rests on an elision of the difference between mathematics and formal mathematics. At present model performance is in fact best in informal mathematics, where verification is arguably as much a matter of judgment as it is in other fields.

^7 I am aware of one claim to the contrary; if one examines the paper in question the mathematics contained therein boils down, after unwinding definitions, to multiplying two power series.