As 2024 draws to a close, I’m reflecting on some of the biggest news that has happened across the beats I cover for New Scientist, in physics, space, and technology.
First up: AIs that are increasingly capable at maths.
AIs that can do more maths
In April, I wrote a feature about the quest for AIs capable of mathematical reasoning. The thinking, from both computer scientists and mathematicians, was that AIs that can do proper research-level maths will also be AIs that can reason in complex, creative and useful ways. This might be similar to the kinds of AI that people hope will exist when we achieve artificial general intelligence (AGI), if you believe such a thing is possible. As Alex Davies, who works on Google DeepMind’s AI for Maths initiative, told me: “Mathematics is the language of reasoning. If models can learn to speak it fluently, we will have created a very worthy intellectual partner.”
Many of the people I spoke to while writing that piece felt that, while there were some interesting AI tools for maths (both established research techniques and newer LLM-type models), we were still a long way from anything resembling a professional mathematician. Looking back, I still think that’s largely true. We’re not going to see a Millennium Prize-level problem, like the Riemann hypothesis, proved by an AI any time soon. But it’s worth taking stock of some of the advances we’ve seen in the past 12 months.
In July, Google DeepMind announced a pair of AI systems, AlphaProof and AlphaGeometry 2, that together performed at silver-medal standard at the International Mathematical Olympiad. The IMO is considered the world’s most prestigious competition for young mathematicians, and a medal-level performance has long been seen as a real litmus test for whether AIs can do mathematical reasoning. You can certainly argue that the proofs it demands are far below research-level mathematics, and that a system can get some of the way there by training on lots of past competition problems, but it’s still an impressive feat, and shows that these systems are doing something interesting. Whether we can call that reasoning, in the way that humans do it, is what people will argue over.
We’ve also seen the rise of “thinking” LLMs - that is, LLMs that spend more computing power and time working through the prompt they are given, rather than just quickly querying an already trained model. This approach produced the o1 family of models that OpenAI announced in September. I recently spoke with Noam Brown at OpenAI, one of the key researchers behind o1, and he felt strongly that this was going to help models climb past the pretraining plateau (the diminishing returns from throwing vast amounts of data and computing power at ever-larger models) that everyone had been predicting.
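OpenAI hasn’t published exactly how o1 “thinks”, but to give a rough flavour of what spending extra compute at inference time can mean, here is a toy sketch of one well-known approach, best-of-n sampling with a verifier. The functions are hypothetical stand-ins of my own, not anything from OpenAI:

```python
import random

# Toy sketch of test-time compute: sample many candidate answers and keep
# whichever one a checker scores highest. generate() and check() are
# hypothetical stand-ins for a language model and a verifier.

def generate(prompt: str) -> str:
    """Stand-in for one sampled model response."""
    return random.choice(["41", "42", "43"])

def check(prompt: str, answer: str) -> float:
    """Stand-in for a verifier or scoring model."""
    return 1.0 if answer == "42" else 0.0

def answer_with_thinking(prompt: str, n_samples: int = 32) -> str:
    # More samples means more compute spent per question, and better odds
    # that at least one candidate passes the checker.
    candidates = [generate(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda a: check(prompt, a))

print(answer_with_thinking("What is 6 * 7?"))
```

The point is simply that the quality of the final answer scales with how much computing you are willing to spend per question, rather than with the size of the trained model alone.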
Then, just before Christmas, OpenAI announced o1’s successor, o3 (the name o2 was reportedly skipped to avoid a potential trademark dispute with the British mobile network O2). We don’t know exactly how o3 differs from o1 under the hood, but in terms of performance, it seems able to solve many mathematical problems that AIs simply couldn’t before, including scoring highly on certain mathematical benchmarks on which all other AIs had performed miserably. There were, of course, caveats, like the vast computing power and cost required to tackle these benchmarks (possibly up to a million dollars for ARC-AGI), and their relevance to practical mathematics. FrontierMath, another of the benchmarks, is mostly composed of problems with numerical answers, because those are quick and easy to check automatically. Most hard mathematical problems don’t have simple numerical answers you can score for a benchmark.
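That design choice is easy to see from the grader’s point of view: with a numerical answer, marking reduces to comparing a final value, with no proof for anyone (or anything) to read. Here is a toy sketch of such a grader, entirely hypothetical rather than FrontierMath’s own code:

```python
from fractions import Fraction

def grade(model_output: str, expected: Fraction) -> bool:
    """Mark an answer by exact comparison with the expected value."""
    try:
        return Fraction(model_output.strip()) == expected
    except ValueError:
        return False

# Hypothetical benchmark entry: the grader only compares final numbers.
expected_answer = Fraction(355, 113)
print(grade("355/113", expected_answer))   # True
print(grade("3.14159", expected_answer))   # False - close, but not exact
```

A proof-based problem has no equivalent one-line check, which is part of why benchmarks lean so heavily on questions with clean numerical answers.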
Despite these caveats, it was previously assumed that AIs simply couldn’t do well on these sorts of tests, and the capability of these models is clearly growing. In a recent blogpost about o3, mathematician Kevin Buzzard at Imperial College London wrote that he was “shocked” at how well it had done on some of these benchmarks. He went on to temper his excitement with some of the caveats above, but I think there is genuine interest among mathematicians in how well o3 has done.
I’m sure we’ll see much more of these “thinking”-type LLMs next year (Google has already announced its own version, in the form of Gemini 2.0 Flash Thinking). Whether they can help at all with proper mathematical research, or reason like a human, is what interests me. I suspect not, but it’s a fascinating development.
There has also been some interesting work on using AI for research mathematics as a tool, rather than as a replacement for humans. François Charton at Meta’s Fundamental AI Research (FAIR) team has released two tools. One, called PatternBoost, appears fairly versatile when it comes to solving combinatorics problems, a wide class of mathematical problem (though still very limited with respect to all maths). The other tackles the problem of finding Lyapunov functions, which are used to show whether a dynamical system - the three-body problem is a famous example - remains stable. Both are limited, for now, but they show the increasing willingness with which mathematicians (though Charton considers himself a computer scientist) are approaching AI.
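To give a flavour of what a Lyapunov function is, here is a toy example of my own (nothing to do with the Meta tool, which searches for such functions automatically), using the sympy library. For a damped oscillator, V(x, v) = x² + v² works as a Lyapunov function: it is positive away from the origin and never increases along the system’s trajectories, which certifies stability.

```python
import sympy as sp

# Damped oscillator: dx/dt = v, dv/dt = -x - v
x, v = sp.symbols("x v", real=True)
f_x = v
f_v = -x - v

# Candidate Lyapunov function and its rate of change along trajectories
V = x**2 + v**2
V_dot = sp.diff(V, x) * f_x + sp.diff(V, v) * f_v

print(sp.simplify(V_dot))  # -2*v**2, which is never positive
```

Charton’s tool aims at the genuinely hard part that this toy example skips entirely: coming up with a suitable V in the first place for systems where none is known.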