Artificial intelligence (AI) has become a part of everyday life. It’s now integral to our phones and the software we use for work. Whether you’re an Apple, Microsoft or Google user, you have Apple Intelligence, Microsoft Copilot, Google Gemini or ChatGPT at your fingertips and, in turn, the large language models (LLMs) that power them.
These models have changed how we work forever, from making simple tasks swifter to generating new ideas for projects in seconds. But with ongoing concerns about generative AI models producing text that contains errors, can LLMs be trusted to help accountants with spreadsheet-based work?
The answer is yes and no, according to Dr Simon Thorne, a Senior Lecturer in Computer Science at Cardiff Metropolitan University, who has published a new study testing how well major LLMs execute spreadsheet functions, generate formulas and manipulate data.
“The performance of the models is very fractured across different tasks,” Thorne said, adding that some models would do well in certain reasoning or programming tasks but then completely fail in others.
Benchmarking LLMs’ abilities
His study updates tests first conducted in 2023, shortly after the release of ChatGPT 3.5. Thorne had been impressed with the model at first, but as he used it he began noticing answers that didn’t seem right. “Once you start probing, you start to realise these things are called hallucinations and they happen quite frequently,” he said.
The new study was designed to create a benchmark for LLMs akin to the aptitude tests used for people, but Thorne acknowledges that the speed at which LLMs develop means these tests could return different results in the near future.
He used several tests across a range of difficulty. Those at the ‘easy’ end of the scale included giving models spreadsheets with errors inserted into the formulas and data to see whether the LLM would pick them up. Those at the ‘complex’ end tested a model’s reasoning ability and whether it could produce formulas.
Spotting errors
Even at the simplest end of the scale, where Thorne asked the LLMs to check a basic profit and loss statement that had errors added to it, he found performance was “very variable” between different models. “Even models that picked up some of the mistakes didn't pick them all up,” he confirmed.
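To give a flavour of the task, here is a minimal sketch in Python (rather than a spreadsheet) of the kind of check involved: recompute a total and flag it when it disagrees with the reported figure. The line items and the seeded 5,000 discrepancy are illustrative, not taken from Thorne’s test materials.

```python
def audit_total(line_items: dict[str, float], reported_total: float,
                tolerance: float = 0.01) -> bool:
    """Recompute a total and return True if it matches the reported figure."""
    recomputed = sum(line_items.values())
    return abs(recomputed - reported_total) <= tolerance

# Illustrative revenue lines with a deliberately wrong reported total
revenue = {"product sales": 120_000.00, "services": 45_000.00}
print(audit_total(revenue, 160_000.00))  # False: the total is off by 5,000
```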
A more advanced auditing task, the triangle spreadsheet, was then used as a test. This spreadsheet takes three inputs, the lengths of a triangle’s sides, and classifies the triangle as equilateral, isosceles or scalene depending on the lengths stated.
“Now this is hard because all of this is implemented in named ranges, so it’s a level of abstraction away, or at least, humans find it quite hard,” said Thorne. “The results from that one were quite unsurprising in the sense that [the LLMs] couldn't find most of the mistakes and didn't even notice some of them.”
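The classification logic itself is simple to state, which is what makes the abstraction the hard part. A hypothetical Python equivalent of what the spreadsheet computes might look like this (the function name and the triangle-inequality guard are our assumptions; the study implemented the logic in spreadsheet formulas):

```python
def classify_triangle(a: float, b: float, c: float) -> str:
    """Classify three side lengths as a triangle type."""
    shortest, middle, longest = sorted([a, b, c])
    if shortest <= 0 or shortest + middle <= longest:
        return "not a triangle"  # degenerate or impossible side lengths
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"

print(classify_triangle(3, 4, 5))  # scalene
```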
Mathematical challenges
The LLMs succeeded only around half the time when tested on more complex mathematics, such as a balloon test that involves calculating how much helium is needed to get a balloon off the ground while safely factoring in the weight of any passengers.
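The physics here is a straightforward buoyancy calculation: each cubic metre of helium provides lift equal to the difference between the densities of air and helium. A rough sketch, using assumed sea-level densities and illustrative masses:

```python
RHO_AIR = 1.225   # kg/m^3, sea-level air density (assumed)
RHO_HELIUM = 0.1786  # kg/m^3, helium density at similar conditions

def helium_volume_for_lift(payload_kg: float, envelope_kg: float) -> float:
    """Volume of helium (m^3) whose net buoyancy lifts the total mass."""
    net_lift_per_m3 = RHO_AIR - RHO_HELIUM  # ~1.05 kg of lift per m^3
    return (payload_kg + envelope_kg) / net_lift_per_m3

# e.g. two passengers (160 kg) plus a 40 kg envelope and basket
print(round(helium_volume_for_lift(160, 40), 1), "m^3")  # ~191.1 m^3
```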
However, when he moved on to complex, multi-step entropy calculations, such as the entropy of binary strings, the models struggled. “A lot of them got the first calculation correct, but after that almost none of them got the other two correct.”
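Shannon entropy itself reduces to a short formula: for symbol frequencies p, the entropy in bits per symbol is the sum of -p log2(p). A minimal Python version, with an illustrative balanced binary string as input (the example is ours, not drawn from the study):

```python
from collections import Counter
from math import log2

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(shannon_entropy("0110101001"))  # 1.0 for a balanced binary string
```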
Finally, Thorne tested the LLMs on a Latin square puzzle with five variables, which he had used in his first paper and which inspired the benchmark. In the grid-based puzzle, a set of values, such as numbers, letters or words, must be arranged so that each appears exactly once in every row and column.
When Thorne tested the LLMs on the puzzle in 2025, he was surprised to see that they could now solve the problem. “The only conclusion I can come to is that the paper I wrote must have been part of the training for whatever version we are now working with, or it's somewhere in the language corpus.
“So by making a puzzle that's novel, ie taking exactly the same idea but changing the context completely with different names, activities and contexts, [the LLM] showed that it can't actually do it.”
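Verifying a Latin square is mechanical, which is what makes the models’ failure on the reworded puzzle telling: the constraint is easy to state and check. A small sketch of such a validity check (the 3x3 example grid is illustrative; Thorne’s puzzle used five variables):

```python
def is_latin_square(grid: list[list[str]]) -> bool:
    """Return True if every symbol appears exactly once per row and column."""
    n = len(grid)
    symbols = set(grid[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in grid)
    cols_ok = all(set(column) == symbols for column in zip(*grid))
    return len(symbols) == n and rows_ok and cols_ok

grid = [["A", "B", "C"],
        ["B", "C", "A"],
        ["C", "A", "B"]]
print(is_latin_square(grid))  # True
```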
Which LLMs impressed
For now, Thorne has been impressed by the improvement in Google’s Gemini, which “prior to eight months ago was garbage,” he said. “I wouldn't use it for anything, but the newer versions of it are really much better and outperformed every other ChatGPT model in seconds.”
One model that surprised Thorne was Cohere, one of the original LLMs, while Microsoft’s Copilot, Chinese-owned DeepSeek and xAI’s Grok did not perform well.
Copilot came at the very bottom of the table overall, beating only GPT-4o mini, the lightweight version of OpenAI’s GPT-4o. Considering that Copilot is the AI built into Excel, that is surprising.
Overall, Thorne found that even the best LLMs only got around two-thirds of the tasks right, “so there’s a lot of error in all of them”. This means that however helpful an LLM can be with spreadsheet or programming tasks, whether producing code, generating formulas or auditing information, the results must be checked for accuracy and feasibility. “It still has utility, but it has to be used very carefully,” he advised.