
Testing Gen AI platforms: can they do audit work?

Author: ICAEW Insights

Published: 24 Sep 2025

Gen AI is an increasingly common component of an auditor’s toolkit. But is it good at doing actual audit work? Helen Pierpoint, Audit and Assurance Faculty Technical Manager, tests three Gen AI models.

Most firms remain in the early stages of AI adoption. Simple Large Language Model (LLM) chatbots are being used by auditors to enhance their understanding of entities and their environments via research, and to generate efficiencies. But for now, it seems that the case for using the likes of Copilot, ChatGPT, Grok or Claude to support identifying the risks of material misstatement or calculating materiality has not been made. 

I decided to see how good these models could be at performing certain audit tasks. Key questions that need to be answered when it comes to Gen AI and audit include:

  1. Can these models perform audit tasks well enough that a reasonably experienced auditor can use them to build upon?
  2. How do models compare in terms of quality, accuracy and reliability of output?
  3. Where do these chatbots fall especially short compared with a human? Or could they even perform better?

Some acknowledgements

AI technologies are in a constant state of flux – since this test was completed, a more advanced version of Grok has been released. Also, models interpret prompts in different ways. Varying the wording of a prompt can result in very different responses from different models. 

Setting up a company and creating fake audit information

A fictional company and financials were used for this test. These were also AI-generated: Copilot created a fake retailer that it called Majestique Limited, specialising in luxury fashion, accessories and jewellery. ICAEW guidance on prompts and a free Gen AI prompt enhancer (MaxAI) were used in crafting the prompts.

I needed to provide some detail on the rationale for materiality calculations and more information in relation to the format of the output. I asked Copilot to include a series of red flags that indicated the financial statements might be materially misstated. This is what it came up with: 

  • Revenue has declined year-on-year, yet the gross and operating profit margins have improved – apparently due to effective cost management from a new finance director. 
  • The source of gold for certain items of jewellery bearing the company branding was an illegal mine in Venezuela with links to human rights abuses. This has led to a decline in gold jewellery sales.
  • The US accounted for 16% of the company’s sales. 
  • Sales for the last month of the reporting year were 10% of the total sales for the year. This was attributed by management to greater customer demand at Christmas. 
  • A significant dividend of £35m was paid. 
  • Finance staff turnover was high. 
  • A shareholder’s wife works part-time for the company as a ‘technical adviser’. 
  • The inventory impairment provision decreased in comparison to the prior year. 

The models tested

The test was performed on three models:

  • Copilot (Microsoft)
  • Claude Sonnet 4 (Anthropic)
  • Grok 3 (xAI)

I used Grok and Claude instead of ChatGPT; Grok was less familiar, and Claude is recognised as being particularly good at reasoning and in-depth analysis.

Response time

Copilot provided the full report ridiculously quickly – under 20 seconds. Grok took a little longer, at one minute and 15 seconds. I was able to go and make a coffee and come back while Claude was still processing my prompt (three and a half minutes). 

Analytical review and risk of material misstatement

None of the models managed to round all of the year-on-year percentage changes correctly or consistently. Copilot was the worst offender, incorrectly rounding the percentage changes in three accounts. Grok provided a summary of implications of the drivers of change, with suggestions of alternative drivers, and what this could mean for the audit risk profile. 
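The arithmetic the models stumbled over is straightforward. As a simple sketch (the account balances below are hypothetical, not Majestique's actual figures), a year-on-year percentage change is calculated and rounded like this:

```python
# Year-on-year percentage change in an account balance, rounded to one
# decimal place. Figures are hypothetical, purely for illustration.
def yoy_change(prior: float, current: float) -> float:
    """Percentage change from the prior-year balance to the current one."""
    return (current - prior) / prior * 100

prior_revenue = 120_000_000    # hypothetical prior-year revenue
current_revenue = 111_000_000  # hypothetical current-year revenue
print(round(yoy_change(prior_revenue, current_revenue), 1))  # -7.5
```

A deterministic calculation like this is exactly where a spreadsheet never errs but an LLM, predicting tokens rather than computing, sometimes does.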

None of the models performed risk assessment procedures in the way that a human auditor would. They could not supplement their understanding of the entity’s environment through management inquiry, observation and inspection (as per ISA 315 Para. 14). They all referenced ISA 315 in their risk identification and assessment, but none made the distinction between financial statement-level risks and assertion-level risks.

Despite revenue recognition being a presumed significant risk of fraud as per ISA 240, Copilot did not deem revenue recognition to be a risk area at all. Claude did cite “presumed fraud risk per ISA 240” in relation to management override of controls and revenue recognition, but with no explanation as to why some risks were significant and others not.

Grok, however, directly tied identified risks to assertions and relevant ISAs. It also highlighted going concern as a risk. 

Materiality

There was a lot of variation in the models’ respective financial statement materiality figures. 

Claude suggested the lowest figure with profit before tax as its benchmark. Its rationale for the benchmark and the factor to be applied were anchored in ISA 320. It considered key performance indicators (KPIs), the nature of the entity, its lifecycle, and financing (as per ISA 320 A4): “For luxury retail businesses, profitability metrics are KPIs used by investors and management. Despite revenue challenges, company has demonstrated profit growth and appears to be a stable, established business. The significant dividend payment demonstrates that profit is a key focus for shareholders.”

It applied a factor of 5% to the benchmark of PBT, which is in line with ISA 320 A8. 
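The calculation itself is a simple benchmark-times-factor formula. A minimal sketch of Claude's approach, assuming a hypothetical profit before tax (the article's fictional financials are not disclosed here):

```python
# Financial statement materiality = benchmark x factor (ISA 320 A8 style).
# The PBT figure below is hypothetical, for illustration only.
def fs_materiality(benchmark: float, factor: float) -> float:
    """Materiality as a chosen percentage factor applied to a benchmark."""
    return benchmark * factor

profit_before_tax = 20_000_000  # hypothetical PBT for the fictional retailer
materiality = fs_materiality(profit_before_tax, 0.05)  # Claude's 5% factor
print(f"FS materiality: £{materiality:,.0f}")  # prints £1,000,000
```

The judgement, of course, lies in choosing the benchmark and the factor, not in the multiplication.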

Copilot opted for a revenue benchmark. Its rationale for this was vague, suggesting that revenue was “stable and less susceptible to manipulation” than operating profit, without clear evidence. It’s possible that this was a hallucination.

Copilot applied 0.75% to the benchmark, saying it was within a “common threshold”. It had not considered factors that might prompt the auditor to modify the factor, for instance whether the entity is a public interest entity.

In keeping with ISA 320, all three models were quite good at considering their understanding of the entity, the risk assessment, and prior period misstatements when calculating performance materiality.

According to the FRC’s 2017 Thematic Review of Materiality, most reviewed firms tended to prescribe a performance materiality percentage range of between 50% (high risk of aggregate uncorrected misstatements being material) and 75% (low risk). Grok suggested a mid-range percentage of 60%, indicating moderate risk. Given the models had no information about prior period misstatements, it was surprising that Copilot did not default to this moderate risk assumption.

Claude proposed the highest percentage of all three: 75%, low risk. It said that factors such as the “long-established audit relationship” with “no significant history of misstatements” and “no explicit evidence of control deficiencies” supported this. It missed that a new FD and high finance staff turnover might indicate control weaknesses. It also assumed that a long-standing auditor relationship automatically meant lower risk.
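The FRC's observed range maps neatly onto a simple risk-to-percentage scale. A hedged sketch, assuming the 50%/60%/75% points discussed above and a hypothetical financial statement materiality figure:

```python
# Performance materiality as a percentage of FS materiality, using the
# 50%-75% range the FRC's 2017 Thematic Review observed. The FS
# materiality figure is hypothetical, for illustration only.
RISK_FACTORS = {"high": 0.50, "moderate": 0.60, "low": 0.75}

def performance_materiality(fs_mat: float, risk: str) -> float:
    """Scale FS materiality down by the factor matched to assessed risk."""
    return fs_mat * RISK_FACTORS[risk]

fs_mat = 1_000_000  # hypothetical FS materiality
for risk in RISK_FACTORS:
    print(f"{risk}: £{performance_materiality(fs_mat, risk):,.0f}")
```

The models all landed somewhere on this scale; the disagreement was over which risk assessment, and hence which factor, the red flags justified.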

This experiment shows that while Gen AI can produce a polished-looking report extremely quickly, it cannot currently do many, if not most, things a human auditor can do, such as developing deep knowledge and understanding of the entity in question. Although there were some reasonable assessments of significant risks and materiality calculations, the fact that they all differed depending on the model shows that human judgement is required to interpret the output and decide upon the risks and the most suitable materiality thresholds to apply. 

Of course, we let the models loose on fictional data; had a real company been used, perhaps Gen AI’s digging into company history, records or media coverage would have surfaced more information to fuel its assessment of the entity, its environment and, hence, its risks.

ICAEW has a dedicated Generative AI Guide to help members understand how accountants are harnessing this technology to facilitate their work (‘use cases’), as well as its risks and limitations.

