Host
- Philippa Lamb
Guests
- Simon Thorne, Senior Lecturer in Computer Science, Cardiff Metropolitan University
- Mike Miller, Economic Crime Manager, ICAEW
- Martin Wheatcroft, adviser on public finances, ICAEW
Producer
- Natalie Chisholm
Transcript
Philippa Lamb: Hello, and welcome to Accountancy Insights. Three topics today, and first up: AI. We’ll be looking at the best and the worst large language models for spreadsheets and finding out if any of them are truly dependable.
Next, with Companies House ID verification just around the corner, what are the key ‘need to knows’ for accountants and for company directors? And our third item, well, that is something of an explainer. We all know how important the Office for Budget Responsibility is; its economic forecasts are always in the news and doubtless cost the chancellor a good night's sleep from time to time. But how much do you know about the OBR’s precise function and how it actually works? With the Autumn Statement next month, it feels like the perfect time to find out.
And just before we start, a reminder that the time you're about to spend listening to this podcast, it counts towards your annual CPD. So log your listens on the ICAEW website (it is very quick) and subscribe to the podcast on your preferred app so that you can take advantage of every single episode for your CPD record.
Spreadsheets and AI – 01’06”
PL: Let's talk about spreadsheets and AI. Simon Thorne, Senior Lecturer in Computer Science for Cardiff Metropolitan University, is joining us from Cardiff to talk us through the research he's been doing on this. Hi Simon.
Simon Thorne: Hi there. Good morning.
PL: So tell me about this study you've been doing. What were you hoping to find out?
ST: I was trying to understand what the real value of LLMs is in spreadsheets. This is a topic that I've been interested in since ChatGPT first came about and was a major release. What I was interested to know was: how good is this at doing real-world tasks? There are plenty of benchmarks out there for spreadsheets, but they tend to focus on a very narrow prompt/formula-type structure. And I was interested to know beyond that, could it do more complex work, a more realistic kind of work? And that was the idea behind this.
It was also to consolidate work that I'd done in 2023 for EuSpRiG, which is the European Spreadsheets Risks Interest Group, where I presented an initial examination of ChatGPT for different spreadsheet tasks, and found that its performance wasn't very consistent, and in fact, it failed many of the tasks. So I suppose a lot has happened since then, and this was an attempt to try and encapsulate that and understand what the real value might be to people who want to leverage this technology to do their work.
PL: You looked at it, as you say, a couple of years ago. This is a field that moves fast. What did you expect to find this time? Did you think things would be much better?
ST: Certainly, I thought that things would have moved forward. I could tell from interacting with ChatGPT and other LLMs that their general ability to respond to queries and more complex queries had definitely improved. You could also do things that you couldn't, such as link a document directly to it. So obviously spreadsheets can be loaded directly into it. So I was interested to know what capability it had to perform a more holistic analysis on that document, and were there any limitations to how that is technically implemented.
PL: Obviously, we can't go into all your findings right now, but do you want to just run us through: what were the major issues that you're still seeing?
ST: In essence, the issues are really still the same from my earlier paper. They're better at answering in a more holistic way, but I tested a number of different approaches. So for instance, there's a task that's in the academic literature called the wall task, and this focuses on a very, very simple spreadsheet creation type task. When I tried this in 2023 with the initial versions of ChatGPT, it was utterly unable to answer in a coherent way. When I tried it again in 2025, I found that almost all of the LLMs – I had a whole panel of different LLMs – could answer that correctly.
PL: So that's encouraging.
ST: Yes, absolutely. However, when I started to introduce more novel work to it, it was less able to respond, and broadly speaking, across the range of different LLMs, the abilities of these models are quite fractured. So some are good at doing certain things, like text handling. Others are better at creating formulas. Others are better at, say, checking a spreadsheet for errors, but even within that, there's quite a bit of inconsistency.
I had a spreadsheet, a very simple Profit and Loss type spreadsheet, with some mistakes that I'd intentionally put into it. For instance, I'd omitted the cost of goods sold in the profit calculation in one part, I'd put a data entry error into the fixed costs of another.
PL: So quite realistic errors that might happen.
ST: Yes, that's what I wanted to go for, yeah, and it was patchy at finding those things. So it may spot a mistake in one part of the spreadsheet, but it wouldn't spot exactly the same mistake elsewhere. So it's inconsistent, is what I found. And why is that? Well, I think that inconsistency comes from perhaps limitations on how it reads the file that you give it. It may not read the entire file. That’s the conclusion I came to, that potentially, it has a certain hard limit that isn't obvious to the user, as to how much of that file it can actually consume. And it gets to a limit and it stops, but it doesn't necessarily tell you that.
PL: And is there also an issue around the way these models are trained, and since they're kind of being taught to the test. Is that right?
ST: That's correct, yeah. So the LLM benchmarks are out there as a measure of success for LLMs. However, what's happening now is that these companies are training to these benchmarks, so it's slightly unsurprising that they do well in these tasks, and it erodes their value. I think we're seeing the same thing with the wall task, in as much that I ran it back in 2023. I think it's highly likely that that will have been picked up and incorporated into the training. And hence, when it comes to it in 2025, it has a much better way of answering. Novelty is a problem for it. So, when I present something novel…there's an extension of the wall task called the wall and ball, which is a much more complex calculation. It's about filling up a hot air balloon with helium and calculating surface area and, you know, complex calculations like that. So when I gave it that test, it didn't perform very well at all. And I assume that is because it hasn't seen anything like this before. This is how LLMs work.
PL: So back in 2023, which were the best and the worst, then?
ST: In 2023, I only tested ChatGPT 3.5 because it was the only one available.
PL: So now then, the most important question of all: which are the best and the worst?
ST: Now, very interestingly, the best model was Gemini 2.5 Pro, as it was called at the time. Now, I say surprisingly, because Gemini had been the worst performer around the time that I did the testing. In second place was ChatGPT, at the time, 4o. All of that's changed now, of course. So ChatGPT have, you know, released a new model, and you can't reach ChatGPT4 anymore.
PL: Yeah, there's endless iterations. It's confusing, isn't it?
ST: It is. So, the other thing to say is that things change very quickly from release to release. And that was part of the thinking behind this benchmark; to have this as a modular, repeatable test that we could use to come up with some objective judgment on which is best for this kind of work.
PL: But CoPilot, which is obviously the one, you know, it's the one a lot of people use – that performed really poorly across all the tasks. Is that right?
ST: Yes, on the whole, it did very poorly. And to be honest, this is my experience of it in general. I think what's happened there is that it hasn't changed very much since the first release of it and everything else has -
PL: So they haven't invested in it effectively?
ST: I believe that's right, yeah. I don't think it's been redeveloped. And if you look at all the other providers, they've released multiple versions of their LLMs, and there's real noticeable improvement in that. CoPilot has improved a little bit in that period, but to me, it feels like it's still that kind of early model and is less capable than the others.
PL: So as you say, Gemini is out in front right now. Obviously, they had their difficulties last year, and presumably they've invested heavily in improving it. I suppose it's a question of whether they're going to maintain that lead, isn't it?
ST: Yes. I mean, that will be interesting to see. I understand that Google plans to release the new model of Gemini, if it's going to be called Gemini, next week, so there'll be three point something out next week, and again, it'd be very interesting to see how that shapes up.
PL: Actually, by the time we release this podcast, that might be this week. So, imminently.
ST: Imminently, yes indeed, absolutely.
PL: But you've not seen that.
ST: I've not seen the advanced version, no. I haven't seen that yet. But 2.5 is very good. It's perhaps the best.
PL: So that would be your recommendation, right now?
ST: Yes, right now, that's what I would recommend.
PL: But what are the lessons here for accountants? Because none of them are stand out excellent across the board.
ST: The things you have to keep in mind when you're using these things: number one, they can definitely save you some time, but you need to reinvest some of that time to validate and verify what the output is and to make sure that it hasn't missed anything, and that all the statements coming out of it are correct. So I would never trust these things blindly. You know, they can be useful assistants, and they can do some of the work for you, but I would never assume that it's going to be fully comprehensive or fully accurate.
PL: And there's some issues emerging around the way these models handle your data.
ST: Personally Identifiable Information should never, ever be put into any LLM. The risk is that information will be trained on, and that information will come out somehow in a later release of a newer LLM that's based on that data.
PL: I mean, this is going to surprise people, because obviously in the settings, there's often the opportunity to say: no, you don't want that to happen. But are you suggesting we can't really rely on that?
ST: Well, I think it's unclear exactly how that works. I wouldn’t personally ever do that, even if it says ‘we won't record your information’. I think there's a chance you know, if it's out there, it could potentially be used.
PL: And the phrasing in the terms and conditions is quite vague, as I understand it.
ST: I believe when you dig down into it, it sort of says that it won't usually be used for training. If you go into ChatGPT as well, there's a way. Deep in the menus, there's a tick box which you can tick which says: don't train from my data. But I personally don't trust that a great deal. And there's also a sort of safe mode as well that you can use, which says ‘forget this conversation’. But again, if you look at the details, it says: ‘Oh yeah, it's deleted, but we might keep it for 24 hours’. So it's a little unclear, exactly. LLMs are very hungry for more written discourse because that's how they work, and the size of these models means that there's almost not enough human language to train on. So anything is valuable.
PL: Essentially, these things, they save a lot of time. Obviously, everyone's going to be using them. What's the best advice right now?
ST: I think the best advice right now is definitely use them, but you must check, you must validate, and you must verify. If you do those things, then you can be confident in the output and you can reap the benefits of the time it can save you, but be careful with it as well. So I feel like it kind of shifts us from being the primary workers to more the supervisors of these LLMs, and we must be sure that it's right, because we wouldn't want to make those kinds of mistakes.
PL: Thanks, Simon. We will link to your study in the show notes, but thanks very much for being with us.
ST: Thank you very much.
Companies House ID Verification – 13’28”
PL: On to Companies House ID verification. Now we've talked about it on the podcast before, but it arrives on November 18. So ICAEW in-house expert Mike Miller is going to remind us what you need to know. Hello, Mike.
Mike Miller: Hello, thank you for having me.
PL: It's imminent. Who does it apply to in the very first instance?
MM: Initially, it applies to directors of companies, members of LLPs and persons with significant control of companies. So with PSCs, there are some plans to broaden it somewhere down the line in terms of corporate members of LLPs, company secretaries, etc, in order to basically encompass everyone who has some sort of influence over the running of a company. But we're starting with the obvious people – the directors, the PSCs and the members of LLPs – from the 18th of November.
PL: So how does this ID verification work in practice? What do directors actually need to do?
MM: So for directors who are already registered with Companies House, there's essentially a one-year implementation period from the 18th of November. But for anybody who's wanting to establish a company and register with Companies House, they will have to do it immediately, essentially, to be able to fulfil their duties as being part of the register.
So there are a couple of ways of doing it. The first and probably easiest way, if you don't have an established relationship with either an accountant or a solicitor, is to essentially register on the government's own website; to do your ID verification, provide a primary source of ID, which is generally going to be a passport or driving licence.
If you don't have a passport or a driving licence – we know that's a bit of a concern for older people, particularly if they don't travel or if they don't drive – there are other ways of doing it, through secondary identification measures, such as your birth certificate, utility bills, etcetera. But essentially, you need to comply with this if you're going to establish a company. You need to do it as soon as possible.
PL: Thinking about accounting firms, is this mostly about them needing to remind their clients to get verified?
MM: Yes, I think so. If accountancy firms have clients who are directors or PSCs of companies, then they should definitely be at least knocking on the door and saying: ‘look, you need to do this’. Of course, if they're existing clients, they're probably already established within Companies House, so they will have this one-year grace period. But if anybody contacts their accountant and says: I'm thinking of establishing a company, they will have to do it imminently, because essentially, you won't be able to register a company on Companies House unless you complete your verification checks from the 18th of November.
PL: What do they need then, in order to file on behalf of clients?
MM: At the moment, it's completely fine. You can continue filing as you would do for your clients. In the future, as we are being told by Companies House at the moment, from Spring 2026, for companies to file on behalf of their clients, they will also need to be registered as an ACSP – an Authorised Corporate Service Provider – which means that they have to go through the verification checks themselves. And this leads onto a few other technical challenges in terms of doing verification, whether you verify for your clients or whether you're just filing their accounts.
But from 2026, it is expected that anybody who files accounts should be registered as an ACSP, and they're going to need director's codes too. They will have to fulfil the obligations under the director's codes. The whole idea of this is essentially to get a unique identifier for anyone who registers on Companies House.
PL: So this is every director, Mike?
MM: At the moment, it's every director who would be determined to be in control of a company. It doesn't cover, for example, large companies and those who have ‘director’ in their title, although that is something that is being explored by Companies House and may come in further down the line.
This is all done by secondary legislation, so Statutory Instruments that have been laid in Parliament. It takes a while for these things to go through. And of course, the level of responsibility of a ‘director’, in quotes, of a large firm can vary very much depending on essentially what their responsibilities are and what they're doing.
So I think that's a bigger challenge that is going to be explored by Companies House down the line. At the minute, they're really trying to…because this is such a large change for Companies House, and it requires a huge amount of resource and it requires a huge amount of attention to essentially plug the gap that has existed for quite a long time, in terms of being able to register these companies. So it's a bit-by-bit process to get it to a level of compliance that's desired by government.
PL: So thinking about ID verification right now, am I right in thinking there's some confusion between AML supervision and ACSP verification?
MM: Yes.
PL: What's going on there?
MM: That is an issue. ICAEW currently is a supervisor for AML and CTF (counter-terrorism financing), which comes under the anti-money laundering regulations. Now, Companies House has determined that any firm or any person, such as an accountant, who wants to register as an authorised corporate service provider has to be supervised by someone under the AML regulations, which I'm assuming is done by them to get a sort of baseline of compliance.
But the difference is quite stark between what is required under the AML regulations, which is essentially a risk-based approach, so you don't need to check everybody's documents thoroughly. What you assess is the level of risk for that particular person that you're supervising. And ID verification, where you do need to legally go through everything and make sure that they comply; that it's a genuine driver's licence or it's a genuine passport.
If you don't have the expertise, then you need to either use an automated system, or you need to have some training in the ability to determine that it is, which also raises some complications about, for example, overseas directors. We've had this since the register of overseas entities was established a couple of years ago, as it can be very difficult to verify overseas documents if you don't speak the language, or if you're not familiar with what the traditional forms of ID are from other jurisdictions, for example. So I guess the underlying message is: just because you comply with the AML regulations doesn't mean you're doing verification to the necessary legal standard.
PL: And it goes without saying there is comprehensive guidance about all this on the ICAEW website.
MM: There is. We will be doing more and more, both from my side and from our Professional Standards department, which has put out quite a lot of this guidance. It’s Professional Standards who oversee the AML supervision for now, although that is potentially changing because of the government announcement yesterday. So we have put out quite a lot of guidance, we will continue to put out a lot of guidance, and we will be advising firms who want to set themselves up as ACSPs. Just because you're an ACSP doesn't mean you have to offer verification, though we expect people will establish themselves as ACSPs in order to file accounts for their clients. But that doesn't necessarily mean they're going to take on new clients just for the purposes of verification.
PL: So the website for more detail, and the start date: November 18?
MM: Anyone who wants to register and establish a new company from November 18 will be required to immediately provide their identity verification, either to Companies House or through their ACSP. For those already established on the register, they have a one-year grace period, but it's probably best to do it sooner rather than later.
PL: That's really helpful. Thanks very much, Mike.
MM: Thank you.
How the OBR works – 23’14”
PL: We're going to wrap up with a look at the inner workings of the OBR. Public finance expert and adviser to ICAEW Martin Wheatcroft is with me. Hello, Martin.
Martin Wheatcroft: Hello. How are you, Philippa?
PL: Good – welcome back. Should we start at the beginning? When and why was the OBR set up?
MW: The traditional thing when a new government comes in is to blame their predecessors for everything that's gone wrong.
PL: Absolutely.
MW: And that was what happened when George Osborne came in as Chancellor with the coalition government in 2010. The perceived weaknesses in the previous government's management of the public finances led him to introduce the Office for Budget Responsibility. It's partly to address concerns about the temptation the Treasury might have to massage fiscal forecasts to get the right answer, but also, it's best practice internationally. Many other countries have a similar body.
PL: So specifically, what is it supposed to do?
MW: Well, specifically what it does is prepare independently the fiscal, economic and fiscal forecasts for the government, and that's what the Treasury then uses for the budget. And that independent preparation means that the debt markets in particular have some confidence that the forecasts are not being fiddled about with by the Treasury.
PL: I mean, it feels to me – I don't know whether this is accurate – it feels to me a little bit shadowy. I don't even know how many people work there. How big is the organisation?
MW: It's a relatively small organisation. There's about 50 people in total, of whom about 30 to 40 are economists and forecasters who actually do the work of the OBR, and it's headed by a five-person board comprising three executive board members, the so-called Budget Responsibility Committee.
PL: Who appoints them?
MW: So the government appoints them. But it's a fairly robust process, and everybody who's been appointed is a sufficiently independent individual. A couple of them have worked for the Treasury in the past, but they're now independent, and seem to be independent.
PL: Okay, so that's a perfectly robust process. Is it a fixed-term role?
MW: Yes. So Richard Hughes, for example, the chair, has a fixed five-year term. He's just been appointed this year for a second five-year term, and that's similar to his predecessor, Sir Robert Chote, who served for 10 years as the first OBR chair.
PL: Okay, now we often sloppily talk about the OBR forecast, but it actually is projections, isn't it?
MW: Yes. So technically, they don't do forecasts. They don't try and predict what will happen. What they do is put together an estimate of how the world might look if things happen as we expect them, as of the date of their projections. So they start by updating their model for the economy. They look at trends for inflation, interest rates, employment, productivity, migration, international trade, all that good stuff. And then they turn that into a projection for tax receipts and welfare payments based on the current welfare rules and tax rules and the level of interest, debt interest as well. They then overlay that with the previously announced government spending plans. So that's based on the spending review, the three-year spending review that happened earlier this year.
PL: Okay, so the intention is they bring all this specific data in, and then they attempt to look at real-world outcomes, take into account perhaps unintended consequences.
MW: Yes. So every six months or so, they update the projections for what's happened in the real world and altering views of the future. And then they turn the handle again, and then the government then gives them their plans that they're going to put into the fiscal event. So for example, tax rises, spending changes, all those sorts of things. And the OBR makes an estimate of the economic impact of those because if you increase taxes, you might get mechanically an extra bit of tax, but there might be an economic consequence to that – that means that you don't get the full amount. And that's what we saw with the national insurance rise earlier this year. We've got less because employers cut back on staffing and, yeah, absolutely not the full amount of tax receipts that we might have otherwise expected.
PL: Okay, so they effectively run their first draft past the government. Then the government tells them their spending plans, and then they factor their spending plans in. Is there a cut-off point? Is there a timeframe beyond which they have to know? Obviously they need to know what those spending plans are.
MW: Well, I mean, it's actually relatively close to the budget. It's a few days before, about a week before, but they're continuing to turn that handle because depending on what the OBR's view is on things, the Treasury will either go back and say: ‘we think it might be better than this’, or they'll say: ‘we still need a bit more money. So we're going to come up with idea number two or three, or 50’.
PL: So there is a bit of horse trading?
MW: There's a bit of horse trading in that iterative process of preparing the projections. But it's also, you know, quite a rigorous process, because the OBR is designed to provide some rigour to the process.
PL: But how accurate do these forecasts tend to be?
MW: There's two ways of looking at that. So my personal view is the projections are always perfect and accurate. The problem is, reality is usually wrong.
PL: According to the facts as they have them, their projections are excellent.
MW: Their projections are excellent. Of course, it's not quite as perfect as that, and the OBR is by no means prescient and able to predict the future very well, but they do their best. But of course, reality comes along. And so, for example, talking about the employer’s national insurance, the economic damage that that's caused has been much higher than the OBR expected. And so that's one of the challenges the OBR will come in, and they're reassessing the projections at the moment.
PL: Now, obviously, these projections, they really matter, don't they, for a wide variety of sectors, not just Westminster and the Chancellor. But what is your sense of how Westminster views the OBR?
MW: I think the key thing is just to step back a little bit and remember that the OBR is just one part of a bigger system that consists of fiscal rules that the chancellor has, and the whole Fiscal Responsibility Framework that George Osborne introduced at the time. And that overall framework means that the OBR has a couple of different effects. One, it's particularly constraining on Chancellors, and means that they are constrained in their choices, and that was part of the design. The problem is that it was very popular with George Osborne when he was Chancellor, but less so with many of his successors.
PL: And less so now, as we understand it.
MW: Yes, and so you've had an evolution. The OBR, I think, is a pretty operationally independent institution that is respected, particularly in the economics world and by many people, but disliked by sort of backbenchers and Cabinet members other than the Chancellor, because it's a tool by which the Chancellor says no. That's quite important for Chancellors. I mean, it is a good tool for them to say no. The problem is that it also says no to the Chancellor as well.
PL: Well, yes. Which brings us to the fact that there has been talk, hasn't there, that Rachel Reeves is trying to find a way to minimise the impact of the OBR on her decision making.
MW: Yes, although, I mean, as I said, it's probably the other parts of the framework that are more challenging for her. The fiscal rules that she set and is insisting that she's going to stand by are the real constraints, and the fact that she's left herself very little headroom. So the problem in the spring statement was that she had such little headroom, relatively small changes meant that she lost her headroom, and then she had to fiddle around with the numbers a little bit to make the spreadsheet work.
PL: But you can see how, from her point of view, it might be quite helpful if the OBR had to, for example, factor in the government's growth measures, or perhaps didn't do quite so many forecasts every year.
MW: Yes, I think one of the challenges is that the OBR does factor in the growth measures. The real problem is that they tend to factor them in after the five-year time horizon. So you do get a boost of growth, but of course, most of these measures that the government is bringing in are not instantaneous. They take a while to flow through to the economy, so the OBR scoring of them is not in time to help the fiscal rules, which are based on a five or four-year time horizon.
PL: She's not alone, is she? The New Economics Foundation has gone further. They've called for the forecasting to be taken back in-house, back into the Treasury. I mean, what do you make of that?
MW: I think there's a real concern there around the confidence of debt markets, because now they've been introduced to independent forecasting, there would be a real concern about the motivation for doing that, and the temptation of the Treasury just to tweak the forecast a little bit, obviously, for very good reasons.
PL: Obviously.
MW: But that's a slippery slope. And so whilst I wouldn't say the OBR is sacrosanct, and there are scenarios in which you could see a different arrangement, it's difficult to see how you get from there to here, particularly at a time when debt markets are quite sensitive to what's going on with the public finances, because they aren't in a great shape.
PL: So it speaks to probity. It speaks to reliability.
MW: Indeed.
PL: Would you like to see it just left alone?
MW: I'm sort of in two minds here, because I think the process is quite good, but the challenge is that in the current context, the government's in a hole, and how it gets out of that hole is quite difficult, and the process and the fiscal rules in particular are making it difficult for the Chancellor to get out of that hole.
PL: But you wouldn't argue that she should tweak the system herself?
MW: I think the civil service phrase is: ‘that would be very brave, minister’. I would not recommend it as a tactical thing, and I think it would be something that you need to come up with a replacement system that gave confidence to the markets, because they are nervous.
PL: Could be a lot of unintended consequences there.
MW: Yeah.
PL: Martin, thank you very much, as always.
Next time on the podcast, we'll be dissecting the Financial Reporting Council's new guidance on using AI in audit. We'll have an FRC guest to talk us through it, along with two of the senior auditors who fed into that guidance.
Over on our sister podcast, The Tax Track, the team are taking another look at Making Tax Digital with chartered accountant and MTD expert Rebecca Benneyworth: just how well prepared is the profession right now? If that's your area, you can find the podcast on any app. Thanks for listening.