ICAEW.com works better with JavaScript enabled.

Data analytics community

Codeless Machine Learning for Auditors

Author: Dimitris Kaskadanis & Polyna Christakou

Published: 22 Sep 2022

In the new era of digitalisation, there are emerging challenges relating to the new technology risks while fraud risk increases, adopting new and innovative methods in line with technological progress. In this environment, audit should develop data analytics skills that are beyond traditional risk monitoring and fraud detection tools to meet stakeholders’ evolving expectations.

A recent article on technology upskilling of auditors published in the ICAEW’s “Audit & Beyond” magazine, May 2022 edition, suggests that there are variations among the firms in the data analysis approaches to support audits whether this demands people with specialised coding skills (e.g. R or Python) and whether those people should have audit background or merely being technical experts aiming to support the audit teams.

The good news is that there are alternatives for performing more advanced analytics in a “no-code environment”. A recurring topic for auditors has always been the identification of irregularities or anomalies in a population which may indicate errors, omissions or fraud. Clustering is a data analytics technique that can be a useful tool to assist this pursuit. Nevertheless, not all auditors are familiar with coding languages that traditionally are required to perform such calculations.

Being internal auditors, we have experimented with clustering both in coding and no-coding environments. In this context, we soon realised the potential of Power BI to support users who are not programming language experts, while enabling the execution of externally developed machine learning scripts, therefore integrating other powerful tools.

What is Clustering?

Clustering is an unsupervised Machine Learning (ML) algorithm (i.e. an algorithm that learns and improves from experience, without input from users) that looks for patterns in data by dividing it into clusters. These clusters are created such that the points are homogenous within the cluster and heterogenous across clusters. Clustering is commonly used in market segmentation and several areas of marketing analytics as well as in fraud detection.

A Practical Example: Outliers (Fraud) Detection – Credit Card Purchases

For this article, we’ve used an anonymised credit card transactions dataset obtained from Kaggle to demonstrate some of the clustering capabilities of Power BI. We emphasise that the phases of data analysis such as ingestion and cleansing, while important, are not covered here, nor are some more technical details required to produce the final graphs (though there are many online sources to support in this area). ML techniques being used should be understood, particularly in the context of external audit, and the use of any third-party visualisations in Power BI is always at the user’s own risk.

We used the ML clustering functionalities of Power BI Desktop to identify outliers that could serve the purpose of potential fraud detection, taking the following data analysis steps:

1. Exploratory Data Analysis (EDA) – Know-Your-Data

  • Power BI through the Power Query Editor offers an overview of the dataset, assisting evaluation of the data validity and cleansing needs. It allows for an analysis of the data including column statistics like error, empty cells, unique values, min, max, average, standard deviation etc…

Power query
For instance, the statistics for the ‘credit limit’ column:
Credit limit
  • Aggregated tables & matrices (like pivot tables in excel): This is the first step that should be ideally performed in every dataset. It provides a good insight into the data we are processing as you may aggregate its data type (categorical/numerical) while also identifying which columns should be further analysed and/or cleansed (e.g. nil, missing or erroneous/irrelevant values). We note that the dataset we are working on has no categorical/qualitative criteria relating to the customers e.g. demographic criteria. Therefore, we focused our analysis on the numerical aspects i.e. number of lines, total and average balances & purchases etc.

This is an example of an aggregated table where we have filtered out the null values of the ‘credit limit’ column (one case as shown above):

credit limit
  • Data cleansing: Data exploration reveals blank values in the dataset. There is a variation of techniques to deal with missing values (e.g. exclusion, fill in with average). We decided to only filter out of our analysis the rows where the credit limit is blank (one such case).
  • Correlation plot (‘get more visuals’ section – right-click on the three dots): highlights correlations among the different data columns that could be useful for our analysis. It investigates the dependence between multiple variables at the same time and highlights the most correlated variables in a data table. Note that although this may indicate related variables, additional statistical tests are required to confirm the dependence and the causality (which variable causes the movement of the other), which goes beyond our analysis.

data frequency
We notice as expected a positive correlation between payments and cash advance, as well as balance and purchases. However, we should not neglect that we did not perform a complete data cleansing process and thus may not be accurate or could be different in a real-life environment.

2. Clustering Graphs – Know-Your-Model using Visualisation

Power BI offers various clustering graphs, to assist in understanding how the model aggregates similar characteristics into clusters; the one that is most often used is the scatter plot which shows the relationship between two numerical values.

Additional visualisation is offered under the ‘get more visuals’ section – right-click on the three dots

Some useful visualisations include:

Scatter plot (included in default visuals)

Clustering helps us interpret a scatter plot. We selected to plot balance (as an independent variable on the x-axis) and purchases (as a dependent variable on the y-axis). We then opted for ‘Automatically find clusters’, that creates the graph below.

It seems that the clusters are grouped based on the volume of the balance, i.e. the amount left in the account to make purchases.

scatter graph

Clustering graph

Apart from the scatter plot visual that allows clustering and is included in the default visuals of Power BI, other clustering visuals can be found through the ‘get more visuals’ section.

We selected the ‘Clustering’ graph, as one which is Microsoft-developed. This visual uses a well-known k-means clustering algorithm. You can control the algorithm parameters and the visual attributes to suit your needs.

We note that the specific graph (as well as other graphs that run ML algorithms) require the installation of R studio.

clustering

In both clustering graphs, Power BI automatically chose to classify data under 3 clusters. However, the user may opt for a predefined number of clusters, following a methodological assessment like the ‘Elbow criterion’ or the ‘Silhouette coefficient method’.

Outliers Detection graph

The outliers are those data points that are away from regular data points. As you may observe the outliers are the same in both charts above and are these data points with higher purchases that are away from the main population and do not form a tight cluster.

We also employed a graph to confirm our understanding of the outliers noted through the clustering process.

The ‘Outliers Detection’ graph – which again you can find under the ‘get more visuals’ section – can be used to confirm understanding of outliers noted through the clustering process. Here, it denotes the blue-coloured data points as the inliers and the red-coloured as the outliers.

outliers

Power BI gives the user a good overview of this graph. In particular, in this Custom Visual, we can implement one of five popular detection methods: Z-score, Tukey’s method, Local Outlier Factor - LOF method, Cook's distance, and by manually defining upper and lower thresholds.

We selected the Cook’s distance as this is the most commonly used diagnostic statistical value.

One Cluster to Rule Them All?

There is a lot of guidance and various videos explaining in detail each of the graphs above so that you can select the one that suits your needs. In this article, we selected the ones we considered to be more appropriate for a meaningful analysis and understanding that offers a quick and useful overview, but there are many more to choose from.

Following the identification of the outliers, you could choose a sample for further investigation to understand the reasons behind the outlier classification and verify whether these could relate to fraudulent transactions.

An auditor may benefit more from the flexibility and variety of a programming language that offers a deep dive view through machine learning and the ability to more precisely define the clustering, however, lack of coding skills should not be an obstacle when there are no-code alternatives that can serve the same purpose. Of course, even though Power BI offers a variety of choices – or can remove choice entirely, for example through the automated determination of clusters – there are many sources available for a user to go deeper and have a better understanding of the mechanics behind calculations!

Notes about the dataset

This sample Dataset summarizes the usage behaviour of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioural variables. Given that this is a public dataset, there is access to numerous freely available analyses in the web where various users experiment themselves using programming languages like Python or R.

About the authors:

Dimitris is the Head of Eurobank Internal Audit Data Analytics team. He is a Fellow Chartered Accountant (FCA) and is a Project Management Professional (PMP®)

Polyna is member of both Eurobank Internal Audit Data Analytics team and the Finance & Global Markets Audit Division. She is an Associate Chartered Accountant (ACA) and holds the ICAEW Data Analytics Certificate