andrewhoulbrook / benfords-law

A script to compare a given dataset to Benford's Law and its generalized form.

Benford's Law

Getting Started

This script is largely the product of my learning about and experimenting with Benford's Law, especially its application to fraud detection.

I'd strongly recommend checking out Marcel Milcent's excellent Python package, which is a far more professional offering than my humble effort. Marcel's work is also now available as a PyPI package!

Benford's Law

Benford's Law, also known as the Law of First Digits, is the finding that for certain types of datasets the first digits of the numbers in a series are not uniformly distributed. Instead, smaller leading digits occur more often: a leading 1 appears about 30.1% of the time, while a leading 9 appears only about 4.6% of the time.
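
The expected first digit frequencies come from the formula P(d) = log10(1 + 1/d) for d = 1 to 9. As a minimal sketch (Python 3, standard library only; the function name `benford_first_digit` is made up for illustration and is not taken from benford.py), they can be computed like this:

```python
import math

def benford_first_digit():
    """Expected probability of each leading digit d = 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}

expected = benford_first_digit()
for digit in sorted(expected):
    # Print each probability as a percentage with one decimal place
    print("{}: {:.1%}".format(digit, expected[digit]))
```

The nine probabilities sum to 1, because the logarithms telescope to log10(10).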

A more detailed overview of Benford's Law can be found here and here.

Benford's Law is known to apply across a diverse range of datasets, for example: census data, street addresses, stock prices, house prices, population data, death rates, lengths of rivers and areas of land masses. Lottery numbers, which are drawn uniformly at random, are a common counter-example.

In general, Benford's Law usually holds when a series of numerical data has some (or most) of the following characteristics:

  • Data values have varying degrees of magnitude
  • Data values formed through mathematical combination of numbers from several distributions
  • Data with no pre-determined minimum or maximum limits
  • A relatively large number of records
  • Data doesn't contain identifier/index-type numbers (e.g. SSINs, account numbers, phone numbers)
  • Data is right-skewed, i.e. the mean is greater than the median

Data that conforms to the above characteristics is common in accounting, which has led to Benford's Law being used as a heuristic tool for detecting potential fraud.

Breaking the Law - Fraud Detection

Benford's Law can also be generalised beyond the first digit to give probability distributions for, say, the first two or the first three digits. The distribution of the last two digits has also been suggested as a way to identify artificially rounded or fabricated data and to flag data that could warrant further investigation.
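
For the generalised form, the joint probability that a number starts with the digit string n (for example n = 10 to 99 for the first two digits) is log10(1 + 1/n), and the second digit distribution follows by summing over the first digit. A short sketch under those definitions (Python 3; the function names are illustrative, not the script's own):

```python
import math

def benford_first_k_digits(k):
    """Joint probability of the first k digits, keyed by the k-digit integer."""
    low, high = 10 ** (k - 1), 10 ** k
    return {n: math.log10(1 + 1.0 / n) for n in range(low, high)}

def benford_second_digit():
    """Marginal probability of the second digit (0..9), summed over first digits 1..9."""
    joint = benford_first_k_digits(2)
    return {d2: sum(joint[10 * d1 + d2] for d1 in range(1, 10))
            for d2 in range(10)}

first_two = benford_first_k_digits(2)   # 90 entries, keys 10..99
second = benford_second_digit()         # 10 entries, keys 0..9
```

The second digit distribution is much flatter than the first digit one, and the third digit is flatter still, which is one reason the joint tests are often preferred for pinpointing anomalies.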

This report by the Association of Certified Fraud Examiners (ACFE) provides much more background on the application of Benford's Law to fraud detection and on generalising Benford's Law to include analysis of second and third digit distributions.

The first digit distribution alone doesn't help narrow the data down to a subset of potentially fraudulent records. It also doesn't necessarily generate specific leads for investigators, because the sample sizes produced by first digit tests are often impractical to manage. Tests for conformity to Benford's Law are therefore usually described as indicating only the possibility of fraud.

It's worth noting that if, for example, certain accounts data are expected to conform to Benford’s law but don't, it doesn’t necessarily mean the data are fraudulent. It could however offer reason for further investigation.
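
As an illustration of what a conformity test involves, here is a minimal sketch of a Chi-Squared goodness-of-fit statistic for the first digit test (Python 3, standard library only). The function name and the two sets of counts are made up for the example, and this is not necessarily how benford.py computes its statistic. The critical value 15.507 is the standard 5% threshold for 8 degrees of freedom (9 digits minus 1).

```python
import math

# 5% critical value of the chi-squared distribution with 8 degrees of freedom
CHI2_CRITICAL_DF8_P05 = 15.507

def chi_squared_first_digit(observed_counts):
    """Chi-squared statistic for first-digit counts (dict: digit 1..9 -> count)."""
    total = sum(observed_counts.get(d, 0) for d in range(1, 10))
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1.0 / d)
        stat += (observed_counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Made-up counts for illustration only: one Benford-like sample, one uniform sample
benford_like = {1: 301, 2: 176, 3: 125, 4: 97, 5: 79, 6: 67, 7: 58, 8: 51, 9: 46}
uniform = {d: 100 for d in range(1, 10)}

for name, counts in [("benford_like", benford_like), ("uniform", uniform)]:
    stat = chi_squared_first_digit(counts)
    verdict = "reject conformity" if stat > CHI2_CRITICAL_DF8_P05 else "cannot reject conformity"
    print("{}: chi-squared = {:.2f} -> {}".format(name, stat, verdict))
```

A statistic above the critical value only says the sample is unlikely to have come from a Benford distribution; as noted above, it is a prompt for further investigation rather than evidence of fraud.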

Prerequisites

The script is written in Python 2.7.

The script imports the following modules, which need to be installed if you don't already have them:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Installing

Required modules are available from PyPI and can easily be installed as follows:

pip install numpy
pip install pandas
pip install matplotlib

Note: these are not minor installs! Read more about NumPy and Pandas.

Running the Script

Usage Example:

$ python benford.py 1 data/fibonacci.csv

The first parameter denotes the type of Benford test to be performed. Options are:

  • '1' = First digit distribution test
  • '2' = Second digit distribution test
  • '3' = Third digit distribution test
  • '12' = First two digits joint distribution test
  • '123' = First three digits joint distribution test

The second parameter is the file path to the user's dataset.
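
To make the five test codes concrete, the sketch below shows one way the relevant digits could be pulled out of a value for each code (Python 3). The `TEST_POSITIONS` table and both function names are hypothetical, and the string-based approach assumes values written in plain decimal notation (no exponents); the script's actual implementation may differ.

```python
def significant_digits(text):
    """Digits of a plain decimal string, with sign, decimal point and leading zeros removed."""
    digits = "".join(ch for ch in text if ch in "0123456789")
    return digits.lstrip("0")

# Hypothetical mapping from the command-line test code to the digit positions it inspects
TEST_POSITIONS = {
    "1": (0, 1),    # first digit
    "2": (1, 2),    # second digit
    "3": (2, 3),    # third digit
    "12": (0, 2),   # first two digits
    "123": (0, 3),  # first three digits
}

def digits_for_test(text, test_code):
    """Return the digit(s) examined by the given test as an int, or None if too few digits."""
    start, stop = TEST_POSITIONS[test_code]
    digits = significant_digits(text)
    if len(digits) < stop:
        return None  # value has too few significant digits for this test
    return int(digits[start:stop])

print(digits_for_test("1234.5", "1"))     # first digit
print(digits_for_test("1234.5", "12"))    # first two digits
print(digits_for_test("0.00345", "123"))  # leading zeros are ignored
```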

The script's file handling is a bit rudimentary. It reads a CSV text file formatted as a single column of data, with no header row.
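
For reference, a file in that format can be read with a few lines of Python. The sketch below uses only the standard library and an in-memory stand-in for a file such as data/fibonacci.csv; the function name `read_single_column` is made up for illustration. With pandas, `pd.read_csv(path, header=None)` reads the same layout.

```python
import csv
import io

def read_single_column(handle):
    """Read a headerless, one-column CSV from an open text file and return floats."""
    values = []
    for row in csv.reader(handle):
        if not row or not row[0].strip():
            continue  # skip blank lines
        values.append(float(row[0]))
    return values

# In-memory stand-in for an input file, one value per line and no header row
sample = io.StringIO("1\n1\n2\n3\n5\n8\n\n13\n")
values = read_single_column(sample)
print(values)
```

Converting to float is an assumption that suits most datasets; very large integers lose their trailing digits as floats, although the leading digits that matter for Benford tests are preserved.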

Other Usage Information

It is left to the user to control and normalise the dataset that the script will run on. For example, the user should control for negative numbers in the dataset prior to running the script. This will most likely be a common pre-processing task for accountancy data.
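
One possible pre-processing step is sketched below (Python 3): it drops zeros and non-finite values and takes the absolute value of everything else. This is only one choice, and the function name `normalise` is illustrative. Depending on the analysis, it may make more sense to test positive and negative amounts separately rather than fold them together.

```python
import math

def normalise(values):
    """Keep the absolute values of finite, non-zero numbers; drop everything else."""
    cleaned = []
    for v in values:
        if v == 0 or math.isnan(v) or math.isinf(v):
            continue  # zeros have no leading digit; NaN/inf are not usable
        cleaned.append(abs(v))
    return cleaned

# Made-up sample mixing negatives, a zero and a missing value
raw = [-1250.0, 0.0, 310.5, float("nan"), -42.0]
print(normalise(raw))
```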

Test Data

I've included some test datasets in the /data directory of the repo, which include:

  • The first 500 Fibonacci numbers - data source. An almost perfect fit for the Benford Distribution.
    Figures: First Digit Distribution and First Two Digits Joint Distribution (Fibonacci Distribution)
  • World population by country in 2017 - data source. A statistically significant fit for the Benford Distribution.
    Figures: First Digit Distribution and First Two Digits Joint Distribution (World Population)
  • Monthly returns (Dec 1990 - May 2005) from Fairfield Sentry Ltd (a former Bernie Madoff fund) - data source. Visible deviations from expected Benford distributions, but notably not statistically significant under Chi-Squared test. Read more here about whether Madoff may have evaded detection via Benford analysis.
    Figures: First Digit Distribution and First Two Digits Joint Distribution (Fairfield Sentry Monthly Returns)
  • China 2010 Census, urban area populations - data source. The data is not a fit for the first digit Benford distribution under the Chi-Squared test but does fit other digit distributions.
    Figures: First Digit Distribution and First Two Digits Joint Distribution (Urban Populations in China)
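
The Fibonacci case is easy to reproduce without the data file. The sketch below (Python 3) generates the first 500 Fibonacci numbers, assuming the sequence starts 1, 1, 2, ... (the file in /data may start differently), and compares the observed first digit frequencies with the Benford expectation.

```python
import math
from collections import Counter

def fibonacci(n):
    """Return the first n Fibonacci numbers, starting 1, 1, 2, ..."""
    numbers, a, b = [], 1, 1
    for _ in range(n):
        numbers.append(a)
        a, b = b, a + b
    return numbers

# Python integers are arbitrary precision, so str() gives the exact leading digit
counts = Counter(int(str(f)[0]) for f in fibonacci(500))
total = sum(counts.values())

for d in range(1, 10):
    observed = counts[d] / total
    expected = math.log10(1 + 1.0 / d)
    print("{}: observed {:.3f}, expected {:.3f}".format(d, observed, expected))
```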

Plenty more examples at the fantastic testingbenford.com

To Do

  • Implement last two digits test (detect artificial rounding)
  • Add error handling
  • Improve input file handling and reading data from multiple file formats
  • Sampling issues with Chi-Squared can make it less than ideal. Implement alternative test statistics (e.g. Z tests, MAD, Kolmogorov-Smirnov)
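
As a pointer for the last item, the mean absolute deviation (MAD) is the average absolute difference between the observed and expected digit proportions. Unlike the Chi-Squared statistic, it does not grow with the number of records. The following is a rough sketch for the first digit test only (Python 3); the function name and sample counts are made up, and this is not part of the current script.

```python
import math

def mean_absolute_deviation(observed_counts):
    """MAD between observed first-digit proportions and Benford's expected proportions."""
    total = float(sum(observed_counts.get(d, 0) for d in range(1, 10)))
    deviations = [
        abs(observed_counts.get(d, 0) / total - math.log10(1 + 1.0 / d))
        for d in range(1, 10)
    ]
    return sum(deviations) / len(deviations)

# Made-up counts for illustration only
benford_like = {1: 301, 2: 176, 3: 125, 4: 97, 5: 79, 6: 67, 7: 58, 8: 51, 9: 46}
uniform = {d: 100 for d in range(1, 10)}

print("benford_like MAD: {:.4f}".format(mean_absolute_deviation(benford_like)))
print("uniform MAD:      {:.4f}".format(mean_absolute_deviation(uniform)))
```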

Built With

Authors

License

This project is licensed under the MIT License - see the LICENSE.md file for details
