Since COVID-19 is the year most hot topic i decided that might be a good idea to test my newly created experiment with Go in order to check the Benford Law on public datasets.
So first of all what is the Benford's Law?
According to online resources the Benford’s law (also called the first digit law) states that the leading digits in a collection of data sets are probably going to be small.
In a nutshell, if we have a dataset containing numbers we will mostlikely find numbers starting with 1,2,3,4,5 since these will cover almost 75% of the number distribution (e.g. starting with 1 - 31% , 2 17% etc..)
My little experiment starts with the COVID-19 Dataset in order to see if those data are considered Benford's Law "Compliant".
The dataset is populated by COVID-19 cases on a daily basis around 17-18 PM CEST from the italian govt's "Protezione Civile", if you are interested into this data you can find it here:
For almost one year i integrated these datasets into a Telegram Channel in order to track down cases and make some stats that you don't get from the TV stations (if you are interested here is the link: https://t.me/covidinfoshare - powered by a scheduled GitlabCI pipeline 🙂)
Returning back to our Benford's Law, here are the results of my small script runs against the same dataset but on different scopes such as regions,province and entire nation:
TLDR: BFC-0 Not Benford Law Complaint, BFC-4 Compliant till the fifth digit
Google Sheet link here
In case you want to run the bf_checker on your machine feel free to clone the repo here and execute the command to process the remote dataset (you can find the samples in the same repo).
As you can see the results are not 100% Benford's Law Compliant as you may expect but we can see a clear pattern from local level (dataset-province) to nation level (dataset-nazionali) because of the granularity of data (if you have more numbers mostlikely you will reduce noise such as timestamps/dates an so on)
Another interesting thing about this first digit law is the enourmous number of papers/studies around this law that shows clearly that can be used for different scopes such as:
- Forensics (e.g. fake pictures, file manipulation)
- Fraud checks (Taxes payment forms...)
- Data validation
But there are always limits because not any kind of dataset can be checked for example running the checker on log files may lead to a very large number of 2 digits (2021/xx/xx + time etc..) thus making the results less relevant (or forcing the 2 the first digit)
Links and Resources:
- Benford's checker Github
- Dataset Protezione Civile - Italian COVID-19 data
- Documentary: Connected 2020 - Episode - Numbers
CTRL+R(reverse-i-search)`add': git add . && git commit -m "2021-03-19-benford-law-checker Commit" && git push origin master