ricardocalleja / Analysing_consumption_habits

Analysing consumption habits

During the COVID-19 pandemic 🦠 I was living in Chile as an immigrant. I had arrived in the capital, Santiago, on January 17th, 2020 with my mother 👵, an elderly woman (74). Fortunately, I found a conveniently located apartment next to a "Lider" supermarket 🛒, which is the Chilean version of Walmart.

The confinement started on March 22nd, so we decided to go out only once a week to buy groceries, preferably during non-busy hours (around 9am). Since I was unemployed, I controlled my budget rigorously and kept every receipt from the grocery store. After almost a year I had 70 receipts from this peculiar 2020.

In the next lines I will show you how I analysed the data from those grocery store receipts.

The data

This is what one of those receipts 🧾 looks like. My first approach to extracting the data was to scan each receipt using the Google Drive scanning functionality and then use the Google Vision API to convert the PDF file to text, and this was the result. I asked for help with regular expressions, and my friend @madacol helped me with this regex, which detects 4 important fields (product_id, quantity, product_description and unit_price). However, due to the formatting of the extracted text, it only detected 26 out of the 30 items in the receipt.
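To illustrate why a line-based regex can miss items in OCR output, here is a minimal Python sketch. It is not the regex @madacol wrote: the line layout (`product_id quantity description unit_price`), the sample products and the name `parse_receipt` are all assumptions made up for this example. The idea is to keep the lines that do not match, so the gaps (such as a price pushed onto its own line by the scan) are easy to spot.

```python
import re

# Hypothetical line layout (an assumption, not the real Lider format):
#   <product_id> <quantity> <product_description> <unit_price>
# e.g. "7801234567890 2 LECHE ENTERA 1L 1.290"
ITEM_PATTERN = re.compile(
    r"^(?P<product_id>\d{6,13})\s+"
    r"(?P<quantity>\d+)\s+"
    r"(?P<product_description>.+?)\s+"
    r"(?P<unit_price>\d{1,3}(?:\.\d{3})*)$"
)


def parse_receipt(text):
    """Return (items, unmatched_lines) for the text of one receipt."""
    items, unmatched = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ITEM_PATTERN.match(line)
        if match is None:
            # Keep the line so missing items can be reviewed by hand.
            unmatched.append(line)
            continue
        item = match.groupdict()
        item["quantity"] = int(item["quantity"])
        # Chilean peso amounts use "." as a thousands separator.
        item["unit_price"] = int(item["unit_price"].replace(".", ""))
        items.append(item)
    return items, unmatched


# Made-up sample: the third item was split across two lines by the OCR step.
sample = """\
7801234567890 2 LECHE ENTERA 1L 1.290
7809876543210 1 PAN MOLDE 2.190
7805555555555 1 ARROZ
1.590
"""
items, unmatched = parse_receipt(sample)
print(len(items), "items matched;", len(unmatched), "lines left over")
```

Comparing the number of matched items with the number of leftover lines gives a quick check of how much of a receipt the pattern actually covers.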

To solve this issue, I found that there was an option to download each receipt in digital format through the supermarket's website, so that is what I did. I downloaded the receipts manually; if anybody knows how to do it automatically, please let me know. Here you will find the 70 receipts. I copied and pasted the content of all the PDF files, used a regex to extract the previously mentioned fields, and put them into Excel to finally get this dataset.
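The copy, paste and Excel steps could also be scripted. The sketch below is only one possible way to do it, not the process I actually followed: the file names, the sample lines, the simplified line layout and the helpers `build_dataset` and `to_csv` are hypothetical. It takes the pasted text of each receipt, extracts the four fields and writes them as CSV, which Excel can open directly.

```python
import csv
import io
import re

# Assumed (simplified) layout: "<product_id> <quantity> <description> <unit_price>".
LINE = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(\d+)$")
FIELDS = ["receipt", "product_id", "quantity", "product_description", "unit_price"]


def build_dataset(receipts):
    """Turn {receipt_name: pasted_text} into a list of row dictionaries."""
    rows = []
    for name, text in receipts.items():
        for line in text.splitlines():
            match = LINE.match(line.strip())
            if match is None:
                continue  # headers, totals and other non-item lines
            product_id, quantity, description, unit_price = match.groups()
            rows.append({
                "receipt": name,
                "product_id": product_id,
                "quantity": int(quantity),
                "product_description": description,
                "unit_price": int(unit_price),
            })
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up receipts keyed by hypothetical file names.
receipts = {
    "2020-03-28.pdf": "780001 2 LECHE ENTERA 1290\nTOTAL 2580",
    "2020-04-04.pdf": "780002 1 PAN MOLDE 2190",
}
rows = build_dataset(receipts)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping the receipt name in every row makes it possible to group purchases by shopping trip later on.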
