mateusnobre / financial_statements_elt



This is an Extract-Load-Transform pipeline built on top of Luigi. It gathers financial statements data from 2011 onward from the CVM (Comissão de Valores Mobiliários, the Brazilian securities regulator), loads it into a PostgreSQL database, and processes it to make the data easier to consume.


Requirements

  • python 3.8.5
  • pipenv python package (you can install it by running pip install pipenv)
  • docker

Getting Started

Clone the repo and enter its directory

git clone
cd financial_statements_elt

Setting up the database (only if you don't already have a DB for tests; requires Docker to be installed)

On Windows:

Setting up Database Connection (here you'll open a .env file and insert DATABASE, SERVER, UID and PASSWORD credentials)

On Linux

cp .env.sample .env
nano .env

On Windows

copy .env.sample .env
notepad .env
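
As a rough illustration of how the four values in .env can be turned into a PostgreSQL connection string, here is a minimal sketch. How the pipeline itself reads .env is not shown in this README, so the helper names parse_env and build_dsn, the default port 5432, and the sample values are all assumptions.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def build_dsn(settings, port=5432):
    """Build a libpq-style connection string from the four .env keys.

    Assumes the values contain no spaces; the port is an assumed default.
    """
    required = ("DATABASE", "SERVER", "UID", "PASSWORD")
    missing = [key for key in required if not settings.get(key)]
    if missing:
        raise KeyError("missing .env keys: " + ", ".join(missing))
    return (
        f"host={settings['SERVER']} port={port} "
        f"dbname={settings['DATABASE']} user={settings['UID']} "
        f"password={settings['PASSWORD']}"
    )


# Placeholder values for illustration only, not real credentials.
sample = """
# local test database
DATABASE=financial_statements
SERVER=localhost
UID=postgres
PASSWORD=change-me
"""

print(build_dsn(parse_env(sample)))
```

A string in this form is accepted by PostgreSQL client libraries built on libpq; failing early on a missing key tends to be easier to debug than a connection error later in the pipeline.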

Install postgres on your machine

On Linux (Ubuntu):

Setting up the Python Virtual Environment and Installing Required Packages

On both Linux and Windows (make sure you are in the repository's working directory)

pipenv shell
pipenv install

Running the Pipeline

In the script, you can change the global variables YEARS (which years to load data from), FILE_PREFIXES (whether to process quarterly and/or yearly data) and TABLE_SUFFIXES (which tables to process)
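
To give a feel for how these three variables interact, the sketch below enumerates every (year, prefix, suffix) combination a run would cover. The values are illustrative assumptions, not the script's defaults: "itr" and "dfp" stand in for quarterly and yearly data, and "BPA_ind" and "DRE_ind" stand in for two statement tables. The helper planned_loads is hypothetical.

```python
from itertools import product

# Illustrative values only; check the script for the real ones.
YEARS = list(range(2011, 2014))          # 2011, 2012, 2013
FILE_PREFIXES = ["itr", "dfp"]           # assumed: quarterly and yearly data
TABLE_SUFFIXES = ["BPA_ind", "DRE_ind"]  # assumed: two statement tables


def planned_loads(years, prefixes, suffixes):
    """List one label per (year, prefix, suffix) combination."""
    return [
        f"{prefix}_{suffix}_{year}"
        for year, prefix, suffix in product(years, prefixes, suffixes)
    ]


jobs = planned_loads(YEARS, FILE_PREFIXES, TABLE_SUFFIXES)
print(len(jobs))  # 3 years * 2 prefixes * 2 suffixes = 12
print(jobs[0])    # itr_BPA_ind_2011
```

Because the workload grows as the product of the three lists, trimming YEARS is usually the quickest way to shorten a test run.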


Scheduling the Pipeline

On Windows:


After running the commands above, you'll have:

  • A PostgreSQL Database running inside a docker container on your machine with two schemas: staging (raw data) and data_warehouse (processed data)
  • A cron job that runs the pipeline every month to get the latest data

Checking when the data is updated

We get our data from the CVM site, where you can check whether the data for your desired year has been updated (the data is updated every week with corrections and re-presentations)

What can you do with it?

Access thousands of documents from companies listed on B3 with a single query and analyze a company's financial health very quickly

How to use it -- OUTDATED

You can choose among the 15 tables and write a query like this:

select *
from dwh.balanco_ativo_ind
where cnpj_cia = '47.508.411/0001-56'
    and quarter in ('2017Q4', '2018Q4', '2019Q4', '2020Q3')

Or like this, to get the data for the quarters you want:

select *
from dwh.demonstracao_resultado_ind
where denom_cia like '%NomeDaEmpresa%' -- replace NomeDaEmpresa with (part of) the company name
    and quarter in ('CUM_2017Q4', 'CUM_2018Q4', '2019Q2', 'CUM_2020Q3')
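
When a query like the ones above is issued from Python, passing the company name and quarters as parameters is generally safer than pasting them into the SQL text. The sketch below only builds the statement and its parameter tuple; it assumes a driver that uses %s placeholders (as psycopg2 does), matches the company name by pattern, and the function name build_statement_query and the table whitelist are hypothetical.

```python
# Table names cannot be sent as query parameters, so they are checked
# against a whitelist instead. Only the two tables named above are listed.
ALLOWED_TABLES = {"balanco_ativo_ind", "demonstracao_resultado_ind"}


def build_statement_query(table, company_pattern, quarters):
    """Return (sql, params) for a company-name pattern and a list of quarters."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    if not quarters:
        raise ValueError("at least one quarter is required")
    placeholders = ", ".join(["%s"] * len(quarters))
    sql = (
        f"select * from dwh.{table} "
        f"where denom_cia like %s and quarter in ({placeholders})"
    )
    # The % wildcards live in the parameter value, not in the SQL text.
    params = (f"%{company_pattern}%", *quarters)
    return sql, params


sql, params = build_statement_query(
    "demonstracao_resultado_ind", "NomeDaEmpresa", ["2019Q2", "CUM_2020Q3"]
)
print(sql)
print(params)
```

The returned pair can then be handed to a cursor's execute(sql, params) call in a DB-API driver that uses this placeholder style.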

Filling the spreadsheet (make sure the CNPJ is in this format: 47.508.411/0001-56)

python CNPJ_CIA
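
Since the CNPJ has to match the punctuated layout shown above, a small check before running the script can catch formatting mistakes. This is a minimal sketch with hypothetical helper names; it validates only the layout (14 digits with the dots, slash and hyphen in place), not the CNPJ check digits.

```python
import re

# Layout: NN.NNN.NNN/NNNN-NN
CNPJ_PATTERN = re.compile(r"[0-9]{2}\.[0-9]{3}\.[0-9]{3}/[0-9]{4}-[0-9]{2}")


def is_formatted_cnpj(value):
    """Return True if value has the punctuated CNPJ layout."""
    return CNPJ_PATTERN.fullmatch(value) is not None


def format_cnpj(value):
    """Strip non-digits and re-insert the standard CNPJ punctuation."""
    digits = re.sub(r"[^0-9]", "", value)
    if len(digits) != 14:
        raise ValueError("a CNPJ has exactly 14 digits")
    return f"{digits[:2]}.{digits[2:5]}.{digits[5:8]}/{digits[8:12]}-{digits[12:]}"


print(is_formatted_cnpj("47.508.411/0001-56"))  # True
print(is_formatted_cnpj("47508411000156"))      # False
print(format_cnpj("47508411000156"))            # 47.508.411/0001-56
```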


  ☐ Check the last modified date of each year before downloading it (tip: use BeautifulSoup)
  ☐ Storage optimizations (create tables to store company info and info from the ds_conta strings)
  ☐ Process DVA, DMPL, DFC_MI, DFC_MD and CIAS_ABERTAS data into the data warehouse (tip: use the existing SQL as a base; the logic is almost the same)


