A Python script for big data purification of a sqlite3 database.
This repository contains code that cleans big data: company names stored in the sqlite3 database semos_company_names.db.
- The class CleanCompanyNames from the script sqlite3_with_pandas.py presents the big data purification in three ways: with the multiprocessing package, with the threading library, and without either. The database update is done in two ways: with dataframe to sql and with SQL update queries. The script also reports the time needed to complete the purification.
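As a rough illustration of the comparison described above, the following sketch times a sequential run against a threaded run. It is not the code from sqlite3_with_pandas.py: the cleaning rule, the sample names, and the helpers `clean_name`, `clean_sequential`, `clean_threaded` and `timed` are all hypothetical.

```python
import threading
import time


def clean_name(name):
    """Hypothetical cleaning rule: drop a few legal suffixes, tidy spacing."""
    for suffix in (" LTD", " LLC", " INC"):
        name = name.replace(suffix, "")
    return " ".join(name.split()).title()


def clean_sequential(names):
    return [clean_name(name) for name in names]


def clean_threaded(names, workers=4):
    results = [None] * len(names)

    def work(start, stop):
        # Each thread writes to its own slice, so no lock is needed.
        for i in range(start, stop):
            results[i] = clean_name(names[i])

    step = max(1, -(-len(names) // workers))  # ceiling division, at least 1
    threads = [
        threading.Thread(target=work, args=(start, min(start + step, len(names))))
        for start in range(0, len(names), step)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def timed(func, *args):
    """Return the function result together with the elapsed seconds."""
    started = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - started


# Made-up sample data, repeated to give the timers something to measure.
names = ["ACME  LTD", "globex LLC", "Initech INC"] * 1000
sequential, t_seq = timed(clean_sequential, names)
threaded, t_thr = timed(clean_threaded, names)
print(f"sequential: {t_seq:.4f}s, threaded: {t_thr:.4f}s")
```

Because string cleaning is CPU-bound, threads in CPython are generally limited by the global interpreter lock, so the two timings may turn out close to each other.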
Author
- Kristina Jovanovska (kristina.jovanovska@protonmail.com)
- Requirements
- Database
- How to use this code
- Use your own database
Requirements
For this script you need the following libraries, modules and packages:
- sqlite3 (documentation)
- threading (documentation)
- time (documentation)
- pandas (documentation, installation)
- multiprocessing (documentation, pip_install)
Database
The sqlite3 database semos_company_names.db has one table, companies, with 3 columns (id, name, company_name_cleaned) and 20 000 rows.
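To experiment without touching the real file, a table with the same column names can be built in memory. This is only a sketch: the column types and the two sample rows are assumptions, since the README states only the table name, the column names, and the row count.

```python
import sqlite3

# In-memory stand-in for semos_company_names.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE companies (
        id INTEGER PRIMARY KEY,      -- assumed type
        name TEXT,                   -- assumed type
        company_name_cleaned TEXT    -- assumed type
    )
    """
)

# Made-up sample rows; the real table holds 20 000 names.
conn.executemany(
    "INSERT INTO companies (name) VALUES (?)",
    [("ACME LTD",), ("Globex LLC",)],
)
conn.commit()

# Inspect the schema and the row count.
columns = [row[1] for row in conn.execute("PRAGMA table_info(companies)")]
row_count = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
print(columns, row_count)
```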
How to use this code
- To purify the big data:
1.1. Without the multiprocessing package or the threading library: in sqlite3_with_pandas.py, uncomment only the following lines
- with dataframe to sql:
  ccn = CleanCompanyNames()
  ccn.run_program_df()
- with SQL update queries:
  return df
  ccn = CleanCompanyNames()
  ccn.run_program_sql()
1.2. With the multiprocessing package: in sqlite3_with_pandas.py, uncomment only the following lines
- with dataframe to sql:
  ccn = CleanCompanyNames()
  ccn.with_multiprocessing_df()
- with SQL update queries:
  return df
  ccn = CleanCompanyNames()
  ccn.with_multiprocessing_sql()
1.3. With the threading library: in sqlite3_with_pandas.py, uncomment only the following lines
- with dataframe to sql:
  ccn = CleanCompanyNames()
  ccn.with_threading_df()
- with SQL update queries:
  return df
  ccn = CleanCompanyNames()
  ccn.with_threading_sql()
- Print rows from the sqlite3 database: in sqlite3_with_pandas.py, uncomment only the following lines
  ccn = CleanCompanyNames()
  ccn.print_rows()
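The two update strategies named above, dataframe to sql and SQL update queries, can be contrasted with a small sketch. It does not reproduce the methods of CleanCompanyNames: the helpers `make_db`, `clean`, `update_with_dataframe`, `update_with_sql` and `cleaned` are hypothetical, and the in-memory table and its sample rows are assumptions.

```python
import sqlite3

import pandas as pd


def make_db():
    """Build a tiny in-memory copy of the companies table (assumed types)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE companies "
        "(id INTEGER PRIMARY KEY, name TEXT, company_name_cleaned TEXT)"
    )
    conn.executemany(
        "INSERT INTO companies (id, name) VALUES (?, ?)",
        [(1, "ACME LTD"), (2, "Globex LLC")],
    )
    conn.commit()
    return conn


def clean(name):
    # Hypothetical rule: strip two legal suffixes.
    return name.replace(" LTD", "").replace(" LLC", "")


def update_with_dataframe(conn):
    # Load the table, clean in pandas, then write the whole table back.
    df = pd.read_sql("SELECT * FROM companies", conn)
    df["company_name_cleaned"] = df["name"].map(clean)
    # if_exists="replace" drops and recreates the table, so constraints
    # such as the primary key are not carried over.
    df.to_sql("companies", conn, if_exists="replace", index=False)


def update_with_sql(conn):
    # Clean in Python, then issue one UPDATE per row.
    rows = conn.execute("SELECT id, name FROM companies").fetchall()
    conn.executemany(
        "UPDATE companies SET company_name_cleaned = ? WHERE id = ?",
        [(clean(name), row_id) for row_id, name in rows],
    )
    conn.commit()


def cleaned(conn):
    query = "SELECT company_name_cleaned FROM companies ORDER BY id"
    return [row[0] for row in conn.execute(query)]


conn_df = make_db()
update_with_dataframe(conn_df)

conn_sql = make_db()
update_with_sql(conn_sql)

print(cleaned(conn_df), cleaned(conn_sql))
```

The dataframe route rewrites the full table in one call, while the SQL route touches only the cleaned column and leaves the table definition intact.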
Use your own database
- Change the database name, table name and column names to your own sqlite3 database name, table name and column names.
- In the call to pd.read_sql(), set chunksize to a value appropriate for the number of rows in your table.
- In the methods that use SQL update queries, change the following line to define the start, end, and chunk size of a single query:
  n = [i for i in range(0, 25000, 5000)]
- Use the replace() method for any additional occurrences of substrings that need to be replaced.
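The range in the line above yields the boundaries 0, 5000, 10000, 15000 and 20000, which cover the 20 000 rows in four chunks. The sketch below scales this down to 20 rows and shows one possible way to turn such boundaries into per-chunk queries, with chained replace() calls doing the cleaning. How the script itself consumes `n` is not shown in this README, so the pairing of consecutive boundaries, the `clean` helper, and the sample data are assumptions.

```python
import sqlite3

# Scaled-down in-memory table: 20 made-up rows instead of 20 000.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies "
    "(id INTEGER PRIMARY KEY, name TEXT, company_name_cleaned TEXT)"
)
conn.executemany(
    "INSERT INTO companies (id, name) VALUES (?, ?)",
    [(i, f"Company {i} LTD.") for i in range(1, 21)],
)


def clean(name):
    # Chain one replace() call per substring that should be removed.
    return name.replace(" LTD.", "").replace(" LLC", "")


# Same pattern as range(0, 25000, 5000), scaled down: start, end, chunk size.
n = [i for i in range(0, 25, 5)]

# Assumption: consecutive boundaries delimit one chunk of ids each.
for start, end in zip(n, n[1:]):
    rows = conn.execute(
        "SELECT id, name FROM companies WHERE id > ? AND id <= ?",
        (start, end),
    ).fetchall()
    conn.executemany(
        "UPDATE companies SET company_name_cleaned = ? WHERE id = ?",
        [(clean(name), row_id) for row_id, name in rows],
    )
conn.commit()

# Count rows that were not cleaned and look at the first result.
missing = conn.execute(
    "SELECT COUNT(*) FROM companies WHERE company_name_cleaned IS NULL"
).fetchone()[0]
first = conn.execute(
    "SELECT company_name_cleaned FROM companies WHERE id = 1"
).fetchone()[0]
print(n, missing, first)
```

If the end value of the range is not larger than the highest id in your table, the last rows fall outside every chunk, so it is worth checking the count of uncleaned rows after a run.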