jiko23 / SQL_etl-query-faster

An idea to make SQL bulk data insertion, deletion, and updates easier.


The following steps have been performed:

1> The first step is to insert a large amount of data into a SQL table in very little time. In the program, the function 'table_creation(data)' does this work: it first checks whether the table already exists, drops it if so, and then creates the 'Product' table in the SQL database. This lets the program load a large .csv file into the database table very quickly. A sketch of the idea is shown below.
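The following is a minimal sketch of this bulk-load idea, assuming the CSV is read into a pandas DataFrame and loaded through a SQLAlchemy engine; the connection string, file path, and batching parameters are illustrative assumptions, not the repository's exact code.

```python
import pandas as pd
from sqlalchemy import create_engine, text

def table_creation(data: pd.DataFrame, engine) -> None:
    """Drop 'Product' if it already exists, then bulk-load the DataFrame into it."""
    with engine.begin() as conn:
        conn.execute(text("DROP TABLE IF EXISTS Product"))
    # method="multi" sends many rows per INSERT statement, which is what makes
    # the load fast compared with inserting rows one at a time.
    data.to_sql("Product", engine, index=False, method="multi", chunksize=1000)

# Example usage (the connection string and file path are placeholders):
# engine = create_engine("sqlite:///products.db")
# table_creation(pd.read_csv("products.csv"), engine)
```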

2> The second step is to record the position of each row, because queries like UPDATE spend most of their time locating the matching row in the table before applying the change. The idea is therefore to fetch the data from the SQL table into a DataFrame. In the program, the function '_index_dict' finds the indices of the rows keyed by the primary key specified in the assignment, i.e. 'sku', storing them in the format _index_list = {'sku value': [indices]}. Building this index takes some time, but as per the assignment this section's time complexity should not be counted. A sketch is given below.
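A rough sketch of the index-dictionary idea, assuming the table data sits in a pandas DataFrame with an 'sku' column; the function body here is an assumption for illustration, not the exact code in the repository.

```python
from collections import defaultdict
import pandas as pd

def _index_dict(df: pd.DataFrame) -> dict:
    """Map each 'sku' value to the list of row positions where it appears."""
    index_map = defaultdict(list)
    for pos, sku in enumerate(df["sku"]):
        index_map[sku].append(pos)
    return dict(index_map)

# Records sharing an sku (the duplicates handled in step 3) are simply the
# entries whose list holds more than one position:
# duplicates = {k: v for k, v in _index_dict(df).items() if len(v) > 1}
```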

3> The third step is to update the data. Updating records directly in the SQL table is slow because the database must first find the matching rows and only then apply the change; with a huge number of records, most of the time goes into that lookup. Since '_index_dict' has already stored the row positions, the update is performed on the DataFrame that holds the table's data, which takes far less time. The update function demonstrated here is an example that updates records with a duplicate 'sku' (primary key). The program gives the user the flexibility to inspect the data and decide which column to update and what value to set: it asks whether a record should be updated; on 'yes' it prompts for the column and the new value, on 'no' it jumps to the next record, and on 'q' it quits, drops the existing table, and recreates it with the updated data. A sketch of this loop follows.
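An illustrative sketch of the interactive update loop described above, assuming the index map from '_index_dict' and a DataFrame holding the table data; the prompts, the function name, and the rebuild step are assumptions modelled on the description, not the exact code in postman_2.py.

```python
import pandas as pd
from sqlalchemy import text

def update_duplicates(df: pd.DataFrame, index_map: dict, engine) -> None:
    """Interactively update rows sharing a duplicate 'sku', then rebuild the table."""
    quit_requested = False
    for sku, rows in index_map.items():
        if quit_requested or len(rows) < 2:    # only duplicated primary keys
            continue
        for pos in rows:
            print(df.iloc[[pos]])              # let the user inspect the record
            choice = input("Update this record? (yes/no/q): ").strip().lower()
            if choice == "q":
                quit_requested = True
                break
            if choice != "yes":
                continue
            column = input("Column to update: ").strip()
            value = input("Value to set: ").strip()
            df.iloc[pos, df.columns.get_loc(column)] = value
    # Rebuild the table from the updated DataFrame, as in table_creation.
    with engine.begin() as conn:
        conn.execute(text("DROP TABLE IF EXISTS Product"))
    df.to_sql("Product", engine, index=False, method="multi", chunksize=1000)
```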

4> The fourth step is the aggregation (counting the number of products for each name). After the data has been updated, the 'aggregate_query' function counts the products per name on the DataFrame fetched from the updated table, and then creates a SQL table with the aggregated data, consisting of the columns 'Product_counts' and 'name'. A sketch is shown after this paragraph.
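A minimal sketch of the aggregation step, assuming the updated data is in a DataFrame with a 'name' column; the output table name is an illustrative assumption.

```python
import pandas as pd

def aggregate_query(df: pd.DataFrame, engine) -> None:
    """Count products per name and store the result in a new SQL table."""
    counts = (
        df.groupby("name", as_index=False)
          .size()
          .rename(columns={"size": "Product_counts"})
    )
    # "Product_counts_table" is a placeholder name for the aggregated table.
    counts.to_sql("Product_counts_table", engine, index=False, if_exists="replace")
```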

This whole program was written in Python. At the start I thought it would be difficult to do this programmatically, mainly the update process and the bulk data insertion, but after some research I managed it. My approach is different because people normally insert or update records in batches, whereas I wanted to build something with a completely different concept. For storing the indices of the records I first tried a HashMap data structure, but it was still slow because the data was large, so I later switched to a dictionary, which is also a kind of hash table. Given more time, I would have focused on making the index storage faster with some other data structure. To conclude, I did my best to implement my concept and complete the task, and I achieved at least 95% of it; the index storage is the part I would most like to improve. (Kindly change the file locations mentioned in the program as per your needs.)

Running the program: just run python postman_2.py, keep the original data file in the same location as the program, and set the locations of all other files to the same location as the program file.
