delphieritas / kaggle_credit_dataset_processing


Kaggle 'Home Credit' Dataset Processing

This document processes the following Kaggle dataset by concatenating the supplemental tables (POS_CASH_balance.csv, credit_card_balance.csv, previous_application.csv, installments_payments.csv, bureau_balance.csv and bureau.csv) into application_train.csv:

https://www.kaggle.com/c/home-credit-default-risk/data?select=application_train.csv

There are 7 different sources of data:

• application_train: the main training data, with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET column indicating 0: the loan was repaid, or 1: the loan was not repaid.
• bureau: data concerning clients' previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
• bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
• previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
• POS_CASH_balance: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
• credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
• installments_payments: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
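
To sanity-check these relationships locally, here is a minimal sketch (not part of the processing scripts below; it assumes the csv files sit in a local 'dataset/' folder) that prints each table's shape and the ID columns linking it to the other tables:

import pandas as pd

folder = 'dataset/'
tables = ['application_train', 'bureau', 'bureau_balance', 'previous_application',
          'POS_CASH_balance', 'credit_card_balance', 'installments_payments']
for name in tables:
    df = pd.read_csv(folder + '{}.csv'.format(name), dtype=object)
    # keep only the ID columns that act as join keys between tables
    ids = [c for c in df.columns if c in ('SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU')]
    print(name, df.shape, 'key columns:', ids)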


According to the dataset description, bureau_balance.csv is first inner joined with bureau.csv on the 'SK_ID_BUREAU' attribute, forming a combined bureau table.

Then the following 4 tables, together with the combined bureau table above, are inner joined with application_train.csv on the 'SK_ID_CURR' attribute: POS_CASH_balance.csv, credit_card_balance.csv, previous_application.csv, installments_payments.csv.

Each supplemental table eventually becomes a single serialized column in the final combined csv, keyed on 'SK_ID_CURR'.

To preserve the potential connections among the supplemental tables, each supplemental table is sorted on its key attribute before its entries are packed into the final combined file.
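
To make the final layout concrete, here is a toy illustration (made-up values) of how a few bureau_balance rows for one 'SK_ID_BUREAU' collapse into a single cell, using the same separators as the handler() function below: '|' between attributes and ';' between entries.

import pandas as pd

# made-up bureau_balance rows; only the structure matters here
toy = pd.DataFrame({
    'SK_ID_BUREAU': ['5714462', '5714462', '5714463'],
    'MONTHS_BALANCE': ['-1', '-2', '-1'],
    'STATUS': ['C', '0', 'X'],
})
rows = toy[toy['SK_ID_BUREAU'] == '5714462'].drop(columns=['SK_ID_BUREAU'])
# '|' separates the attributes of one entry, ';' separates the entries
cell = ';'.join(rows['MONTHS_BALANCE'] + '|' + rows['STATUS'])
print(cell)  # -1|C;-2|0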


Code scripts for the above processing are provided below.

Import dependencies and set the dataset directory; the pandas package is required at version 1.3.1 or higher.

import pandas as pd
import os

# set directory
folder = '.../dataset/'
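
Optionally, a quick guard for the version requirement (a small sketch, not in the original script; standard release strings such as '1.3.1' parse fine):

# abort early if the installed pandas is older than the 1.3.1 mentioned above
assert tuple(int(x) for x in pd.__version__.split('.')[:3]) >= (1, 3, 1), \
    'pandas >= 1.3.1 is required, found {}'.format(pd.__version__)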

Define the processing function:

def handler(base_table, supplemental_table, key_attribute, folder):
    '''
    base_table: a string of the base table name, e.g. 'application_train'
    supplemental_table: a list of supplemental csv file name strings, e.g. ['previous_application', 'installments_payments']
    key_attribute: a list of key_attribute strings, e.g. ['SK_ID_CURR', 'SK_ID_PREV']
    folder: a string of the dataset directory, e.g. '/tmp/user/Documents/Dataset/'
    '''

    # read base_table
    base_table = pd.read_csv(folder+'{}.csv'.format(base_table), dtype=object)
    # sort based on key_attribute
    base_table = base_table.sort_values(key_attribute,ignore_index=True)

    for idx in supplemental_table:
        # create a new column in base_table named after this supplemental table, initialised with the empty string ''
        # rows still holding '' after the loop below (i.e. with no matching supplemental entries) are dropped at the end, mimicking an inner join
        base_table[idx] = ''

        # read one supplemental_table as df
        df = pd.read_csv(folder+'{}.csv'.format(idx), dtype=object)
        # sort supplemental_table based on key_attribute
        df = df.sort_values(key_attribute,ignore_index=True)

        # loop through base_table entries
        for i in base_table.index:
            # i as each entry index in base_table
            # make mask from supplemental_table which share the same key_attribute in selected base_table entry
            mask = (df.loc[:,key_attribute] == base_table[key_attribute].iloc[i]).all(axis=1)
            if mask.any():
                # extract the matching supplemental entries and drop the ID / key columns, which would be redundant in the combined table
                selected_entries = df[mask].drop(df.columns[df.columns.isin(['SK_ID_PREV', 'SK_ID_CURR', 'SK_ID_BUREAU',*key_attribute])], axis=1)
                for j in selected_entries:
                    # j as each attribute in supplemental_table columns, loop across columns
                    if j == selected_entries.columns[0]:
                        # convert the first selected supplemental_table attribute into pandas Series, all values converted into string type
                        str_list = selected_entries[j].map(str)
                    else:
                        # concat following selected supplemental_table attributes, with '|' as the attribute separator, into pandas Series
                        str_list += '|' + selected_entries[j].map(str)
                # use ';' as the entry separator, convert the obtained pandas Series into one long string
                str_list = ';'.join(str_list) if len(str_list) != 0 else ''
                # fill the obtained string into the created new attribute in base_table
                base_table.at[i,idx] = str_list
        # inner join: remove base_table rows that found no matching supplemental entries (their cells still hold '')
        base_table = base_table[base_table[idx] != '']
    return base_table

Firstly, to prepare the 'bureau_balance' table, we inner join 'bureau_balance' into 'bureau' on the 'SK_ID_BUREAU' attribute.

In this step, base_table refers to 'bureau' and supplemental_table refers to 'bureau_balance'.

# set table names
supplemental_table = ['bureau_balance']
base_table = 'bureau'
# set which key_attribute to join on
key_attribute = ['SK_ID_BUREAU']

# set combined bureau table name to save
save_file = 'combine_{}'.format(base_table)

Concat the 'bureau_balance' table into the 'bureau' table, and save it:

combined_bureau = handler(base_table, supplemental_table, key_attribute, folder)
# save the combined base_table into 'combine_bureau.csv' with ',' as the attribute separator
combined_bureau.to_csv(folder+'{}.csv'.format(save_file), mode='a', index=False, header=True, sep=',')

Read the 'application_train' table and set 'SK_ID_CURR' as the key attribute for the inner join.

In this step, base_table refers to 'application_train' and supplemental_table refers to the remaining tables, including the combined bureau table.

note:

  • in case the files are too large, it may be better to run each supplemental_table separately against the 'application_train' table (a per-table variant is sketched after the combine step below)

# set prepared supplemental_table names; 'combine_bureau' is the combined bureau table saved above (bureau_balance.csv itself has no 'SK_ID_CURR' column)
supplemental_table = ['previous_application', 'installments_payments', 'POS_CASH_balance', 'credit_card_balance', 'combine_bureau']
# set base table name
base_table = 'application_train'
# set which key_attribute to join on
key_attribute = ['SK_ID_CURR']
# set file name to save
save_file = 'combine_{}'.format(base_table)

Now, concat all prepared supplemental tables into the 'application_train' table, and save:

combined_train = handler(base_table, supplemental_table, key_attribute, folder)
combined_train.to_csv(folder+'{}.csv'.format(save_file), mode='a', index=False, header=True, sep=',')
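
If memory is a concern (see the note above), a per-table variant might look like the following sketch; the intermediate file names are illustrative and not part of the original scripts.

# run handler() once per supplemental table and save each partial result separately,
# so one large table cannot exhaust memory or force re-running the others
for table in ['previous_application', 'installments_payments', 'POS_CASH_balance',
              'credit_card_balance', 'combine_bureau']:
    partial = handler('application_train', [table], ['SK_ID_CURR'], folder)
    partial.to_csv(folder + 'combine_application_train_{}.csv'.format(table), index=False)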

Code script for making inner-joined files

# a list of files to inner join with application_train.csv
# note: according to the kaggle dataset website, bureau_balance.csv does not connect to application_train.csv directly, so it is not listed here and is handled separately below via bureau
file_to_inner_join = ['previous_application', 'installments_payments', 'POS_CASH_balance', 'credit_card_balance', 'bureau']
folder = '../dataset/'
# make sure the output sub-directory exists
os.makedirs(folder + 'inner_joined', exist_ok=True)

base_file = pd.read_csv(folder+'{}.csv'.format('application_train'), dtype=object)

# firstly, make a new application_train.csv which only contains inner joined entries with all other files
for idx in file_to_inner_join:
    df=pd.read_csv(folder+'{}.csv'.format(idx), dtype=object)
    base_file=base_file[base_file['SK_ID_CURR'].isin(df['SK_ID_CURR'])]

# save this new application_train.csv
base_file.to_csv(folder+'inner_joined/inner_{}.csv'.format('application_train'), mode='a',index=False)

# secondly, based on this new application_train.csv, extract inner joined entries from all other files, and save 
for idx in file_to_inner_join:
    df=pd.read_csv(folder+'{}.csv'.format(idx), dtype=object)
    df=df[df['SK_ID_CURR'].isin(base_file['SK_ID_CURR'])]
    df.to_csv(folder+'inner_joined/inner_{}.csv'.format(idx), mode='a',index=False)
    if idx == 'bureau':
        # now we handle bureau_balance.csv separately
        df2=pd.read_csv(folder+'{}.csv'.format('bureau_balance'), dtype=object)
        # note, bureau_balance.csv connects to bureau.csv on the 'SK_ID_BUREAU' attribute
        df2=df2[df2['SK_ID_BUREAU'].isin(df['SK_ID_BUREAU'])]
        df2.to_csv(folder+'inner_joined/inner_{}.csv'.format('bureau_balance'), mode='a',index=False)
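
As a quick check (a sketch, assuming the inner_joined/ files were just written by the loops above): after both passes, every file should share exactly the same set of 'SK_ID_CURR' values as the filtered application_train.

# verify that each inner_*.csv lines up with inner_application_train.csv on 'SK_ID_CURR'
base_ids = set(pd.read_csv(folder + 'inner_joined/inner_{}.csv'.format('application_train'), dtype=object)['SK_ID_CURR'])
for idx in file_to_inner_join:
    inner_df = pd.read_csv(folder + 'inner_joined/inner_{}.csv'.format(idx), dtype=object)
    assert set(inner_df['SK_ID_CURR']) == base_ids, '{} does not line up with application_train'.format(idx)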

Code for describing dataset files


def get_stat(df_name, save_file, folder='dataset/', buffer=2):
    df = pd.read_csv(folder+'{}.csv'.format(df_name))
    # make a name list for one-hot convertible categorical columns
    df_col = [* df.columns]
    # iterate through columns
    for col in df:
        if (col in ['SK_ID_PREV', 'SK_ID_CURR', 'SK_ID_BUREAU']) :
            (df[col].astype('O').describe(include='all')).to_csv(folder+'{}.csv'.format(save_file), mode='a',index=True)
        else:
            # count categorical info
            df_describe = df[col].value_counts(dropna=False)
            if (df_describe.index.dtype == str) or (df_describe.index.dtype == 'O') or df_describe.size <= buffer:
                # save the description into csv, including category info as index
                df_describe.to_csv(folder+'{}.csv'.format(save_file), mode='a',index=True)
            # obtain min/max/mean/std info only when the attributes are numerical values
            elif (df_describe.index.dtype == float or df_describe.index.dtype == int): 
                (df[col].describe(include='all')).to_csv(folder+'{}.csv'.format(save_file), mode='a',index=True)
                # remove numerical columns names from list
                df_col.remove(col)
    # return categorical column names
    return {df_name: df_col}



file_to_describe = ['previous_application', 'installments_payments', 'POS_CASH_balance', 'credit_card_balance', 'bureau', 'bureau_balance', 'application_train']

convert_col={}

os.makedirs(folder + 'inner_stat', exist_ok=True)  # make sure the output sub-directory exists
for idx in file_to_describe:
    convert_col = {**get_stat(df_name = 'inner_joined/inner_{}'.format(idx), save_file = 'inner_stat/stat_{}'.format(idx), folder = folder), **convert_col}
    

Code for converting data into one-hot series

If a categorical variable has only two unique values (such as Male/Female), label encoding is fine.

For more than 2 unique categories, one-hot encoding is the safe option.
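
A toy illustration of the difference (made-up rows; pd.get_dummies is used here only for brevity, while the to_one_hot function below builds the indicator columns manually with np.eye):

import pandas as pd

toy = pd.DataFrame({'CODE_GENDER': ['M', 'F', 'F'],
                    'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans', 'Consumer loans']})
# two unique values: label encoding, i.e. a single 0/1 column, is enough
toy['CODE_GENDER'] = (toy['CODE_GENDER'] == 'F').astype(int)
# more than two unique values: one-hot encoding, one indicator column per category
toy = pd.get_dummies(toy, columns=['NAME_CONTRACT_TYPE'])
print(toy)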

import numpy as np

def to_one_hot(file_to_convert, save_file, folder='.../dataset/'):
    df = pd.read_csv(folder+'{}.csv'.format(file_to_convert))
    for col in df:
        if col not in ['SK_ID_PREV', 'SK_ID_CURR', 'SK_ID_BUREAU']:
            df_describe = df[col].value_counts(dropna=False)
            if (df_describe.index.dtype == str) or (df_describe.index.dtype == 'O'):
                # build a mapping from each category value to an integer code, and keep the category names for later column naming
                mapping = dict((c, i) for i, c in enumerate(df_describe.index))
                df_cat = df_describe.index.astype(str)
                # category strings/Objects --> int 
                df[col] = [mapping[char] for char in df[col]]
                if df_describe.size == 2:
                    # for Female/Male similar category
                    df_cat = col + '_' + '/'.join(df_cat)
                    df = df.rename(columns = {col:df_cat}) # axis=1
                elif df_describe.size > 2:
                    df_cat = col + '_' + df_cat
                    # make one-hot series
                    one_hot=np.eye(df_describe.size)[df[col]].astype(int).astype(str)
                    # drop the original categorical string attribute
                    df = df.drop(col, axis=1)
                    # concatenate the new one-hot encoding back to the DF
                    df = pd.concat([df,pd.DataFrame(one_hot, columns=[*df_cat])], axis=1)
            elif df_describe.size > 2: 
                df_col_as_float = df[col].astype('float')
                # obtain statistics
                df_describe = df_col_as_float.describe(include='all')
                # normalise numerical attributes
                df[col] = (df_col_as_float -  df_describe.loc['min']) / ( df_describe.loc['max'] - df_describe.loc['min'])
                # rename the column, adding min~max value to its original name
                df_cat = col + str(df_describe.loc['min']) + '~' + str(df_describe.loc['max'])
                df = df.rename(columns = {col:df_cat}) # axis=1
    df.to_csv(folder+'{}.csv'.format(save_file), mode='a',index=False)
    
file_to_convert = ['previous_application', 'POS_CASH_balance', 'credit_card_balance', 'bureau', 'bureau_balance', 'installments_payments', 'application_train']
               
os.makedirs(folder + 'one_hot', exist_ok=True)  # make sure the output sub-directory exists
for idx in file_to_convert:
    to_one_hot('inner_joined/inner_'+idx, 'one_hot/one_hot_'+idx, folder = folder)
