The dptools Python package provides helper functions to simplify common data processing tasks in a data science pipeline, including feature engineering, data aggregation, working with missing values and more.
The package currently encompasses the following functions:
- Feature engineering:
  - add_date_features(): create date and time-based features
  - add_text_features(): create text-based features (including counts and TF-IDF)
  - aggregate_data(): aggregate data and create features based on aggregated statistics
  - encode_factors(): perform label or dummy encoding of categorical features
- Data processing:
  - split_nested_features(): split features nested in a single column
  - fill_missings(): replace missings with specific values
  - correct_colnames(): correct column names to be unique and remove foreign symbols
  - print_missings(): print information on features with missing values
  - print_factor_levels(): print levels of categorical features
- Data cleaning:
  - find_correlated_features(): identify features with a high pairwise correlation
  - find_constant_features(): identify features with a single unique value
- Import and versioning:
  - read_csv_with_json(): read CSV where some columns are in JSON format
  - save_csv_version(): save CSV with an automatically assigned version to prevent overwriting
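As a rough illustration of the two encoding styles mentioned for encode_factors(), the sketch below uses plain pandas (pd.factorize and pd.get_dummies) rather than dptools itself. It is not the package's implementation, and the output naming of encode_factors() may differ.

```python
import pandas as pd

# a small categorical feature used only for illustration
income = pd.Series(['high', 'medium', 'low', 'low', 'no income'], name='income')

# label encoding: each category becomes an integer code (order of first appearance)
codes, levels = pd.factorize(income)

# dummy encoding: each category becomes its own indicator column
dummies = pd.get_dummies(income, prefix='income')
```

Label encoding keeps a single column, while dummy encoding adds one column per category, which matters for features with many levels.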
The latest stable release of dptools can be installed from PyPI:
pip install dptools
You may also install the development version from GitHub:
pip install git+https://github.com/kozodoi/dptools.git
After the installation, you can import the included functions:
from dptools import *
This section contains a few examples of using functions from dptools for different data preprocessing tasks. Please refer to the docstrings of the individual functions for further examples.
First, let us create a toy data frame to demonstrate the package functionality.
# import dependencies
import pandas as pd
import numpy as np

# create data frame
data = {'age': [27, np.nan, 30, 25, np.nan],
        'height': [170, 168, 173, 177, 165],
        'gender': ['female', 'male', np.nan, 'male', 'female'],
        'income': ['high', 'medium', 'low', 'low', 'no income']}
df = pd.DataFrame(data)
age | height | gender | income |
---|---|---|---|
27.0 | 170 | female | high |
NaN | 168 | male | medium |
30.0 | 173 | NaN | low |
25.0 | 177 | male | low |
NaN | 165 | female | no income |
# aggregating the data
from dptools import aggregate_data
df_new = aggregate_data(df, group_var = 'gender', num_stats = ['mean', 'max'], fac_stats = 'mode')
gender | age_mean | age_max | height_mean | height_max | income_mode |
---|---|---|---|---|---|
female | 27.0 | 27.0 | 167.5 | 170 | 'high' |
male | 25.0 | 25.0 | 172.5 | 177 | 'low' |
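To make the aggregated table easier to verify, here is a minimal sketch that reproduces the same statistics with plain pandas groupby. It only approximates what aggregate_data() returns; the helper first_mode is a hypothetical name introduced for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
    'gender': ['female', 'male', np.nan, 'male', 'female'],
    'income': ['high', 'medium', 'low', 'low', 'no income'],
})

def first_mode(series):
    # hypothetical helper: most frequent value, ties resolved by sort order
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# numeric statistics per group; rows with a missing group key are dropped
num_part = df.groupby('gender')[['age', 'height']].agg(['mean', 'max'])
num_part.columns = ['_'.join(col) for col in num_part.columns]

# mode of the categorical feature per group
fac_part = df.groupby('gender')['income'].agg(first_mode).rename('income_mode')

summary = num_part.join(fac_part).reset_index()
```

Note that the row with a missing gender does not contribute to any group, which is why only two rows appear in the result.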
# creating text-based features
from dptools import add_text_features
df_new = add_text_features(df, text_vars = 'income')
age | height | gender | income_word_count | income_char_count | income_tfidf_0 | ... | income_tfidf_3 |
---|---|---|---|---|---|---|---|
27.0 | 170 | female | 1 | 4 | 1.0 | ... | 0.0 |
NaN | 168 | male | 1 | 6 | 0.0 | ... | 1.0 |
30.0 | 173 | NaN | 1 | 3 | 0.0 | ... | 0.0 |
25.0 | 177 | male | 1 | 3 | 0.0 | ... | 0.0 |
NaN | 165 | female | 2 | 9 | 0.0 | ... | 0.0 |
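The count columns in the table can be checked with pandas string methods alone. The sketch below covers only the word and character counts, not the TF-IDF columns, and is not how add_text_features() is necessarily implemented.

```python
import pandas as pd

income = pd.Series(['high', 'medium', 'low', 'low', 'no income'], name='income')

features = pd.DataFrame({
    # number of whitespace-separated tokens in each value
    'income_word_count': income.str.split().str.len(),
    # number of characters, including spaces
    'income_char_count': income.str.len(),
})
```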
# print statistics on missing values
from dptools import print_missings
print_missings(df)
 | Total | Percent |
---|---|---|
age | 2 | 0.4 |
gender | 1 | 0.2 |
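For reference, a comparable summary can be built with plain pandas. This is a sketch under the assumption that Percent is the share of missing rows per feature, which matches the numbers shown above; print_missings() itself may compute or format it differently.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
    'gender': ['female', 'male', np.nan, 'male', 'female'],
    'income': ['high', 'medium', 'low', 'low', 'no income'],
})

# count missing values per column and express them as a share of all rows
total = df.isnull().sum()
missings = pd.DataFrame({'Total': total, 'Percent': total / len(df)})

# keep only features that actually contain missing values
missings = missings[missings['Total'] > 0].sort_values('Total', ascending=False)
```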
# displays one correlated feature from each pair
from dptools import find_correlated_features
feats = find_correlated_features(df, cutoff = 0.4, method = 'spearman')
feats
Found 1 correlated features.
['age']
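To see why a pair is flagged at this cutoff, the Spearman correlation between the two numeric features can be computed directly with pandas. The sketch below only lists pairs whose absolute correlation exceeds the cutoff; which feature of each pair find_correlated_features() reports is decided by the package and is not reproduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
})

cutoff = 0.4

# pairwise Spearman correlation; rows with a missing value are skipped per pair
corr = df.corr(method='spearman')

# collect each unordered pair once if its absolute correlation exceeds the cutoff
pairs = []
cols = list(corr.columns)
for i, left in enumerate(cols):
    for right in cols[i + 1:]:
        rho = corr.loc[left, right]
        if abs(rho) > cutoff:
            pairs.append((left, right, rho))
```

With only three complete rows, the correlation of age and height is -0.5, so the pair exceeds the cutoff of 0.4 in absolute value.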
# first call saves df as 'data_v1.csv'
from dptools import save_csv_version
save_csv_version('data.csv', df, index = False)
# second call saves df as 'data_v2.csv' as data_v1.csv already exists
save_csv_version('data.csv', df, index = False)
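The versioning behaviour described in the comments can be sketched with the standard library and pandas. The function save_with_version below is a hypothetical stand-in written for this example, not the code of save_csv_version(); it assumes versions are numbered from 1 and that the first free number is used.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_with_version(path, df, **kwargs):
    """Write df to '<stem>_v<N><suffix>' using the first N that is not taken."""
    path = Path(path)
    version = 1
    while True:
        candidate = path.with_name(f"{path.stem}_v{version}{path.suffix}")
        if not candidate.exists():
            break
        version += 1
    df.to_csv(candidate, **kwargs)
    return candidate

df = pd.DataFrame({'height': [170, 168, 173, 177, 165]})

# write twice into a temporary directory to show the version increment
with tempfile.TemporaryDirectory() as tmp:
    first = save_with_version(Path(tmp) / 'data.csv', df, index=False)
    second = save_with_version(Path(tmp) / 'data.csv', df, index=False)
    saved_names = [first.name, second.name]
```

Checking for existing files before writing is what prevents an earlier export from being overwritten.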
Installation requires Python 3.7+ and the following packages:
If you need help with the included data preprocessing functions or want to report an issue, please do so on the corresponding GitHub page.