A common rule of thumb for data scientists is that data preparation takes approximately 80% of the total time on a project. Not only is this process time consuming, but it is also considered one of the less enjoyable components of a project (Forbes, 2016). To help address this problem, we have decided to build a package that improves some of the common techniques used in data preparation. This includes a function that streamlines the process of splitting a dataset into testing and training data (and provides model-ready output!), a function that incorporates more standardization methods than a data scientist could ever want, and a function that lets data scientists quickly see which columns in a dataset contain `NA` values and how many.
Description: a function that operates in a similar manner to scikit-learn's implementation of `train_test_split`. Accepts a `tbl_df`, `tbl`, or `data.frame` as input. Returns a `tbl_df` as output for each train and each test set.
Input Parameters | Input Type | Output Parameters | Output Type |
---|---|---|---|
X | tbl_df, data.frame, tbl | y train | 1D tbl_df |
target_column | integer, string | y test | 1D tbl_df |
split_size | numeric | X train | tbl_df |
seed | numeric | X test | tbl_df |
Description: standardize features. Accepts a `tbl_df`, `tbl`, or `data.frame` as input. Returns a `tbl_df` as output.
Input Parameters | Input Type | Output Parameters | Output Type |
---|---|---|---|
X | tbl_df, data.frame, tbl | X_standardized | tbl_df |
col_index | vector of indices | | |
method | string | | |
method_args | named vector | | |
Description: summarise the missing data (`NA` values) in a dataset. Accepts a `tbl_df`, `tbl`, or `data.frame` as input. Returns a tidy `tbl_df` with a column for `NA` count and a column for `NA` proportion.
Input Parameters | Input Type | Output Parameters | Output Type |
---|---|---|---|
X | tbl_df, data.frame, tbl | X_na_counter | tbl_df |
This function does not currently exist in R. However, the function sample from the base R package is relevant: it returns a random subset of an input vector of a specified size. To apply a random split to the training/testing data based on a specified percentage, we will need to leverage sample.
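As a rough illustration of how sample could drive such a split, here is a minimal sketch in base R. The name `split_sketch` is hypothetical and not part of the package, and two things are assumptions: `split_size` is treated as the proportion of rows assigned to the training set, and plain data frames are returned instead of `tbl_df` objects to keep the example free of dependencies.

```r
# Hypothetical helper, not the package's actual implementation.
# Assumption: split_size is the proportion of rows used for training.
split_sketch <- function(X, target_column, split_size = 0.75, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(X)

  # sample() draws the training row indices without replacement
  train_idx <- sample(n, size = floor(split_size * n))
  test_idx <- setdiff(seq_len(n), train_idx)

  # Resolve the target column whether it is given by name or by position
  target <- if (is.character(target_column)) {
    target_column
  } else {
    names(X)[target_column]
  }
  features <- setdiff(names(X), target)

  list(
    X_train = X[train_idx, features, drop = FALSE],
    X_test  = X[test_idx, features, drop = FALSE],
    y_train = X[train_idx, target, drop = FALSE],
    y_test  = X[test_idx, target, drop = FALSE]
  )
}

# Example on the built-in iris data (150 rows, 5 columns)
parts <- split_sketch(iris, target_column = "Species", split_size = 0.8, seed = 42)
dim(parts$X_train)
dim(parts$X_test)
```

Passing the same `seed` twice yields the same split, which is the behaviour the `seed` parameter in the table above suggests.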
The function scale exists in the base R package. It allows you to standardize by:
- Subtracting mean and dividing by standard deviation
- Subtracting mean
- Dividing by standard deviation
However, scale is not a one-stop shop for scaling by:
- Subtracting a first value, then dividing by second (a user specified mean and standard deviation)
- Making a range from a start to end value (to linearly transform the data from a user specified minimum to maximum)
As a result, our function will give users more options for their method of standardization.
These standardization techniques are based on the Minitab documentation.
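The options listed above can be sketched for a single numeric vector as follows. This is only an illustration: `standardize_sketch` and its method labels and arguments (`center`, `spread`, `lower`, `upper`) are hypothetical names, not the package's `method` or `method_args` values.

```r
# Hypothetical helper, not the package's actual implementation.
# The method labels and argument names are assumptions for illustration.
standardize_sketch <- function(x, method = "mean_sd",
                               center = NULL, spread = NULL,
                               lower = 0, upper = 1) {
  switch(method,
    # Subtract the mean and divide by the standard deviation
    mean_sd = (x - mean(x)) / sd(x),
    # Subtract the mean only
    mean = x - mean(x),
    # Divide by the standard deviation only
    sd = x / sd(x),
    # Subtract a user-specified value, then divide by a second one
    custom = (x - center) / spread,
    # Linearly map the data onto [lower, upper]
    range = lower + (x - min(x)) * (upper - lower) / (max(x) - min(x)),
    stop("unknown method: ", method)
  )
}

x <- c(2, 4, 6, 8)
z_default <- standardize_sketch(x)
z_custom <- standardize_sketch(x, method = "custom", center = 5, spread = 2)
z_range <- standardize_sketch(x, method = "range", lower = 0, upper = 10)
```

With the default method, the result should agree with `as.numeric(scale(x))` from base R.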
This function does not currently exist in R. However, the function is.na() from the base R package is relevant because it allows users to identify `NA` values. Specifically, it returns a logical vector of the same length as the argument `x`, where the vector will contain TRUE for `NA` elements and FALSE otherwise. To count the number of `NA` values in each column, we will need to leverage this existing function.
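A minimal sketch of this idea, using only base R, might look like the following. The name `na_counter_sketch` and the output column names are hypothetical, and a plain data frame is returned instead of a `tbl_df` so that the example has no dependencies.

```r
# Hypothetical helper, not the package's actual implementation.
na_counter_sketch <- function(X) {
  # is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs
  na_count <- colSums(is.na(X))
  data.frame(
    column = names(X),
    na_count = as.integer(na_count),
    na_proportion = as.numeric(na_count) / nrow(X),
    stringsAsFactors = FALSE
  )
}

df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4
)
na_summary <- na_counter_sketch(df)
na_summary
```

Here column `a` has two missing values out of four rows, `b` has one, and `c` has none, so the proportions are 0.5, 0.25, and 0.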