prepackR

Team

Jingyun Chen: jchen9314
Anthony Chiodo: apchiodo
Sarah Watts: smwatts

Topic

A common rule of thumb for data scientist is that the data preparation process will take approximately 80% of the total time on a project. Not only is this process time consuming, but it is also considered one of the less enjoyable components of a project (Forbes, 2016). To help address this problem, we have decided to build a package that will help improve some of the common techniques used in data preparation. This includes a function that will streamline the process of splitting a dataset into testing and training data (and provide a model ready output!), a function that incorporates more standardization methods then a data scientist could ever want and a function that will allow data scientist to quickly understand the columns and quantity with NA values in a dataset.

Function Descriptions

`splitter(X, target_column, split_size, seed)`

Description: create a function that operates in a similar manner to scikit-learns implementation of train_test_split. Accepts a tbl_df, df, or data.frame as input. Returns tbl_df as output for each train and each test set.

Input Parameters	Input Type	Output Parameters	Output Type
X	tbl_df,data.frame,tbl	y train	1D tbl_df
target_column	integer, string	y test	1D tbl_df
split_size	numeric	X train	tbl_df
seed	numeric	X test	tbl_df

`stdizer(X, col_index=None, method, method_args)`

Description: standardize features. Accepts tbl_df, tbl, or data.frame as input. Returns tbl_df as output.

Input Parameters	Input Type	Output Parameters	Output Type
X	tbl_df, data.frame, tbl	X_standardized	tbl_df
col_index	vector of indices
method	string
method_args	named vector

`na_counter(X)`

Description: summarise the missing data (NA values) in a dataset. Accepts tbl_df, tbl, or data.frame as input. Returns tidy tbl_df with a column for NA count and NA proportion.

Input Parameters	Input Type	Output Parameters	Output Type
X	tbl_df,data.frame,tbl	X_na_counter	tbl_df

Relationship to the R ecosystem

`splitter`

This function does not currently exist in R. However, the function sample from the base R package will be relevant. Specifically, it returns a random subset of an input vector based on a specified size. In order apply a random split to the training/testing data based on a specified percentage, we will need to leverage this function.

`stdizer`

The function scale exists in the base R package. It allows you to standardize by:

Subtracting mean and dividing by standard deviation
Subtracting mean
Dividing by standard deviation

However, this function is not a one-stop shop to scale by:

Subtracting a first value, then dividing by second (a user specified mean and standard deviation)
Making a range from a start to end value (to linearly transform the data from a user specified minimum to maximum)

As a result, this function will allow users more options for their method of standardization.

These standardization techniques are based on the Minitab documentation.

`na_counter`

This function does not currently exist in R. However, the function is.na() from the base R package is relevant because it allows users to identify NA values. Specifically, it returns a logical vector of the same length as the argument x, where the vector will contain TRUE for NA elements and FALSE otherwise. To count the number of NA values in each column, we will need to leverage this existing function.

apchiodo / prepackR