apchiodo / prepackR



  1. Jingyun Chen: jchen9314
  2. Anthony Chiodo: apchiodo
  3. Sarah Watts: smwatts


A common rule of thumb for data scientists is that data preparation takes approximately 80% of the total time on a project. Not only is this process time consuming, but it is also considered one of the less enjoyable components of a project (Forbes, 2016). To help address this problem, we have decided to build a package that improves some of the common techniques used in data preparation. This includes a function that streamlines splitting a dataset into training and testing data (and provides model-ready output!), a function that incorporates more standardization methods than a data scientist could ever want, and a function that lets data scientists quickly see which columns in a dataset contain NA values and how many.

Function Descriptions

splitter(X, target_column, split_size, seed)

Description: split a dataset into training and testing sets, in a similar manner to scikit-learn's implementation of train_test_split. Accepts a tbl_df, tbl, or data.frame as input. Returns a tbl_df as output for each train set and each test set.

| Input Parameters | Input Type | Output Parameters | Output Type |
| --- | --- | --- | --- |
| X | tbl_df, data.frame, tbl | y train | 1D tbl_df |
| target_column | integer, string | y test | 1D tbl_df |
| split_size | numeric | X train | tbl_df |
| seed | numeric | X test | tbl_df |

stdizer(X, col_index=NULL, method, method_args)

Description: standardize features. Accepts tbl_df, tbl, or data.frame as input. Returns tbl_df as output.

| Input Parameters | Input Type | Output Parameters | Output Type |
| --- | --- | --- | --- |
| X | tbl_df, data.frame, tbl | X_standardized | tbl_df |
| col_index | vector of indices | | |
| method | string | | |
| method_args | named vector | | |


na_counter(X)

Description: summarise the missing data (NA values) in a dataset. Accepts tbl_df, tbl, or data.frame as input. Returns a tidy tbl_df with one column for the NA count and one for the NA proportion.

| Input Parameters | Input Type | Output Parameters | Output Type |
| --- | --- | --- | --- |
| X | tbl_df, data.frame, tbl | X_na_counter | tbl_df |

Relationship to the R ecosystem


splitter

An equivalent function does not currently exist in base R. However, the function sample from base R is relevant: it returns a random sample of a specified size from an input vector. To apply a random split into training and testing data based on a specified percentage, we will need to leverage this function.
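As a rough illustration of how sample could drive such a split, here is a minimal sketch. The helper name split_sketch, its defaults, and the list it returns are assumptions made for this example and are not the package's actual splitter() implementation. It also assumes target_column is given as a column name rather than an index.

```r
# Hypothetical helper for illustration only; not the package's splitter().
split_sketch <- function(X, target_column, split_size = 0.25, seed = 123) {
  set.seed(seed)                               # make the split reproducible
  all_rows <- seq_len(nrow(X))
  n_test <- floor(nrow(X) * split_size)        # number of rows held out for testing
  test_rows <- sample(all_rows, size = n_test) # random row indices, without replacement
  train_rows <- setdiff(all_rows, test_rows)
  feature_cols <- setdiff(names(X), target_column)

  list(
    X_train = X[train_rows, feature_cols, drop = FALSE],
    y_train = X[train_rows, target_column, drop = FALSE],
    X_test  = X[test_rows, feature_cols, drop = FALSE],
    y_test  = X[test_rows, target_column, drop = FALSE]
  )
}

# Example with a built-in dataset: hold out 25% of the 32 rows of mtcars.
parts <- split_sketch(mtcars, target_column = "mpg", split_size = 0.25, seed = 42)
```

Fixing the seed before calling sample is what makes the same rows land in the test set on every run.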


stdizer

The function scale exists in base R. It allows you to standardize by:

  1. Subtracting mean and dividing by standard deviation
  2. Subtracting mean
  3. Dividing by a scale factor without centering (by default this is the root mean square of the column, not the standard deviation)
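The three modes listed above correspond to the center and scale arguments of base R's scale. A short sketch on a small made-up matrix:

```r
# Small example matrix with two columns: (1, 2, 3) and (4, 5, 6).
x <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)

z_scores <- scale(x)                                # subtract column mean, divide by column sd
centered <- scale(x, center = TRUE, scale = FALSE)  # subtract column mean only
scaled   <- scale(x, center = FALSE, scale = TRUE)  # divide only; uses the root mean square

# To divide by the standard deviation without centering, pass it explicitly.
sd_scaled <- scale(x, center = FALSE, scale = apply(x, 2, sd))
```

Both center and scale also accept numeric vectors with one value per column, which is how the last line supplies the standard deviations directly.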

However, scale is not a one-stop shop for scaling by:

  1. Subtracting a first value, then dividing by a second (a user-specified mean and standard deviation)
  2. Mapping the data onto a range from a start value to an end value (a linear transformation to a user-specified minimum and maximum)

As a result, stdizer will give users more options for their method of standardization.
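The two additional methods can be written as simple vector arithmetic. The helpers below (standardize_with and rescale_range) are hypothetical names used only for this sketch; how stdizer exposes these methods through method and method_args is not shown here.

```r
# Hypothetical helpers for illustration only; not the package's stdizer().

# Subtract a user-specified mean, then divide by a user-specified sd.
standardize_with <- function(x, mean_value, sd_value) {
  (x - mean_value) / sd_value
}

# Linearly map x onto [new_min, new_max].
# Note: a constant vector has zero range, so the result would be NaN.
rescale_range <- function(x, new_min = 0, new_max = 1) {
  old_range <- range(x, na.rm = TRUE)
  (x - old_range[1]) / (old_range[2] - old_range[1]) * (new_max - new_min) + new_min
}

x <- c(10, 20, 30)
standardize_with(x, mean_value = 20, sd_value = 5)  # -2, 0, 2
rescale_range(x, new_min = -1, new_max = 1)         # -1, 0, 1
```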

These standardization techniques are based on the Minitab documentation.


na_counter

An equivalent function does not currently exist in base R. However, the function is.na() from base R is relevant because it allows users to identify NA values. Specifically, it returns a logical vector of the same length as the argument x, containing TRUE for NA elements and FALSE otherwise. To count the number of NA values in each column, we will need to leverage this existing function.
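A minimal sketch of counting NA values per column with is.na(): applied to a data frame, is.na() returns a logical matrix, so colSums gives the count for each column. The helper name na_summary_sketch and its output column names are assumptions for this example. It returns a plain data.frame to stay dependency-free, whereas the package describes a tbl_df output.

```r
# Hypothetical helper for illustration only; not the package's implementation.
na_summary_sketch <- function(X) {
  na_count <- colSums(is.na(X))            # TRUE counts as 1, so this is the NA count per column
  data.frame(
    column   = names(na_count),
    na_count = as.integer(na_count),
    na_prop  = as.numeric(na_count) / nrow(X),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Made-up data: column a has 2 NAs, b has 1, c has none.
df <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"), c = 1:4)
res <- na_summary_sketch(df)
```

Here res has one row per column of df, with NA counts of 2, 1 and 0 and proportions of 0.5, 0.25 and 0.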



