#Getting and Cleaning Data - Course project
This data set is based on the Human Activity Recognition Using Smartphones Dataset (v1.0) from www.smartlab.ws. The HAR data set contains 30 subjects, each of which participated in 6 different activities while measurments from a phone on the subject were recorded. Of the over 500 features available in the HAR data set, 64 were chosen to be summarized in this data set. Those features chosen were then averaged so that we have one record per subject*activity pair.
This data set includes the following files:
- README.md - the file you are reading
- code_book.md - contains a description of column names as well as a full list of all columns
- run_analysis.R - R-script to calculate and write 'tidy_data.txt'
- tidy_data.txt - tidy data set containing the mean values of each feature chosen from the original data set for every combination of activity and subject_id
For each record in the data set, the following are provided:
- activity_name: which of the 6 activities the record pertains to, can be one of:
- STANDING
- SITTING
- LAYING
- WALKING
- WALKING_DOWNSTAIRS
- WALKING_UPSTAIRS
- subject_id: which of the 30 subjects the data was recorded from
- 64 additional feature columns, each containing a summary statistic calculated from the original data set (see
code_book.md
for more)
The script operates on the original HAR data set via the following steps:
- Load the data
- this step loads the various files from both the train and test folders into individual data.frames.
- Filter the list of features to only the ones that contain
mean()
orstd()
- Add the subject ID (from
subject_test.txt
andsubject_train.txt
) as a new column intest_df
andtrain_df
, respectively. - Add in the activity categories (as integers) from
y_test.txt
andy_train.txt
as a new column intest_df
andtrain_df
, respectively. - Merge
test_df
andtrain_df
into 'full_df' viarbind()
. - Create a Factor column from the
activity_id
column, applying the appropriate labels - Use the
plyr
library functionddply
to reshape the data set into a tidy, summarized data set.- This splits the data set into groups by
activity_name
andsubject_id
, and then calculates the mean for each column named invalid_features
via thecolwise
function.
- This splits the data set into groups by
- Do a whole bunch of string substitutions to convert the original feature names into more descriptive names
- Write the new, tidy data set out to a file called
tidy_data.txt
.
###NOTES:
-
When loading the measurement files (X_test.txt and X_train.txt), I use
check.names=F
to make sure that the column names are not altered (e.g.mean()-Y
does not get changed tomean...Y
) -
We want to only keep the features that are the result of calculating a mean or standard deviation on some set of measurements from the phone. When looking at all of the features available, finding
mean
orstd
in the name was not enough to filter by, as some features, such asangle(tBodyGyroJerkMean,gravityMean)
andfBodyBodyGyroJerkMag-meanFreq()
do not appear to fit the desired criteria (the first represents an angle, and the second a mean frequency, instead of a mean value.) Therefore, only those feature names that include eithermean()
orstd()
are included in the list of features to keep.