Coursera Course: Getting and Cleaning Data
Assignment: Course Project [1]
Author: Santiago Oleas
This project gathers data captured during the study Human Activity Recognition Using Smartphones Dataset Version 1.0 [2] and cleans it according to the instructions in the course project. The project deliverables are submitted either through GitHub or through the Coursera web site:
- A link to this GitHub repository, to be provided on the Coursera project page.
- README.md: This file, which is in the GitHub repo.
- CodeBook.md: A codebook describing the data output and the sources.
- run_analysis.R: An R program that takes the data from the study and transforms it based on the instructions provided.
- step5Output.txt: Tidy data output from run_analysis.R.
README.md is this file, which provides an overview of the repo contents and describes the method that the run_analysis.R program follows.
CodeBook.md contains the following information about the project and the method for producing the final data output:
- Information about the experimental design study
- Information about the source variables, including the units.
- Information about the transformation of any source variables, including summaries.
run_analysis.R is the R program that loads the provided data, transforms it, and extracts it in the required format.

####Process
- Acquire Data

  Get the zip source data file from the provided source [3], unzip it, and observe the contents. The information is structured as follows:

  - activity_labels.txt: This activity list maps to the main observation data sets.
  - features_info.txt: Describes the data field measurement types. Important for answering further questions.
  - features.txt: These are the "column" names for both the X_test.txt and X_train.txt data sets.
  - README.txt: Overview document.

  ./UCI HAR Dataset/test/ contains the files:

  - subject_test.txt: subject ID for each observation in X_test.txt
  - X_test.txt: detailed observations
  - y_test.txt: activity ID for each observation in X_test.txt

  ./UCI HAR Dataset/test/Inertial Signals can be ignored.

  ./UCI HAR Dataset/train/ contains the files:

  - subject_train.txt: subject ID for each observation in X_train.txt
  - X_train.txt: detailed observations
  - y_train.txt: activity ID for each observation in X_train.txt

  ./UCI HAR Dataset/train/Inertial Signals can be ignored.

  Each of train and test has 3 data files that must be combined. They are 3 different data sets pertaining to the same observations:

  - "subject_*.txt" has the ID of the subject observed.
  - "X_*.txt" has the actual observations. There are no column names here; these can be found in "features.txt".
  - "y_*.txt" has the activity label IDs. These can later be joined with activity_labels.txt to know which activity we are dealing with.

  So this is how we have to put this together. Picture these as blocks that we either stack on top of or put next to each other. In some cases we force a label (e.g. "subject_id") or we force a column value ("train" and "test", to differentiate the data source):

  group | subject_id | activity_id | features.txt$description |
  ---|---|---|---|
  "train" | subject_train.txt | y_train.txt | X_train.txt |
  "test" | subject_test.txt | y_test.txt | X_test.txt |

  The sample data will look something like this after we are done:

group | subject_id | activity_id | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | ... | angle(Z,gravityMean) |
---|---|---|---|---|---|---|
train | 1 | 5 | 2.8858451e-001 | -2.0294171e-02 | ... | -5.8626924E-02 |
... | ... | ... | ... | ... | ... | ... |
test | 2 | 5 | 2.5717778e-001 | -2.3285230e-002 | ... | -6.7430222e-001 |
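
The acquisition step could be sketched as follows, using base R only. The helper name `fetch_har_data`, its `dest_dir` argument, and the local zip file name `har_dataset.zip` are hypothetical and are not taken from run_analysis.R; the URL in the usage comment is the data source [3].

```r
# Hypothetical helper: download and unzip the data set only when needed.
fetch_har_data <- function(url, dest_dir = ".") {
  data_dir <- file.path(dest_dir, "UCI HAR Dataset")
  # Skip all work if the unzipped folder is already present.
  if (dir.exists(data_dir)) {
    return(data_dir)
  }
  zip_path <- file.path(dest_dir, "har_dataset.zip")  # assumed file name
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary zip intact on Windows.
    download.file(url, destfile = zip_path, mode = "wb")
  }
  unzip(zip_path, exdir = dest_dir)
  data_dir
}

# Usage (needs network access, so it is left commented out):
# har_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
# data_dir <- fetch_har_data(har_url)
# list.files(data_dir)
```

Checking for the unzipped folder first means the script can be re-run without downloading the archive again.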
- Merge the training and the test sets to create one data set.

  Load common data:

  - Load features.txt into the 'features' data frame with column names featureID and featureDesc.
  - Load activity_labels.txt into the 'activities' data frame with column names activityID and activityDesc.

  Load test data:

  - Load subject_test.txt into the testSubject data frame with a column name of subjectID.
  - Load y_test.txt into the testActivity data frame with a column name of activityID.
  - Load X_test.txt into the testReadings data frame and force features$featureDesc as the column names.

  Load train data:

  - Load subject_train.txt into the trainSubject data frame with a column name of subjectID.
  - Load y_train.txt into the trainActivity data frame with a column name of activityID.
  - Load X_train.txt into the trainReadings data frame and force features$featureDesc as the column names.

  Combine all test data:

  - Place testSubject, testActivity and testReadings into a single data frame called test by putting the 3 data frames next to each other using cbind().
  - Add an additional column called test$group with a value of 'test' for every observation, so that we can distinguish between test and train data once it is all combined.

  Combine all train data:

  - Place trainSubject, trainActivity and trainReadings into a single data frame called train by putting the 3 data frames next to each other using cbind().
  - Add an additional column called train$group with a value of 'train' for every observation, so that we can distinguish between test and train data once it is all combined.

  Combine train and test data:

  - Using rbind(), combine the test and train data frames into a single new data frame called motionData.
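
The loading and combining steps above can be sketched on a toy copy of the file layout. The tiny files written here are made-up stand-ins for the real data, `load_group` is a hypothetical helper, and `check.names = FALSE` is an assumption made so that the feature names are kept verbatim; run_analysis.R may do this differently.

```r
# Build a toy directory that mimics the UCI HAR layout (made-up values).
root <- tempfile("har_toy")
for (g in c("test", "train")) dir.create(file.path(root, g), recursive = TRUE)
writeLines(c("1 tBodyAcc-mean()-X", "2 tBodyAcc-std()-X"),
           file.path(root, "features.txt"))
writeLines(c("2", "2"), file.path(root, "test", "subject_test.txt"))
writeLines(c("5", "4"), file.path(root, "test", "y_test.txt"))
writeLines(c("0.25 -0.98", "0.27 -0.97"), file.path(root, "test", "X_test.txt"))
writeLines(c("1", "1"), file.path(root, "train", "subject_train.txt"))
writeLines(c("5", "1"), file.path(root, "train", "y_train.txt"))
writeLines(c("0.28 -0.99", "0.29 -0.96"), file.path(root, "train", "X_train.txt"))

# Common data: the feature names become the column names of the readings.
features <- read.table(file.path(root, "features.txt"),
                       col.names = c("featureID", "featureDesc"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: read the 3 files of one group and put them side by side.
load_group <- function(root, group, features) {
  path <- function(prefix) {
    file.path(root, group, paste0(prefix, "_", group, ".txt"))
  }
  subject  <- read.table(path("subject"), col.names = "subjectID")
  activity <- read.table(path("y"), col.names = "activityID")
  # check.names = FALSE (an assumption) keeps names such as "tBodyAcc-mean()-X".
  readings <- read.table(path("X"), col.names = features$featureDesc,
                         check.names = FALSE)
  combined <- cbind(subject, activity, readings)
  combined$group <- group  # marks where each observation came from
  combined
}

# Stack the two groups on top of each other.
motionData <- rbind(load_group(root, "test", features),
                    load_group(root, "train", features))
print(motionData)
```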
- Extract only the Mean and Standard Deviation for each measurement.

  We need to keep the key variables and the mean and standard deviation measurements. The select() function from the dplyr package is used here to create a new transformed data frame called motionDataMeanSTD. The required elements:

  - group
  - subjectID
  - activityID
  - all mean measurements: contains("mean")
  - all standard deviation measurements: contains('std')
  - exclude 'meanFreq', since that is not the same as a mean: -contains('meanFreq')
  - exclude all of the 'angle' measurements. These measure an angle between gravity and another measurement; the word 'Mean' sometimes appears in their names, but they are not an actual mean of an observation as requested.
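
The include and exclude rules above can be illustrated with base R's grepl() in place of dplyr's select(); this is a swapped-in technique chosen to keep the sketch free of package dependencies, not the code from run_analysis.R. The toy columns and values are made up. The pattern match ignores case because dplyr's contains() does so by default, which is why the 'angle' columns need an explicit exclusion.

```r
# Toy data frame with the three key columns and five made-up feature columns.
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-mad()-X",
                   "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")
readings <- as.data.frame(matrix((1:10) / 10, nrow = 2))
names(readings) <- feature_names
keys <- data.frame(group = c("train", "test"),
                   subjectID = c(1L, 2L),
                   activityID = c(5L, 5L),
                   stringsAsFactors = FALSE)
motionData <- cbind(keys, readings)

col_names <- names(motionData)
# Key variables are always kept.
is_key <- col_names %in% c("group", "subjectID", "activityID")
# Case-insensitive match, mirroring contains("mean") and contains("std").
is_mean_or_std <- grepl("mean|std", col_names, ignore.case = TRUE)
# meanFreq and angle columns are not means of an observation, so drop them.
is_excluded <- grepl("meanFreq|angle", col_names)

motionDataMeanSTD <- motionData[, is_key | (is_mean_or_std & !is_excluded),
                                drop = FALSE]
print(names(motionDataMeanSTD))
```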
- Use descriptive activity names to name the activities in the data set.

  We can use the merge() function to combine the activities and motionDataMeanSTD data frames into a new one called motionDataMeanSTDActivity, since both have an activityID column. This gives us activities$activityDesc, which is what we need to fulfill this requirement. We can then drop motionDataMeanSTDActivity$activityID.
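
A minimal sketch of this join on toy data. The activities data frame holds the six labels from activity_labels.txt; the measurement values are made up.

```r
# Lookup table: the six activity labels shipped with the data set.
activities <- data.frame(
  activityID = 1:6,
  activityDesc = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                   "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Toy observations (made-up values).
motionDataMeanSTD <- data.frame(
  group = c("train", "test", "test"),
  subjectID = c(1L, 2L, 2L),
  activityID = c(5L, 5L, 4L),
  "tBodyAcc-mean()-X" = c(0.29, 0.26, 0.27),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Join on the shared activityID column, then drop the numeric ID.
motionDataMeanSTDActivity <- merge(motionDataMeanSTD, activities,
                                   by = "activityID")
motionDataMeanSTDActivity$activityID <- NULL
print(motionDataMeanSTDActivity)
```

Note that merge() sorts the result by the join key by default, so the row order of the merged data frame can differ from the input. That does not matter here because later steps summarise by activity and subject.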
- Appropriately label the data set with descriptive variable names.

  We will now create feature descriptions that are meaningful. This is important for obtaining a tidy data set. The README.txt file helps us understand the variable names here.

  - PREFIX: if prefixed with 't' (time) we will rename it 'time'; if prefixed with 'f' (frequency) we will rename it 'freq'.
  - MEASUREMENT: the measurements already have meaningful names and do not need to be simplified. Examples are BodyAcc, GravityAcc, BodyAccJerk, BodyGyro, BodyGyroJerk, BodyAccMag, GravityAccMag, BodyAccJerkMag, BodyGyroMag, BodyGyroJerkMag.
  - FUNCTION ON MEASUREMENT: the measurements have functions applied to them, which we can identify by '-fn()', where 'fn' is the function name. These include mean(), std(), mad(), max(), min(), sma(), energy(), iqr(), entropy(), arCoeff(), correlation(), etc. However, the only two we care about are the mean, stated as '-mean()', and the standard deviation, stated as '-std()'. We will rename these 'Mean' and 'Std' respectively.
  - AXIS: some measurements have an axis direction denoted with '-X', '-Y', '-Z' and other variations. However, the mean() and std() measurements use only the simple variations '-X', '-Y', '-Z', so it is only these three that we will rename.
  - MISC: there are other observation variations (see all 'angle' prefixed measurements) that are not covered by the above. Since we will not need these in the final tidy data output, we will not rename them and simply keep them as is.

  EXAMPLES

  - tBodyGyroJerk-mean()-X ==> timeBodyGyroJerkMeanX
  - fBodyAccJerk-std()-Z ==> freqBodyAccJerkStdZ
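
The renaming rules above could be written as a small function built on sub(). The name `tidy_feature_name` is hypothetical, and the sketch assumes the original feature names still contain '-', '(' and ')'.

```r
# Hypothetical helper applying the PREFIX, FUNCTION and AXIS rules in turn.
tidy_feature_name <- function(x) {
  x <- sub("^t", "time", x)            # PREFIX: t -> time
  x <- sub("^f", "freq", x)            # PREFIX: f -> freq
  x <- sub("-mean\\(\\)", "Mean", x)   # FUNCTION: -mean() -> Mean
  x <- sub("-std\\(\\)", "Std", x)     # FUNCTION: -std() -> Std
  x <- sub("-([XYZ])$", "\\1", x)      # AXIS: -X, -Y, -Z -> X, Y, Z
  x
}

old_names <- c("tBodyGyroJerk-mean()-X", "fBodyAccJerk-std()-Z",
               "tBodyAccMag-mean()")
new_names <- tidy_feature_name(old_names)
print(new_names)
```

Because the prefix rules match any leading 't' or 'f', the function should be applied only to the measurement columns and only once, not to key columns or to names that have already been renamed.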
- Create a second, independent tidy data set with the average of each variable for each activity and each subject.

  Since we need an average of each variable for each activity and subject, we can safely exclude 'group', the variable used to distinguish between 'test' and 'train' observations. Using the dplyr chaining operator (%>%), we do this one step at a time:

  - Select all columns except group, leaving activity, subject and all measurements.
  - Group by activity and subject, since these are what we summarise on.
  - Find the mean of each measurement within each group by calling summarise_all() with an argument of funs(mean).

  As a final step, we use write.table() to produce the text file from the tidy data set, to be submitted with the project.
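
The same summary can be sketched in base R with aggregate() instead of the dplyr chain; this is a swapped-in technique chosen so the example runs without extra packages, not the code from run_analysis.R. The toy values are made up, the output goes to a temporary directory, and `row.names = FALSE` is an assumption about how the file is written.

```r
# Toy input with tidy column names (made-up values).
motionDataTidyNames <- data.frame(
  group = c("train", "train", "test", "test"),
  subjectID = c(1L, 1L, 2L, 2L),
  activityDesc = c("STANDING", "STANDING", "STANDING", "SITTING"),
  timeBodyAccMeanX = c(0.2, 0.4, 0.3, 0.5),
  timeBodyAccStdX = c(-0.9, -0.7, -0.8, -0.6),
  stringsAsFactors = FALSE
)

# Drop 'group': it only records which source file a row came from.
measurements <- motionDataTidyNames[, setdiff(names(motionDataTidyNames),
                                              "group")]

# Average every remaining column for each activity and subject pair.
tidyAverages <- aggregate(. ~ activityDesc + subjectID,
                          data = measurements, FUN = mean)
print(tidyAverages)

# Write the result as a text file (row.names = FALSE is an assumption).
out_file <- file.path(tempdir(), "step5Output.txt")
write.table(tidyAverages, out_file, row.names = FALSE)
```

The toy input has three distinct activity and subject pairs, so the result has three rows, one per pair.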
step5Output.txt is a copy of the tidy data set produced by the run_analysis.R program, following the instructions required for the project.
[1] Course Project Information: https://class.coursera.org/getdata-030/human_grading
[2] Study information: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones#
[3] Data Source: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip