Getting and Cleaning Data Course Project

by Steve Myles, February 2015

Assignment:

You should create one R script called run_analysis.R that does the following.

1. Merges the training and the test sets to create one data set.

2. Extracts only the measurements on the mean and standard deviation for each measurement.

3. Uses descriptive activity names to name the activities in the data set.

4. Appropriately labels the data set with descriptive variable names.

5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Files:

README.md (this file)
run_analysis.R (the R script that does the merging and transformations)
CodeBook.md (describes the variables, the data, the transformations, and the work that is performed to clean up the data)

Script (run_analysis.R)

Assumptions

The Samsung data is available in the working directory, or the user is connected to the Internet. The script checks the following, in order:
Whether an unzipped folder named "UCI HAR Dataset" is available, with the directory structure from the original zip file left intact (e.g., with subdirectories called "test" and "train" inside "UCI HAR Dataset"). If so, the script continues.
If the "UCI HAR Dataset" folder is not available, whether the original zip file is available. If so, it is unzipped and the script continues.
If the original zip file is not available either, the script downloads the file from the Internet, unzips it, and continues.
The user has either previously installed the plyr, dplyr, and reshape2 packages or is connected to the Internet so the script can install them.
"Extracts only the measurements on the mean and standard deviation" is assumed to mean all variables from the original data set with the strings "Mean", "mean", or "std" in their names.
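
The availability checks and the package handling assumed above could be sketched roughly as follows. This is a minimal illustration, not the code in run_analysis.R: the function names (dataset_source, ensure_dataset, missing_packages, ensure_packages) are hypothetical, and the zip file name and download URL are left as arguments because this README does not state them. The sketch also assumes the zip file contains "UCI HAR Dataset" as its top-level folder.

```r
# Decide where the data will come from, mirroring the three checks above.
# `data_dir` and `zip_file` are paths supplied by the caller (hypothetical names).
dataset_source <- function(data_dir, zip_file) {
  if (dir.exists(data_dir)) {
    "folder"      # unzipped folder already present
  } else if (file.exists(zip_file)) {
    "zip"         # zip present, needs unzipping
  } else {
    "download"    # nothing local, needs downloading first
  }
}

# Make sure the unzipped folder exists, downloading and/or unzipping as needed.
# `zip_url` is a required argument: the README does not give the address.
ensure_dataset <- function(data_dir, zip_file, zip_url) {
  src <- dataset_source(data_dir, zip_file)
  if (src == "download") {
    download.file(zip_url, destfile = zip_file, mode = "wb")
  }
  if (src != "folder") {
    # Assumes the archive's top-level folder is the data folder itself.
    unzip(zip_file, exdir = dirname(data_dir))
  }
  invisible(src)
}

# Return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing, then attach every package.
ensure_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Example calls (not run here; the zip name and URL are placeholders):
# ensure_dataset("UCI HAR Dataset", "<zip file name>", "<download URL>")
# ensure_packages(c("plyr", "dplyr", "reshape2"))
```

Keeping the decision (dataset_source) separate from the side effects (downloading, unzipping) makes the three-way check easy to try out without touching the network.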

Process

run_analysis.R goes through the following process to meet the requirements. Please see the script for details about how each step is performed.

Check whether the UCI HAR Dataset is available in the working directory. If not, the script checks for the zip file, unzips it, and continues. If that is also unavailable, the script downloads the zip file, unzips it, and continues.
Merge the training and the test sets to create one data set (requirement #1):
Load the plyr, dplyr, and reshape2 packages, which are needed for the data manipulation that follows. If they have not been previously installed, the script installs them.
Read the "Activity Labels" and "Features" data into data frames, labeling the columns and converting the Features data to character type.
Read the separate parts of the training data set (training activities, "y_train.txt"; training data, "X_train.txt"; and training subjects, "subject_train.txt") and combine them into a single data frame using cbind.
Read the separate parts of the test data set (test activities, "y_test.txt"; test data, "X_test.txt"; and test subjects, "subject_test.txt") and combine them into a single data frame using cbind.
Merge the training and test data sets into a single data frame using rbind.
Add activity labels from the "Activity Labels" data frame, so that descriptive activity names identify the activities in the data set (requirement #3).
Name the columns based on the "Features" data, so that the data set is labeled with descriptive variable names (requirement #4).
Extract only the measurements on the mean and standard deviation for each measurement (requirement #2):
Subset the merged data frame to the columns representing mean and standard deviation.
To do this, search for all variable names containing the strings "Mean", "mean", or "std".
Create a second, independent tidy data set with the average of each variable for each activity and each subject (requirement #5):
Make all column names legal for R by removing all parentheses and dashes.
Change variable names to Camel Case (i.e., capitalize "Mean" and "Std" regardless of where in the column name they appear) to increase readability.
Change the names of the misnamed variables in the original data set (i.e., change "BodyBody" to "Body").
Change the "Subject" column to a factor for easier manipulation.
Sort the merged data frame by "Subject" and "Activity".
Calculate the average (mean) of each numeric column and return a data frame ("merged_summary" in the script) with the average for each subject/activity combination.
As instructed, write this data frame to a text file called "samsung_summary.txt" using row.names = FALSE.
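
To make the flow of these steps concrete, here is a small sketch that runs them on made-up data. It is an illustration under stated assumptions rather than the contents of run_analysis.R: it uses base R only (aggregate, gsub, grepl) instead of plyr, dplyr, and reshape2 so that it runs without extra packages, the toy values and the make_part and clean_names helpers are hypothetical, and the output goes to a temporary directory instead of the working directory.

```r
# Toy stand-ins for the "Activity Labels" and "Features" files (made-up subset).
activity_labels <- data.frame(ActivityID = 1:2,
                              Activity = c("WALKING", "SITTING"),
                              stringsAsFactors = FALSE)
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "fBodyBodyGyroMag-mean()", "tBodyAcc-max()-X")

# Build one part (train or test) with cbind: subject, activity ID, measurements.
make_part <- function(subjects, activity_ids, seed) {
  set.seed(seed)
  x <- as.data.frame(matrix(rnorm(length(subjects) * length(features)),
                            ncol = length(features)))
  cbind(Subject = subjects, Activity = activity_ids, x)
}
train <- make_part(c(1, 1, 1, 1), c(1, 1, 2, 2), seed = 1)
test  <- make_part(c(2, 2, 2, 2), c(1, 1, 2, 2), seed = 2)

# Requirement 1: stack the training and test parts with rbind.
merged <- rbind(train, test)

# Requirement 3: replace activity IDs with descriptive labels.
merged$Activity <- activity_labels$Activity[
  match(merged$Activity, activity_labels$ActivityID)]

# Requirement 4: name the measurement columns from the feature list.
names(merged)[-(1:2)] <- features

# Requirement 2: keep the ID columns plus names containing Mean, mean, or std.
keep <- names(merged) %in% c("Subject", "Activity") |
  grepl("Mean|mean|std", names(merged))
merged <- merged[, keep]

# Requirement 5: tidy the names, then average per subject/activity pair.
clean_names <- function(x) {
  x <- gsub("[()-]", "", x)      # remove parentheses and dashes
  x <- gsub("mean", "Mean", x)   # capitalize "Mean" wherever it appears
  x <- gsub("std", "Std", x)     # capitalize "Std" wherever it appears
  gsub("BodyBody", "Body", x)    # fix the misnamed variables
}
names(merged) <- clean_names(names(merged))
merged$Subject <- factor(merged$Subject)
merged <- merged[order(merged$Subject, merged$Activity), ]
merged_summary <- aggregate(. ~ Subject + Activity, data = merged, FUN = mean)

# Written to a temporary directory here to avoid touching the working directory.
out_file <- file.path(tempdir(), "samsung_summary.txt")
write.table(merged_summary, out_file, row.names = FALSE)
```

With two subjects and two activities, merged_summary has one row per subject/activity pair and one column per retained measurement. The "tBodyAcc-max()-X" column is dropped by the mean/std filter.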

References

Sean C. Anderson. An Introduction to reshape2. Blog post on http://seananderson.ca/. Oct 2013.
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012.
"GitHub Flavored Markdown." GitHub Help. 2015.
Simon Hanlon. Response to "Elegant way to check for missing packages and install them?". Stack Overflow. Nov 2013.
David Hood. David's Course Project FAQ. Getting and Cleaning Data Discussion Forum. Coursera. Feb 2015.
David Hood. Response to "Final Step". Getting and Cleaning Data Discussion Forum. Coursera. Feb 2015.
David Hood. Response to "Codebook". Getting and Cleaning Data Discussion Forum. Coursera. Feb 2015.
Kenneth Karan. Response to "Any standard format for codebook?". Getting and Cleaning Data Discussion Forum. Coursera. Feb 2015.
Matthew Taylor. Response to "Silly situation: trouble with unzipping the DataSet.zip file". Getting and Cleaning Data Discussion Forum. Coursera. Feb 2015.
