minollisantiago/GettingAndCleaningData

README.md for run_analysis.R


Author:	Brian von Konsky
Created:	April 2014
Course:	Getting and Cleaning Data
Repository:	https://github.com/bvonkonsky/GettingAndCleaningData

#### Overview

The R script `run_analysis.R` produces a tidy dataset from data originally collected by Anguita and colleagues (Anguita et al., 2012) and available on the UCI Machine Learning Repository.

#### Original Data

The original data captured motion attributes using accelerometers and gyroscopes embedded in Samsung Galaxy SII smartphones. Data were randomly partitioned into training and test sets, which were used to train and test a system for classifying human activity. A goal was to develop an approach to non-invasively monitor the activity of the elderly.

Measurements were taken from experimental subjects wearing the smartphones while engaged in the following activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying. The training data consisted of 7352 observations of 561 variables from 21 subjects. The test data consisted of 2947 observations of the same 561 variables from 9 subjects.

The original data format stored test and training data in different sub-directories. Each sub-directory contained three files. The three files stored raw data, subject identification numbers, and integer values coding one of the 6 activities. Information for each observation was held in the corresponding line of each file.
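Because the three files line up row by row, they can be combined column-wise. The following is a minimal sketch of that idea, not the code in run_analysis.R. The in-memory strings are hypothetical stand-ins for the three files, and the column names `V1` and `V2` are placeholders for the 561 real variables.

```r
# Hypothetical miniature stand-ins for the three files in one sub-directory.
# Each has one line per observation, so line i of every file describes
# the same observation.
subjectText  <- "1\n1\n3"
activityText <- "5\n4\n1"
dataText     <- "0.28 -0.02\n0.27 -0.01\n0.30 -0.03"

subjects     <- read.table(text = subjectText, col.names = "SubjectID")
activities   <- read.table(text = activityText, col.names = "ActivityID")
measurements <- read.table(text = dataText, col.names = c("V1", "V2"))

# Column-bind the three pieces; row order is what ties them together.
combined <- cbind(subjects, activities, measurements)
print(combined)
```

With the real files, `text = ...` would be replaced by the path to each file, and the row counts of all three should be checked for equality before binding.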

#### Tidy Data Produced by run_analysis.R

The script `run_analysis.R`:

* merges the training and test data by Subject ID for a given Activity;
* retains mean and standard deviation attributes, assumed to be those attributes that contain `mean` or `std` anywhere in the attribute name, and drops others;
* makes attribute names arguably more readable by changing abbreviations like `Mag` to `Magnitude`, `Acc` to `Acceleration`, and `std` to `StandardDeviation`, and removing parentheses, dots, and hyphens;
* adds columns identifying the subject by SubjectID and Activity, where activity is shown as a human readable string rather than as an integer;
* generates a tidy dataset of the merged data in Comma Separated Values (CSV) format in a file called `tidyMerged.csv`; and
* generates a second tidy dataset in CSV format that contains the average of reported attributes by subject and activity in a file called `tidyAveraged.csv`.
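The attribute selection and renaming steps can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual code: the sample feature names and the helper `tidyName` are hypothetical, and the names produced by run_analysis.R may differ in detail (see CodeBook.md).

```r
# Hypothetical sample of feature names in the style of the original data.
featureNames <- c("tBodyAcc-mean()-X",
                  "tBodyAcc-std()-Y",
                  "tBodyAccMag-max()",
                  "fBodyGyro-meanFreq()-Z")

# Keep names that contain "mean" or "std" anywhere; drop the rest.
kept <- featureNames[grepl("mean|std", featureNames)]

# Hypothetical helper: expand abbreviations, then strip punctuation.
tidyName <- function(name) {
  name <- gsub("Mag", "Magnitude", name, fixed = TRUE)
  name <- gsub("Acc", "Acceleration", name, fixed = TRUE)
  name <- gsub("std", "StandardDeviation", name, fixed = TRUE)
  gsub("[-().]", "", name)  # remove hyphens, parentheses, and dots
}

tidyNames <- vapply(kept, tidyName, character(1), USE.NAMES = FALSE)
print(tidyNames)
```

Note that matching `mean` anywhere in the name also retains attributes such as `meanFreq()`, which is consistent with the assumption stated in the list above.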

See Also: CodeBook.md

#### Running run_analysis.R

To use `run_analysis.R`, do the following:

1. Download and install R and RStudio.
2. Obtain a copy of `run_analysis.R` from GitHub and store it in your project directory.
3. Run RStudio.
4. Use `setwd("<project directory>")` to set the working directory to your project directory containing the `run_analysis.R` script.
5. Use `source("run_analysis.R")` to run the script. If necessary, the script will download and unzip the original data into the current working directory. The original dataset is large, so please be patient. Not including the initial download and unzip, the script takes around 30 seconds to run on a 2.3 GHz Intel Core i7 iMac running Mac OS X 10.9.2.
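The "download only if necessary" behaviour in the last step could look roughly like the sketch below. The function name `ensureData` and its arguments `dataDir` and `dataUrl` are hypothetical and are not taken from run_analysis.R; the real script may structure this differently.

```r
# Sketch only: download and unzip the data unless the directory already exists.
# dataDir and dataUrl are hypothetical arguments supplied by the caller.
ensureData <- function(dataDir, dataUrl) {
  if (dir.exists(dataDir)) {
    return(invisible(FALSE))  # data already present, nothing downloaded
  }
  # file.path() builds paths that work on any operating system.
  zipPath <- file.path(tempdir(), "dataset.zip")
  download.file(dataUrl, destfile = zipPath, mode = "wb")
  unzip(zipPath, exdir = dirname(dataDir))
  invisible(TRUE)             # data was downloaded and unpacked
}
```

Returning a logical flag lets the caller report whether a download took place, which is useful given how long the initial fetch can take.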

####Functions

Functions in run_analysis.R are listed below.

* `main <- function()`: Creates two tidy data frames for the training and test data and then merges these into a single tidy data frame. The merged data frame is written to a CSV file called `tidyMerged.csv`. The merged data frame is then averaged by activity for each subject, and written to a second CSV file called `tidyAveraged.csv`. Paths to files and directories in the original data are built with the `file.path()` function so that they work on all operating systems supported by R.
* `getAndClean <- function(subjectsFilename, labelsFilename, dataFilename)`: Recovers raw data, subject ID numbers, and activities from the three files used to store information from one of the two original data sets (either test or training) and combines data for that set into a single tidy data frame.
* `getData <- function(fileName)`: Gets the feature set and the raw data. Keeps columns with attribute names that contain `mean` or `std`, and edits these names to make them more readable. Drops columns whose names did not originally contain `mean` or `std`.
* `getActivities <- function(fileName)`: Reads the Activity ID for each observation and converts this to a meaningful English verb (e.g. WALKING, STANDING).
* `getSubjectIDs <- function(fileName)`: Returns a list of SubjectIDs for each observation in the set.
* `getActivityLabels <- function(filename)`: Returns an ordered list of sequential activity labels for use as a lookup table in other functions.
* `averageTidy <- function(mergedDF)`: Averages the data for each measurement in `mergedDF` for each combination of subject and activity. Uses `aggregate()` to quickly compute means.
* `downloadData <- function()`: Checks to see if a subdirectory with the original data exists in the current working directory. If not, the function downloads and unzips the original data from the UCI Machine Learning Repository.
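Two of the ideas above, the activity lookup table and the per-subject averaging with `aggregate()`, can be illustrated with a small sketch. The data frame below is a hypothetical miniature of the merged data, and the code is an assumption about how such functions might work, not a copy of run_analysis.R.

```r
# Activity labels in ID order, as listed in the original data set.
activityLabels <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                    "SITTING", "STANDING", "LAYING")

# Integer activity IDs index directly into the lookup table.
activityIDs   <- c(5, 4, 1)
activityNames <- activityLabels[activityIDs]

# Hypothetical miniature merged data frame; the real one has many more columns.
mergedDF <- data.frame(
  SubjectID = c(1, 1, 1, 2),
  Activity  = c("WALKING", "WALKING", "SITTING", "WALKING"),
  tBodyAccelerationmeanX = c(0.2, 0.4, 0.1, 0.3)
)

# Average every measurement column for each subject and activity pair.
averagedDF <- aggregate(. ~ SubjectID + Activity, data = mergedDF, FUN = mean)
print(averagedDF)
```

The formula `. ~ SubjectID + Activity` treats every remaining column as a measurement to average, so the same call works unchanged when the data frame has dozens of attribute columns.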

#### Potential Modifications

Feature names contain a leading `t` to designate that the variable is in the time domain and `f` to denote that it is in the frequency domain. These could be expanded if desired, although this was not done in this case to avoid variable names becoming even longer and more unwieldy.
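If the expansion were wanted, one possible approach is sketched below. The helper `expandDomainPrefix` and the replacement words `Time` and `Frequency` are hypothetical choices; nothing like this is part of run_analysis.R.

```r
# Hypothetical helper: expand a leading "t" or "f" into a full domain name.
# Names without such a prefix (e.g. SubjectID) pass through unchanged.
expandDomainPrefix <- function(names) {
  names <- sub("^t", "Time", names)
  sub("^f", "Frequency", names)
}

expanded <- expandDomainPrefix(c("tBodyAccelerationmeanX",
                                 "fBodyGyromeanFreqZ",
                                 "SubjectID"))
print(expanded)
```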

#### References

Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012.
