stradaconsulting / gettingCleaningDataProject

Coursera's "Getting and Cleaning Data" course project repo

Getting and Cleaning Data Course Project Repo

Project Summary

    Coursera Course: Getting and Cleaning Data
    Assignment:      Course Project [1]
    Author:          Santiago Oleas

Project Purpose

Using the data captured in the study Human Activity Recognition Using Smartphones Dataset Version 1.0 [2], gather the data and clean it following the instructions in the course project. The project deliverables are submitted either through GitHub or through the Coursera web site:

  • A link to this GitHub repository, to be provided on the Coursera Project page
  • README.md: This file, which is in the GitHub repo.
  • CodeBook.md: A codebook describing the data output and the sources
  • run_analysis.R: An R program that takes the data from the study and transforms it based on the instructions provided.
  • step5Output.txt: Tidy data output from run_analysis.R

Repo Contents

README.md

This file, which provides an overview of the repo contents and describes the method that the run_analysis.R program follows.

CodeBook.md

This file contains the following information pertaining to the project and the method for producing the final data output.

  • Information about the experimental design study
  • Information about the source variables, including the units.
  • Information about the transformation of any source variables, including summaries.

run_analysis.R

This file is the R program that takes the data provided and loads, transforms and extracts it in the required format.

Process

  • Acquire Data

      Get zip source data file from provided source [3]
      - Unzip file and observe contents
    
      Information is structured as follows: 
      - activity_labels.txt This activity list maps to the main observation data sets
      - features_info.txt   Describes data field measurement types. Important for answering 
                            further questions
      - features.txt        These are the "column" names for both sets of X_test.txt and X_train.txt data
      - README.txt          Overview doc
      
      ./UCI HAR Dataset/test/
      contains files
      - subject_test.txt    subject ID for each observation in X_test.txt
      - X_test.txt          detailed observations
      - y_test.txt          activity ID for each observation in X_test.txt
      ./UCI HAR Dataset/test/Inertial Signals ***Can be ignored
      
      ./UCI HAR Dataset/train/
      contains files        
      - subject_train.txt   subject ID for each observation in X_train.txt
      - X_train.txt         detailed observations
      - y_train.txt         activity ID for each observation in X_train.txt
      ./UCI HAR Dataset/train/Inertial Signals ***Can be ignored
    
      Each of train and test has 3 data files that must be combined.  They are 3 different 
      views of the same observations, matched row by row:
      "subject_*.txt" has the ID of the subject that was observed
      "X_*.txt"       has the actual observations.  There are no COLUMN names here; these can be 
                      found in "features.txt"
      "y_*.txt"       has the activity label IDs.  These can later be joined with activity_labels.txt 
                      to know which activity we are dealing with
      
      So this is how we have to put this together. Picture these as blocks that we either stack 
      on top of or put next to each other. In some cases we force a label, e.g. "subject_id", or we 
      force a column value ("train" and "test" to differentiate the data source)
      ["group"]["subject_id"     ]["activity_id"][features.txt$description]
      ["train"][subject_train.txt][y_train.txt  ][X_train.txt ]
      ["test" ][subject_test.txt ][y_test.txt   ][X_test.txt  ]
      
      The sample data will look something like this after we are done:

      group subject_id activity_id tBodyAcc-mean()-X tBodyAcc-mean()-Y ... angle(Z,gravityMean)
      train 1          5           2.8858451e-001    -2.0294171e-02    ... -5.8626924E-02
      ...
      test  2          5           2.5717778e-001    -2.3285230e-002   ... -6.7430222e-001
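
The acquisition step above could be scripted roughly as follows. This is a minimal sketch, not necessarily what run_analysis.R does: the helper name `fetch_har_data`, the `data` directory and the local zip file name are assumptions made for illustration; only the URL comes from reference [3].

```r
# Hypothetical helper: download the zip from reference [3] and unpack it.
# The function name, the default directory and the zip file name are illustrative.
fetch_har_data <- function(dest_dir = "data") {
  url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
  zip_path <- file.path(dest_dir, "UCI_HAR_Dataset.zip")

  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

  # Skip the download when the archive is already present
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")
  }

  # Unpack the archive; the files described above sit under "UCI HAR Dataset"
  unzip(zip_path, exdir = dest_dir)

  # Return the dataset root so later steps can build paths from it
  file.path(dest_dir, "UCI HAR Dataset")
}
```

Calling `har_dir <- fetch_har_data()` followed by `list.files(har_dir)` should then list the files described above.
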
  • Merge the training and the test sets to create one data set.

      Load common data:
              - Load features.txt and place into 'features' data frame with column names 
                featureID and featureDesc
              - Load activity_labels.txt and place into 'activities' data frame with
                column names activityID and activityDesc
                
      Load test data
              - Load subject_test.txt into testSubject data frame with column name of subjectID
              - Load y_test.txt into testActivity data frame with column name of activityID
              - Load X_test.txt into testReadings data frame and force features$featureDesc as
                column names
    
      Load train data
              - Load subject_train.txt into trainSubject data frame with column name of subjectID
              - Load y_train.txt into trainActivity data frame with column name of activityID
              - Load X_train.txt into trainReadings data frame and force features$featureDesc as
                column names
               
      Combine all test data
              - place testSubject, testActivity and testReadings into a single data frame
                called test by placing the 3 data frames next to each other using cbind()
              - add an additional column called test$group with value of 'test' for every
                observation so that we can distinguish between test and train data when we
                combine it all together
                
      Combine all train data
              - place trainSubject, trainActivity and trainReadings into a single data frame
                called train by placing the 3 data frames next to each other using cbind()
              - add an additional column called train$group with value of 'train' for every
                observation so that we can distinguish between test and train data when we
                combine it all together
                
      Combine train and test data
              - using rbind combine the test and train data frames into a single new one
                called motionData
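
The load-and-combine steps above can be sketched as follows. So that the example runs without the download, it first writes a tiny stand-in for the dataset layout into a temporary directory; the three feature columns and all values are made up for illustration (the real X_*.txt files have 561 columns). The helper `load_group` is hypothetical and simply avoids repeating the same three reads for test and train.

```r
# Stand-in copy of the dataset layout (illustrative values only).
har_dir <- file.path(tempdir(), "UCI HAR Dataset")
for (g in c("test", "train")) {
  dir.create(file.path(har_dir, g), recursive = TRUE, showWarnings = FALSE)
}
writeLines(c("1 tBodyAcc-mean()-X", "2 tBodyAcc-std()-X", "3 angle(Z,gravityMean)"),
           file.path(har_dir, "features.txt"))
writeLines(c("1 WALKING", "5 STANDING"), file.path(har_dir, "activity_labels.txt"))
writeLines(c("2", "2"), file.path(har_dir, "test", "subject_test.txt"))
writeLines(c("5", "1"), file.path(har_dir, "test", "y_test.txt"))
writeLines(c("0.257 -0.938 -0.674", "0.286 -0.975 -0.699"),
           file.path(har_dir, "test", "X_test.txt"))
writeLines("1", file.path(har_dir, "train", "subject_train.txt"))
writeLines("5", file.path(har_dir, "train", "y_train.txt"))
writeLines("0.289 -0.995 -0.059", file.path(har_dir, "train", "X_train.txt"))

# Load common data
features <- read.table(file.path(har_dir, "features.txt"),
                       col.names = c("featureID", "featureDesc"),
                       stringsAsFactors = FALSE)
activities <- read.table(file.path(har_dir, "activity_labels.txt"),
                         col.names = c("activityID", "activityDesc"),
                         stringsAsFactors = FALSE)

# Hypothetical helper: read the three files of one group ("test" or "train"),
# place them side by side, and tag every row with the group name.
load_group <- function(group) {
  dir <- file.path(har_dir, group)
  subject  <- read.table(file.path(dir, paste0("subject_", group, ".txt")),
                         col.names = "subjectID")
  activity <- read.table(file.path(dir, paste0("y_", group, ".txt")),
                         col.names = "activityID")
  readings <- read.table(file.path(dir, paste0("X_", group, ".txt")))
  names(readings) <- features$featureDesc   # force feature names as column names
  combined <- cbind(subject, activity, readings)
  combined$group <- group
  combined
}

test  <- load_group("test")
train <- load_group("train")

# Stack the two groups into one data frame
motionData <- rbind(test, train)
```
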
    
  • Extract only the Mean and Standard Deviation for each measurement.

      We need to keep the key variables and the Mean and Standard Deviation measurements. The 
      select() function from the dplyr package is used here. A new transformed data frame
      called motionDataMeanSTD is created.  The required elements:
      - group
      - subjectID
      - activityID
      - all mean measurements:  contains("mean")
      - all standard deviation measurements:  contains('std')
      - exclude 'meanFreq' since that is not the same as mean: -contains('meanFreq')
      - exclude all of the 'angle' measurements. These are measuring an
        angle between gravity and another measurement and sometimes the
        word 'Mean' is in these measurements but they are not an actual
        mean of an observation as requested.
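
The selection rules above can be illustrated on a toy data frame. The sketch below uses base R pattern matching as a stand-in for the dplyr call, so that it runs without extra packages; the sample columns and values are made up. The match ignores case because dplyr's contains() does so by default, which is why the 'angle' columns (containing 'Mean') have to be excluded explicitly.

```r
# Toy version of motionData with one column of each kind (illustrative values)
motionData <- data.frame(
  group = c("train", "test"),
  subjectID = c(1L, 2L),
  activityID = c(5L, 5L),
  `tBodyAcc-mean()-X` = c(0.289, 0.257),
  `tBodyAcc-std()-X` = c(-0.995, -0.938),
  `tBodyAcc-max()-X` = c(-0.934, -0.894),
  `fBodyAcc-meanFreq()-X` = c(0.252, 0.071),
  `angle(Z,gravityMean)` = c(-0.059, -0.674),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

cols <- names(motionData)

# Key variables that are always kept
is_key <- cols %in% c("group", "subjectID", "activityID")

# Mean and standard deviation measurements, minus meanFreq and angle columns
is_stat     <- grepl("mean|std", cols, ignore.case = TRUE)
is_excluded <- grepl("meanFreq|angle", cols, ignore.case = TRUE)

motionDataMeanSTD <- motionData[, is_key | (is_stat & !is_excluded), drop = FALSE]

# Presumed dplyr form of the same selection, following the description above:
# select(motionData, group, subjectID, activityID,
#        contains("mean"), contains("std"),
#        -contains("meanFreq"), -contains("angle"))
```
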
    
  • Use descriptive activity names to name the activities in the data set

      We can use the merge() function to combine the activities and the
      motionDataMeanSTD data frames into a new one called motionDataMeanSTDActivity.
      They both have activityID. This gives us activities$activityDesc, which is what
      we need to include to fulfill this requirement.
      We can then drop motionDataMeanSTDActivity$activityID
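
A minimal sketch of this join on toy data (the rows and values are made up for illustration). Note that merge() sorts the result on the key column by default, so the row order of the merged data frame may differ from the input.

```r
# Lookup table of activity labels (two of the labels, for illustration)
activities <- data.frame(
  activityID = c(1L, 5L),
  activityDesc = c("WALKING", "STANDING"),
  stringsAsFactors = FALSE
)

# Toy version of motionDataMeanSTD
motionDataMeanSTD <- data.frame(
  group = c("train", "test", "test"),
  subjectID = c(1L, 2L, 2L),
  activityID = c(5L, 5L, 1L),
  `tBodyAcc-mean()-X` = c(0.289, 0.257, 0.286),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Join on the shared activityID column to bring in activityDesc
motionDataMeanSTDActivity <- merge(activities, motionDataMeanSTD, by = "activityID")

# The numeric ID is no longer needed once the description is present
motionDataMeanSTDActivity$activityID <- NULL
```
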
    
  • Appropriately labels the data set with descriptive variable names.

      We will now create feature descriptions that are meaningful.
      This is important to help us with obtaining a Tidy Data Set.
      The README.txt file helps us understand the variable names
      here.
      PREFIX
      - if prefixed with 't' (time) we will rename as 'time'
      - if prefixed with 'f' (frequency) we will rename 'freq'
      MEASUREMENT
      - the measurements do have some meaningful names
        and these do not need to be simplified. Examples 
        are BodyAcc, GravityAcc, BodyAccJerk, BodyGyro
        BodyGyroJerk, BodyAccMag, GravityAccMag, 
        BodyAccJerkMag, BodyGyroMag, BodyGyroJerkMag
      FUNCTION ON MEASUREMENT
      - the measurements have functions applied to them
        which we can identify with '-fn()', where 'fn' is 
        the function name that is prefixed with '-'.  These 
        include mean(), std(), mad(), max(), min(), sma(), energy(),
        iqr(), entropy(), arCoeff(), correlation(), etc.
        However the only two we care about are the 
        mean, stated as '-mean()', and the standard deviation, 
        stated as '-std()', which we will rename 'Mean' and 'Std' 
        respectively.
      AXIS
      - some measurements have an axis direction denoted with 
        '-X', '-Y', '-Z' and other variations.  However the
        mean() and std() measurements make use of the simple
        variation '-X', '-Y', '-Z' so it is only these three
        we will rename 
      MISC
      - there are other observation variations (see all 'angle'
        prefixed measurements) that are not covered by the above
        however since we will not need these in the final Tidy Data
        output we will not rename these and simply keep them as is.
      EXAMPLES
      - tBodyGyroJerk-mean()-X ==> timeBodyGyroJerkMeanX
      - fBodyAccJerk-std()-Z   ==> freqBodyAccJerkStdZ
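
The renaming rules above can be expressed as a short chain of substitutions. This is a sketch under the rules as stated, not necessarily the code in run_analysis.R; the function name `tidy_feature_name` is hypothetical. The prefix patterns require an upper-case letter after 't' or 'f' so that key columns such as subjectID are left alone.

```r
# Hypothetical helper implementing the PREFIX, FUNCTION and AXIS rules
tidy_feature_name <- function(x) {
  x <- sub("^t([A-Z])", "time\\1", x)           # PREFIX: t -> time
  x <- sub("^f([A-Z])", "freq\\1", x)           # PREFIX: f -> freq
  x <- sub("-mean()", "Mean", x, fixed = TRUE)  # FUNCTION: -mean() -> Mean
  x <- sub("-std()", "Std", x, fixed = TRUE)    # FUNCTION: -std()  -> Std
  x <- sub("-([XYZ])$", "\\1", x)               # AXIS: drop the '-' before X, Y, Z
  x
}

renamed <- tidy_feature_name(c("tBodyGyroJerk-mean()-X",
                               "fBodyAccJerk-std()-Z",
                               "subjectID"))
print(renamed)
```

Since sub() is vectorized, the whole data frame could be relabelled in one step with `names(df) <- tidy_feature_name(names(df))`, where `df` stands for the data frame being renamed.
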
    
  • Creates a second, independent tidy data set with the average of each variable for each activity and each subject.

      Since we need an average of each variable for each activity and subject
      this means we can safely exclude 'group', the variable used to distinguish
      between 'test' and 'train' data observations.
      Using the dplyr chaining commands (%>%) we will do this
      one step at a time
      - get all columns except group because we want to be left with
        activity, subject and all measurements but not group
      - group by activity and subject since we need to summarize on this
      - find the mean of each measurement against our group using the
        summarise_all() function call with an argument of funs(mean)
        meaning that we want the mean of each measurement
      As a final step, we use write.table to produce the text file based on the
      tidy data source we produced to be submitted with the project.
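
The summary step can be illustrated on toy data. The sketch below uses base R's aggregate() as a stand-in for the dplyr chain described above, so that it runs without extra packages. The input values, the data frame names `motion` and `tidyData`, the temporary output location and the `row.names = FALSE` setting are assumptions for illustration.

```r
# Toy input after renaming: key columns plus two measurements (made-up values)
motion <- data.frame(
  group = c("train", "train", "test", "test"),
  subjectID = c(1L, 1L, 2L, 2L),
  activityDesc = c("WALKING", "WALKING", "WALKING", "STANDING"),
  timeBodyAccMeanX = c(1, 3, 5, 7),
  timeBodyAccStdX = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

# Drop 'group': it only distinguishes test from train observations
measurements <- motion[, names(motion) != "group"]

# Average every remaining measurement for each activity and subject
tidyData <- aggregate(. ~ activityDesc + subjectID, data = measurements, FUN = mean)

# Write the result as a plain text file (temporary location for this sketch)
out_file <- file.path(tempdir(), "step5Output.txt")
write.table(tidyData, file = out_file, row.names = FALSE)
```

Reading the file back with `read.table(out_file, header = TRUE)` is a quick way to check that it has one row per activity and subject combination.
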
    

step5Output.txt

A copy of the tidy data set extracted from the run_analysis.R program following the instructions required for the project.

References

[1] Course Project Information: https://class.coursera.org/getdata-030/human_grading

[2] Study information: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones#

[3] Data Source: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
