Tidy Data Set | Human Activity Recognition Using Smartphones Dataset

The raw data set was analysed and converted to a tidy data set in the following steps.

Extracting Feature Names

Feature names were extracted from features.txt.
Only a subset of the 561 features in this list was needed, so the required features were extracted from it. This could have been done by simply searching the whole list with grep for column names that contain "mean" or "std". To be sure that only the necessary columns were picked, the indexes of the mean and standard deviation features were instead extracted manually for the time-derived, Euclidean norm, Fourier transform and angle-derived signals. Those indexes were used to extract the required feature names, and the same indexes were later used to extract the required measurements from X_train.txt and X_test.txt. This was done as follows:

 timeDerivedIndexes<-c(1:6,41:46,81:86,121:126,161:166)
 eucleadianNormIndexes<-c(201:202,214:215,227:228,240:241,253:254)
 fftIndexes<-c(266:271,345:350,424:429)
 angleIndexes<-c(555:561)
 requiredIndexes<-c(timeDerivedIndexes,eucleadianNormIndexes,fftIndexes,angleIndexes)
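To illustrate the trade-off between the two approaches, the sketch below runs the grep approach on a small, made-up sample of names in the style of features.txt (the vector sampleFeatureNames is an assumption, not the real file) and then counts the columns selected by the manual indexes. A plain "mean|std" pattern also matches meanFreq() names, which is one reason explicit indexes can be safer.

```r
# Made-up sample of names in the style of features.txt (illustration only)
sampleFeatureNames <- c(
  "tBodyAcc-mean()-X",
  "tBodyAcc-std()-X",
  "tBodyAcc-mad()-X",
  "fBodyAcc-meanFreq()-X",
  "angle(tBodyAccMean,gravity)"
)

# grep returns the positions whose name contains "mean" or "std" (case-sensitive)
grepIndexes <- grep("mean|std", sampleFeatureNames)
grepMatches <- sampleFeatureNames[grepIndexes]
print(grepMatches)  # includes the meanFreq() name as well

# The manual index vectors from the text select a fixed set of columns
timeDerivedIndexes <- c(1:6, 41:46, 81:86, 121:126, 161:166)
eucleadianNormIndexes <- c(201:202, 214:215, 227:228, 240:241, 253:254)
fftIndexes <- c(266:271, 345:350, 424:429)
angleIndexes <- c(555:561)
requiredIndexes <- c(timeDerivedIndexes, eucleadianNormIndexes, fftIndexes, angleIndexes)
print(length(requiredIndexes))  # 65 columns in total
```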

Subject ids were taken from subject_train.txt or subject_test.txt. The required features (measurements) were read from X_train.txt or X_test.txt in the train or test directory and subsetted using the requiredIndexes created in step 1.

Activity corresponding to each event was taken from y_train.txt or y_test.txt.

Subject ids, features (measurements) and the activity vector extracted in steps 2, 3 and 4 were column-bound using

 requiredSignals=cbind(subjectid,activityVector,requiredSignals)

Steps 2, 3 and 4 were executed first for the train data set and then for the test data set, and finally both results were row-bound using the rbind command.
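The per-split column bind and the final row bind can be sketched as follows. This is a minimal illustration, not the original script: the helper buildSplit and the tiny in-memory data frames are hypothetical stand-ins for the tables that read.table would return for the train and test files.

```r
# Hypothetical helper: bind subject ids, activities and the selected measurements
buildSplit <- function(subjectid, activityVector, measurements, requiredIndexes) {
  requiredSignals <- measurements[, requiredIndexes, drop = FALSE]
  cbind(subjectid = subjectid, activityVector = activityVector, requiredSignals)
}

# Toy stand-ins for the parsed files (4 measurement columns instead of 561)
trainMeasurements <- data.frame(V1 = c(0.1, 0.2), V2 = c(0.3, 0.4),
                                V3 = c(0.5, 0.6), V4 = c(0.7, 0.8))
testMeasurements <- data.frame(V1 = 0.9, V2 = 1.0, V3 = 1.1, V4 = 1.2)
toyIndexes <- c(1, 3)  # plays the role of requiredIndexes

trainSet <- buildSplit(c(1, 3), c(5, 2), trainMeasurements, toyIndexes)
testSet <- buildSplit(2, 4, testMeasurements, toyIndexes)

# Row-bind the train and test results into one data set
mergedData <- rbind(trainSet, testSet)
print(mergedData)
```

Wrapping the per-split work in one function keeps the train and test paths identical, so the two results always have matching columns for rbind.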

The merged data obtained in the previous step was sorted first by subject id and then by activity vector, giving a clear per-subject, per-activity arrangement of the measurements.
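A minimal sketch of this sort, assuming the merged data frame has columns named subjectid and activityVector as in the cbind call; the toy data frame and its signal column are made up for illustration.

```r
# Toy merged data (made-up values)
mergedData <- data.frame(subjectid = c(2, 1, 2, 1),
                         activityVector = c(3, 5, 1, 2),
                         signal = c(0.4, 0.1, 0.3, 0.2))

# order() sorts by subject id first and breaks ties by activity
sortedData <- mergedData[order(mergedData$subjectid, mergedData$activityVector), ]
print(sortedData)
```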

Activity labels were taken from activity_labels.txt and a replacement vector was created to replace activity numbers by their actual labels.

activityLabels<-c("1"="WALKING","2"="WALKING_UPSTAIRS","3"="WALKING_DOWNSTAIRS","4"="SITTING","5"="STANDING","6"="LAYING")
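One way such a named vector can be applied is character indexing, sketched below. This is an assumption about usage, since the original script may apply the replacement differently; the sample activityVector is made up.

```r
activityLabels <- c("1" = "WALKING", "2" = "WALKING_UPSTAIRS", "3" = "WALKING_DOWNSTAIRS",
                    "4" = "SITTING", "5" = "STANDING", "6" = "LAYING")

# Made-up activity numbers as they would appear in y_train.txt or y_test.txt
activityVector <- c(5, 1, 6, 2)

# Look up each number by name; unname() drops the "1".."6" names from the result
activityNames <- unname(activityLabels[as.character(activityVector)])
print(activityNames)
```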

Column names were made descriptive by using gsub to replace some patterns, such as t with time and f with frequency.
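The exact patterns are not listed here, so the following is only a sketch with two assumed substitutions (a leading t becomes time, a leading f becomes frequency) applied to made-up column names.

```r
# Made-up column names in the style of features.txt
columnNames <- c("tBodyAcc-mean()-X", "fBodyGyro-std()-Z")

# Anchoring with ^ limits the replacement to the prefix letter
columnNames <- gsub("^t", "time", columnNames)
columnNames <- gsub("^f", "frequency", columnNames)
print(columnNames)
```

Without the ^ anchor, every t or f inside a name would be replaced as well, so anchored patterns are the safer choice for prefix renames.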

The data was summarized using the summarise_each function of the dplyr package, with mean as the summary function, after grouping by subject id with group_by.
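The same group-and-average idea can be sketched in base R with aggregate(), which is swapped in here for summarise_each so that the example has no package dependency. The toy data frame and its column names are made up.

```r
# Toy data: two subjects with two observations each (made-up values)
sortedData <- data.frame(subjectid = c(1, 1, 2, 2),
                         signalA = c(0.2, 0.4, 1.0, 3.0),
                         signalB = c(10, 20, 30, 50))

# Mean of every measurement column within each subject id
summaryData <- aggregate(. ~ subjectid, data = sortedData, FUN = mean)
print(summaryData)
```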

As a last step, the cleaned data was written to cleaned_data.txt with the parameters row.names=FALSE and quote=FALSE.
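A small sketch of the write step with the same two parameters. The toy data frame is made up, and the file is written to a temporary directory here (an assumption for the example) rather than the working directory.

```r
# Toy cleaned data (made-up values)
cleanedData <- data.frame(subjectid = c(1, 2),
                          activity = c("WALKING", "SITTING"),
                          meanSignal = c(0.25, 0.5))

# Write without row names and without quotes around strings
outputFile <- file.path(tempdir(), "cleaned_data.txt")
write.table(cleanedData, file = outputFile, row.names = FALSE, quote = FALSE)

# Read the file back to inspect what was written
writtenLines <- readLines(outputFile)
print(writtenLines)
```

A file written this way can be read back with read.table(outputFile, header = TRUE).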