Summary

The task was to use inertial measurement unit data to learn whether a biceps curl was being performed correctly (label A) or in one of four incorrect ways (labelled B, C, D or E).

The data and its collection were first described here: http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf.

Data was prepared for learning using only very simple cleaning methods to ensure that each feature was available for every sample (i.e., no NAs in the data). A random forest was trained. The out-of-sample error was estimated to be 0.53%, as assessed by the out-of-bag error estimate built into the randomForest method. This estimate was confirmed by assessing accuracy on validation data held out of the initial training data.

The final model was tested on 20 test samples. All were correctly classified.

Preparing the work environment

Libraries required for learning were installed and loaded.

# install.packages("rpart")
# install.packages("rpart.plot")
# install.packages("rattle")
# install.packages("randomForest")
# install.packages("caret")

library(rpart)
library(rattle)
library(randomForest)
library(caret)

Data retrieval

Training data was obtained from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv. Testing data was obtained from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv. For the purpose of this report, the data is included as part of the git repository to ensure reproducibility.
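
If the CSV files are not already present locally, they can be re-downloaded with base R. A minimal sketch (the destination file names are assumptions chosen to match the read.table calls below; method="curl" may be dropped on platforms whose default download method handles https):

#download each file only if it is not already in the working directory
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainURL, destfile="pml-training.csv", method="curl")
if (!file.exists("pml-testing.csv"))  download.file(testURL,  destfile="pml-testing.csv",  method="curl")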

training <- read.table(file="pml-training.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
testing <-  read.table(file="pml-testing.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)

Preprocessing data

A single function was written to preprocess the training and testing data identically. Briefly, all data columns were cast to numeric. Metadata columns irrelevant to learning the class (columns 1 to 7) were removed. Columns containing any NAs were removed. The class column was cast to a factor.

Checks were performed to ensure that training and testing data had no NAs and that they had identical data structures.

clean <- function(theseData) {
  #cast all the "character" data columns to numeric
  for (i in 1:length(theseData)) {
    if (i > 7 && i < 160 && is.character(theseData[1, i])) {
      cat(paste("converting ", colnames(theseData[i]), " to numeric"))
      theseData[, i] <- as.numeric(theseData[, i])
      count <- sum(theseData[, i], na.rm=TRUE)
      cat(paste(" which now has a column sum of ", count, "\n"))
    } else {
      cat(paste(i, " is ok\n"))
    }
  }

  #remove columns 1:7 and any with NAs
  numNAs <- apply(theseData, 2, function(x) sum(is.na(x)))
  noNAs <- which(numNAs == 0)
  theseColumns <- noNAs[8:length(noNAs)]
  theseData2 <- theseData[, theseColumns]

  #change the classe variable to a factor (last column in both training and testing data sets)
  lastColumn <- length(theseData2)
  theseData2[, lastColumn] <- as.factor(theseData2[, lastColumn])

  return(theseData2)
}


#clean training data
cleanTraining <- clean(training)
dim(cleanTraining)
sum(is.na(cleanTraining))
str(cleanTraining)

#clean testing data
cleanTesting <- clean(testing)
dim(cleanTesting)
sum(is.na(cleanTesting))
str(cleanTesting)
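
As an extra structural check (a sketch, not part of the original pipeline), the predictor columns of the two cleaned data sets can be compared directly. The assumption is that the same predictor columns survive the NA filter in both sets, with only the final column differing (classe in training, problem_id in testing).

#compare predictor column names, ignoring the final column of each data set
identical(colnames(cleanTraining)[-ncol(cleanTraining)],
          colnames(cleanTesting)[-ncol(cleanTesting)])
#confirm neither cleaned data set contains missing values
anyNA(cleanTraining)
anyNA(cleanTesting)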

Data partitioning

Training data was partitioned into training data (70%) and validation data (30%) subsets.

thisTraining <- cleanTraining

## Create training and validation sets
set.seed(123)
inTrain <- createDataPartition(y=thisTraining$classe, p=0.7, list=FALSE)
training <- thisTraining[inTrain,]
validation <- thisTraining[-inTrain,]
dim(training)
## [1] 13737    53
dim(validation)
## [1] 5885   53

Learning

Multiple methods and packages were assessed for learning. A simple decision tree was attempted first.

Performance was poor, with an overall accuracy of only 0.552 (see below). This is still better than random guessing: there are five classes in this problem, so guessing a class uniformly at random would be correct only about 20% of the time.
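
A slightly stronger baseline is to always predict the most common class; its prevalence is the No Information Rate (0.284) reported in the confusion matrix below. A quick check of the class proportions in the training partition (not part of the original code):

#class proportions in the training partition; the largest value is the accuracy
#achievable by always predicting the most frequent class
round(prop.table(table(training$classe)), 3)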

# train the model
modFit <- train(classe ~ ., method="rpart", 
                trControl=trainControl(method='cv', number=5, verboseIter=TRUE, allowParallel=TRUE), 
                data=training) #<---rpart is an r package for decision tree learning

## plot decision tree
fancyRpartPlot(modFit$finalModel)
## Loading required package: rpart.plot
## Loading required package: RColorBrewer

[Plot of chunk decisionTree: decision tree fitted by rpart, drawn with fancyRpartPlot]

## predicting new values
predictions <- predict(modFit,newdata=validation)  
# make the confusion matrix
confusionMatrix(predictions,validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1061  235   27   64   13
##          B  163  631   42  133  281
##          C  341  230  819  509  247
##          D  102   43  138  258   60
##          E    7    0    0    0  481
## 
## Overall Statistics
##                                         
##                Accuracy : 0.552         
##                  95% CI : (0.539, 0.565)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.437         
##  Mcnemar's Test P-Value : <2e-16        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.634    0.554    0.798   0.2676   0.4445
## Specificity             0.919    0.870    0.727   0.9303   0.9985
## Pos Pred Value          0.758    0.505    0.382   0.4293   0.9857
## Neg Pred Value          0.863    0.890    0.945   0.8664   0.8886
## Prevalence              0.284    0.194    0.174   0.1638   0.1839
## Detection Rate          0.180    0.107    0.139   0.0438   0.0817
## Detection Prevalence    0.238    0.212    0.365   0.1021   0.0829
## Balanced Accuracy       0.777    0.712    0.763   0.5990   0.7215
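
The overall accuracy above can also be recovered directly from the stored predictions as a one-line sanity check (not part of the original code); it should reproduce the 0.552 figure:

#fraction of validation samples the decision tree classified correctly
mean(predictions == validation$classe)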

Learning with a random forest approach

The approach used by the authors of the original paper was a bagging method involving 10 random forests. I next tried a single random forest, which is closer to the method used in the original study.

The mtry parameter of the randomForest method was first optimized using tuneRF.

#try directly using randomForest - see tutorial at http://scg.sdsu.edu/rf_r/
#first assess best value for mtry
best.mtry <- tuneRF(training[-53], 
                    training$classe, 
                    ntreeTry=100, 
                    stepFactor=1.5, 
                    improve=0.01, 
                    trace=TRUE,
                    plot=TRUE,
                    doBest=FALSE)
## mtry = 7  OOB error = 0.66% 
## Searching left ...
## mtry = 5     OOB error = 0.74% 
## -0.1209 0.01 
## Searching right ...
## mtry = 10    OOB error = 0.61% 
## 0.07692 0.01 
## mtry = 15    OOB error = 0.64% 
## -0.04762 0.01

[Plot of chunk randomForest: OOB error for each mtry value tried by tuneRF]

modfit <- randomForest(classe ~ ., data=training, mtry=10, ntree=1000, keep.forest=TRUE, importance=TRUE)
#runtime less than 10 minutes - MacBook Air, 1.8 GHz Intel i5, 4GB RAM

modfit
## 
## Call:
##  randomForest(formula = classe ~ ., data = training, mtry = 10,      ntree = 1000, keep.forest = TRUE, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 0.53%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3902    3    0    0    1    0.001024
## B   11 2642    5    0    0    0.006020
## C    0   16 2377    3    0    0.007930
## D    0    0   24 2226    2    0.011545
## E    0    0    4    4 2517    0.003168
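
Because the forest was fit with importance=TRUE, variable importance can also be inspected. A brief sketch using functions from the randomForest package (not part of the original analysis):

#plot the 20 most important predictors (mean decrease in accuracy and in Gini)
varImpPlot(modfit, n.var=20, main="Variable importance")
#show the raw importance scores for the first few predictors
head(importance(modfit))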

Out-of-sample error estimation

The out-of-bag error rate was estimated to be 0.53% (see above). The out-of-bag (OOB) error serves as an estimate of the out-of-sample error: it is an unbiased estimate of the test set error, computed internally by the random forest method itself from the cases left out of each tree's bootstrap sample.

The following explanation is from http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr.

“In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:

Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.

Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.”
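
The OOB estimate quoted above can also be read directly out of the fitted object: randomForest stores a per-tree error matrix in err.rate, whose "OOB" column gives the error of the forest grown up to each tree. A small sketch (not part of the original code):

#OOB error rate after all 1000 trees; this should match the 0.53% reported above
tail(modfit$err.rate[, "OOB"], 1)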

Since I had already set aside a validation set from my training data, I chose to also estimate the out-of-sample error rate on this unseen validation set. In this case, the out-of-sample error was estimated to be 0.5% (1 minus the 0.995 accuracy reported below). This is a single hold-out estimate rather than a full cross-validation, so it will not be as reliable as the OOB error estimated by the randomForest method.

#make predictions on validation data
predictions <- predict(modfit,validation)

#make the confusion matrix
confusionMatrix(predictions,validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    5    0    0    0
##          B    1 1134   12    0    0
##          C    0    0 1014   11    0
##          D    0    0    0  952    0
##          E    0    0    0    1 1082
## 
## Overall Statistics
##                                         
##                Accuracy : 0.995         
##                  95% CI : (0.993, 0.997)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.994         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.999    0.996    0.988    0.988    1.000
## Specificity             0.999    0.997    0.998    1.000    1.000
## Pos Pred Value          0.997    0.989    0.989    1.000    0.999
## Neg Pred Value          1.000    0.999    0.998    0.998    1.000
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.193    0.172    0.162    0.184
## Detection Prevalence    0.285    0.195    0.174    0.162    0.184
## Balanced Accuracy       0.999    0.996    0.993    0.994    1.000

Predictions for test set and results

The model was used to predict classifications for the 20 samples in the test set. These predictions were prepared for submission. All 20 submissions were correct.

#apply the model to the real test data
answers <- predict(modfit,testing)

#########submit the answers - see https://class.coursera.org/predmachlearn-004/assignment/view?assignment_id=5

pml_write_files = function(x){
  #write each prediction to its own text file, named by problem id
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}

pml_write_files(answers)