Bayes Classifier Practical =========================== This practical will explore writing a Naive Bayes classifier in R. Before you start ---------------- Running R studio ^^^^^^^^^^^^^^^^ This practical uses R studio. Ensure that you can run R Studio. Installing the packages ^^^^^^^^^^^^^^^^^^^^^^^ The practical will use a number of R packages. Before you use these packages you need to load them:: library(doBy) library(ROCR) library(ggplot2) library(plyr) What do these packages do? * doBy - used to produce data grouped by a feature. * ROCR - used to produce receiver operator characteristic (ROC) curves * ggplot2 - graph plotting * plyr - tools for splitting, applying and combining data If any of these packages fail to load then you can install them with the commands below. Please do not do this unless you have to as R does not check if they exist before downloading the packages and it will take a while to install them:: install.packages("doBy") install.packages("ROCR") install.packages("ggplot2") install.packages("plyr") Downloading the files ^^^^^^^^^^^^^^^^^^^^^ Download the practical R files and data using git:: git clone https://allyhume@bitbucket.org/allyhume/NaiveBayesClassifierPractical.git The data set ------------ The data set we will use for this practical is the `Adult data set`_ from the UCI Machine Learning Repository. This data set can be build a classifier that predicts if adult Americans have an annual salary of less than or equal to $50,000 (<=50K) or above $50,000 (>50K). Look at the website for the `Adult data set`_ to understand it more. You will see that it has both continuous and categorical features. This practical will build a Naive Bayes classifier that uses both these types of features. .. _`Adult data set`: https://archive.ics.uci.edu/ml/datasets/Adult Set your R working directory to be the tutorial's R directory:: setwd("/PATH/TO/NaiveBayesClassifierPractical/R") The training and test data frames can be loaded using:: source("loadData.R") The training data frame is called ``training`` and the test data frame is called ``test``. Use the summary function to see quick summary of the data:: summary(training) You will notice that some of the data features are categorical and some are continuous. Notice also that the salary class in the training set is heavily biased towards <=50K. Spend a few minutes further understanding the data. Naive Bayes Classifier ---------------------- Recall that to implement a Naive Bayes Classifier we wish to use the following equation for each class to determine which class has highest probability of occurring given the feature data: .. image:: _static/pClassGivenData.png :width: 300px So first we need to determine the a priori probability of each class occurring. Assuming our training set is representative then this is easily done. The ``table`` function in R will be useful here:: table(training[["salary"]]) What is the a priori probability of class '<=50'? .. ANSWER: 24720/(24720+7841) = 0.7591904 What is the a priori probability of class '>50K'? .. ANSWER: 7841/(24720+7841) = 0.2408096 Next we need to consider the feature data. Let's start with the education feature. Again we can use R's ``table`` function to obtain the data we need:: table(training[["education"]], training[["salary"]]) This is the data we need to compute the probability of seeing the various feature values for each class. Look at the code in classifier1.R and complete the function so that it uses the education feature value when computing the score for each class. Test the function to ensure it produces the following results for training set instances 1 and 6:: > source('classifier1.R') > classifier1(training[1,], printScores=TRUE) [1] "<=50K: 0.0962501151684531 >50K: 0.0682104357974264" [1] <=50K Levels: <=50K >50K > classifier1(training[6,], printScores=TRUE) [1] "<=50K: 0.0234636528362151 >50K: 0.0294524123951967" [1] >50K Levels: <=50K >50K Now we can apply the classifier to the whole test set and analyse the result:: > predictions <- apply(test, 1, classifier1) > table(predictions, test[,"salary"], dnn=list("predicted", "actual")) actual predicted <=50K >50K <=50K 11881 3027 >50K 554 819 Here we have applied the classifier to all the test examples and produced a confusion matrix. If we consider >50K to the positive then the true positive rate is 819/(3027+819) = 21.3% and the false positive rate is 554/(11881+554) = 4.5%. This seems a very low true positive rate. Maybe using more features will improve matters. Edit the ``classifier1`` function to that it now includes the workclass feature. When completed you should see the following values for instances 1 and 6:: > source('classifier1.R') > classifier1(training[1,], printScores=TRUE) [1] "<=50K: 0.00394177069703957 >50K: 0.0031474880832015" [1] <=50K Levels: <=50K >50K > classifier1(training[6,], printScores=TRUE) [1] "<=50K: 0.0180316773887152 >50K: 0.0191074931656681" [1] >50K Levels: <=50K >50K Now applying this to the whole test set gives:: > predictions <- apply(test, 1, classifier1) > table(predictions, test[,"salary"], dnn=list("predicted", "actual")) actual predicted <=50K >50K <=50K 11764 2854 >50K 671 992 Is this better? We have more true positives (992 compared with 819) but we also have more false positives (671 compared with 554). Our true positive rate is now 25.8% and our false positive rate is now 5.4%. It is hard to tell if this is really any better. ROC Curves ---------- Choosing the best classifier requires considering a trade-off between between the true positive rate and the false positive rate. Our trained classifier has given as a single pair of true positive and false positive rates (TPR=25.8%, FPR=5.4%) but in fact it can support many other rate pairs if we simply adjust our sensitivity to predicting high salaries. We can adjust the rate simply by adding a weight to the score for high salaries. Let's add a weight of 2.0 to high salaries. This should result in more true positives at the expense of more false positives. Add a weight of 2.0 to the high salary score and see how well the classifier works now. The true positive rate should now be 53.4% with a false positive rate of 18.8%. If we adjust the weight for high salary we can get get a whole range of true positive rates for 0.0 to 1.0 (100%). If we alter our classifier code to return the ratio of high salary score to low salary score then we can easily adjust the weighting after we have applied the score all our training examples: * Make of copy of your classifier1.R file and call it classifier2.R * Remove any code you added to adjust the weight for the high salaries. * Change the name of the function from ``classifier1`` to ``classifier2``. * Change the function to return ``highSalaryScore/lowSalaryScore`` Now the threshold can be altered effectively adjust the weighting for a high salary. If the threshold is 1.0 we get the same results are before:: > source('classifier2.R') > ratios <- apply(test, 1, classifier2) > predictions <- sapply(ratios, function(x) { if (x >= 1.0) gt50K else lt50K}) > table(predictions, test[,"salary"], dnn=list("predicted", "actual")) actual predicted <=50K >50K <=50K 11764 2854 >50K 671 992 But if we set a lower threshold then we effectively increase the weight of the high salary class. Setting a threshold of 0.5 is equivalent to a high salary weighting of 2.0:: > predictions <- sapply(ratios, function(x) { if (x >= 0.5) gt50K else lt50K}) > table(predictions, test[,"salary"], dnn=list("predicted", "actual")) actual predicted <=50K >50K <=50K 10100 1792 >50K 2335 2054 Graphs showing all the possible true positive and false positive rate combinations are known as receiver operating characteristic (ROC) curves. They always start in the bottom left corner (TPR=FPR=0) and end in the top right corner (TPR=FTP=1.0). Typically the closer the curve gets to the top left corner the better. The RCOC library includes functions to draw ROC curves. Use these to draw a ROC curve for our classifier using the library's prediction and performance functions:: > library(ROCR) > ratios <- apply(test, 1, classifier2) > pred <- prediction(ratios, test["salary"]) > perf <- performance(pred,"tpr","fpr") > plot(perf) This curve should contain the TPR and FPR pairs identified above. Use R's ``abline`` function to draw lines for the known TPR and FPR rates to ensure the intersect on the ROC curve. Use ``?abline`` to view the documentation for the ``abline`` function. You will wish to use the ``v`` and ``h`` parameters when calling ``abline``. The area under the ROC curve is a useful single measure of the of how good the classifier is. The ``performance`` function can be used to return the area under the curve:: > performance(pred,"auc")@y.values [[1]] [1] 0.7247961 So the area under the ROC curve for a Naive Bayes classifier using just the education and workclass features is 0.72. Adding continuous features -------------------------- For categorical features such as the ones we have used so we can determine the required probability models simply by counting. This approach does not work for continuous data. For continuous data we must model the data using a appropriate distribution. Consider the age data in our data set. We can plot the density histogram of this feature for the low and high salary groups using the drawAgeHistograms.R script:: > source('drawAgeHistograms.R') You can see that there is a clear difference between the two groups although there is significant overlap. We wish to model this distributions using the normal distribution (also called the Gaussian distribution). The normal distribution is specified by two parameters: mean and standard deviation. Using the doBy library's ``summaryBy`` function we can easily calculate the mean and standard deviation for the two groups:: > library(doBy) > summaryBy(age~salary,data=training,FUN=c(mean,sd)) To see how well the normal distributions match the data we can plot them on top of the data:: > source('drawAgeHistogramsWithNormalModels.R') The normal model fits well for the high salary group but is less good a fit for the low salary group. Nonetheless we will carry on with these values and add the age feature into our classifier. Adjust your classifier to use the age feature: * Use the ``dnorm`` function for the normal probability density function * You may have to use the ``as.numeric`` function to convert the age to a number: * ``age <- as.numeric(x[["age"]])`` * Do not worry about storing the results from the ``summaryBy`` function and indexing them in your code. Feel free to keep things simple and type the mean and standard deviation numbers into your code for this exercise. Applying your classifier to the test set should produce the following confusion matrix:: > source('classifier2.R') > ratios <- apply(test, 1, classifier2) > predictions <- sapply(ratios, function(x) { if (x >= 1.0) gt50K else lt50K}) > table(predictions, test[,"salary"], dnn=list("predicted", "actual")) actual predicted <=50K >50K <=50K 11555 2523 >50K 880 1323 Re-plot the ROC curve for the new classifier. Does it look better? What is the area under the curve now. .. It should look better and the area under the curve should be 0.79 Optional additional work ------------------------ Add more features to your Naive Bayes classifier. Adding categorical features education number, marital status, occupation, relationship, race, sex and native country increases the area under the ROC curve to 0.88. If you get ``subscript out of bounds`` errors when accessing the table for education number it could be because R is strangely adding white space. Trying using the following function to trim the white space:: trim <- function( x ) { gsub("(^[[:space:]]+|[[:space:]]+$)", "", x) } Then read the value for education number as using:: f <- trim(x[["education.num"]]) Of the remaining continuous features only fnlwgt fits well onto the normal distribution but it seems to have little difference between the low and high salary cases.