ML EVALUATION METRICS
When would ACCURACY not work? Measuring performance is subtle. We divide the data into a) a training set, b) a validation set, and c) a test set. The larger the validation set, the more precise the performance estimate.
Tip: Remember that cross-validation is a slow process; if possible, collecting more data is a good alternative.
What is accuracy: Accuracy is a simple metric: out of all the data points, how many are classified correctly. This is intuitively easy to understand. The equation is:
Accuracy = Correctly classified points / Total data points
If we look at the classification in Fig-2, there are 6 data points marked blue that are correctly classified as positive and 5 marked red that are correctly classified as negative, so 11 points in total are correctly classified. Accuracy = 11/(11+3) = 11/14 = 78.57%. Typically, a predictive classifier with accuracy above 90% is considered good. The error rate, in turn, is the number of incorrect predictions divided by the total number of predictions on the test set.
Error Rate = Incorrectly classified points / Total data points
Hence, we can say that
Error Rate = 1 - Accuracy and vice versa (Accuracy = 1 - Error Rate)
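The Fig-2 arithmetic above can be checked with a few lines of scikit-learn. The label vectors below are assumed for illustration (1 = positive, 0 = negative; 11 of the 14 points match), not taken from the figure itself:

```python
# Reproduce the accuracy computation from Fig-2: 6 true positives,
# 5 true negatives, and 3 misclassified points out of 14 total.
from sklearn.metrics import accuracy_score

y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)   # 11 correct / 14 total
error_rate = 1 - acc                   # complement of accuracy
print(round(acc * 100, 2))             # 78.57
```

The same relationship, Error Rate = 1 - Accuracy, falls out directly from the complement in the last line.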
Shortcomings of Accuracy:
Accuracy is not always the right metric to use and can be misleading. The classic example is credit card fraud detection: thousands of transactions are genuine and very few are fraudulent, so a model can reach 99% accuracy while catching no fraud. Another example is the publicly available Enron dataset, where a classifier separates innocent employees from fraudulent ones; out of thousands of employees, only a few are fraudsters. This is the class imbalance problem. Achieving 95% or even 99% accuracy on an imbalanced dataset is trivial, so accuracy is not an ideal metric for skewed classes. In this tutorial, you will learn, with a code example, how the accuracy formula above becomes misleading for a classification model on an imbalanced dataset.
For any learning algorithm, the goal is to maximize accuracy (an optimization problem). Second, the learning algorithm (in this case a classifier) builds its model from data drawn from the training set. As a result of these two assumptions, a classifier trained on an imbalanced dataset produces an unsatisfactory model: if 99% of the data belong to one class (the majority class), then on most realistic problems the learning algorithm will be hard pressed to do better than 99% accuracy, and it can reach that number simply by predicting the majority class label for every point.
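A minimal sketch of that last point, using a synthetic 99:1 label split (the data here is made up for illustration): a baseline that always predicts the majority class already scores 99%.

```python
# With 990 negatives and 10 positives, "always predict the majority
# class" achieves 99% accuracy without learning anything useful.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X = np.zeros((1000, 2))              # features are irrelevant to this baseline
y = np.array([0] * 990 + [1] * 10)   # 99% majority class

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

print(accuracy_score(y, y_pred))     # 0.99
```

Note that this "99% accurate" model misses every single positive case, which is exactly why accuracy fails on skewed classes.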
First, we create an imbalanced dataset with a 1:100 class distribution. We can generate synthetic data using SMOTE or scikit-learn's make_blobs function. Since make_blobs() generates classes with equal distribution, we write a function to oversample one of the classes. Scikit-learn also provides a DummyClassifier that makes predictions using simple rules; here we use a real LogisticRegression classifier and build the model on the imbalanced data.
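The steps above can be sketched roughly as follows. This is a simplified stand-in for the linked notebook, not its actual code: instead of oversampling, it thins one of the make_blobs classes down to approximate the 1:100 split, and the specific sample counts and random seeds are assumptions.

```python
# Build an ~1:100 imbalanced dataset from make_blobs (which is balanced
# by default), then compare a trivial baseline with LogisticRegression.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=2000, centers=2, random_state=42)
X_maj = X[y == 0]                    # keep all 1000 majority points
X_min = X[y == 1][:10]               # keep only 10 minority points
X_imb = np.vstack([X_maj, X_min])
y_imb = np.array([0] * len(X_maj) + [1] * len(X_min))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, stratify=y_imb, random_state=42)

for clf in (DummyClassifier(strategy="most_frequent"), LogisticRegression()):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, accuracy_score(y_te, clf.predict(X_te)))
```

Both models report very high accuracy here, which is the point: on this split, accuracy alone cannot distinguish a real classifier from a constant-prediction baseline.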
All the code is available at https://github.com/ppujari/udacity/blob/master/logistic_classifier_on_imbalanced_data.ipynb
Further Reading: