ROC Curve in Machine Learning
Hi! Today we will learn about one of the most frequently asked topics in data science interviews:
ROC — Receiver Operating Characteristic curve
A curve that shows the performance of a classification model at all classification thresholds.
Why do we need ROC?
Before we dive into ROC, let's understand why we need it. This curve is mainly used in binary classification problems (predicting between two classes, like spam or not spam). If you have built any classification model, you know that a sigmoid function outputs the probability that a mail belongs to class 1 or class 2. To decide between those two classes we select a threshold, for example 0.5 (used in most cases): if the predicted probability of spam is greater than 0.5 we label the mail spam, otherwise not spam.
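As a minimal sketch of this thresholding step (the probability values here are hypothetical, standing in for sigmoid outputs of a spam classifier):

```python
import numpy as np

# Hypothetical predicted probabilities of "spam" for 6 mails
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

# Label as spam (1) when the probability exceeds the threshold, else not spam (0)
threshold = 0.5
labels = (probs > threshold).astype(int)
print(labels)  # [0 0 0 1 1 1]
```

Changing `threshold` changes which mails get labeled spam, which is exactly why the choice of threshold matters.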
So the question here is: at which threshold value is the classification done most accurately? To solve this problem we take the help of the ROC curve.
To understand ROC better, let's first understand the three topics below.
1 Confusion Matrix
A confusion matrix is a performance evaluation tool in machine learning, representing the accuracy of a classification model. It displays the number of true positives, true negatives, false positives, and false negatives
For example, if we want to test the accuracy of a mail spam classification model, we just sum the TP and TN. If we evaluate a test set containing 200 samples and TP + TN = 180, then the model has labeled 180 out of 200 mails correctly (90% accuracy).
2 True Positive Rate (benefit)
The true positive rate (TPR, also called sensitivity or recall) is calculated as TP/(TP + FN). TPR is the probability that an actual positive will test positive.
Let's take an example to understand it:
Netflix surveys 200 customers about whether they will cancel their subscription. Suppose that among the customers who actually cancel, the model correctly identifies TP = 80 and misses FN = 20.
Then TPR = 80/(80 + 20), which comes to 80%, meaning 80% of the customers who actually cancel are correctly labeled.
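The arithmetic above, spelled out (using the TP and FN counts from the example):

```python
# TPR = TP / (TP + FN) with the survey numbers from the example
tp, fn = 80, 20
tpr = tp / (tp + fn)
print(tpr)  # 0.8, i.e. 80%
```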
3 False Positive Rate (Cost)
The false positive rate (FPR) is calculated as FP/(FP + TN). FPR is the probability that an actual negative will test positive.
Let's take one more example to understand this. For a spam classification model, if in a test set of 200 samples TP and TN are 20 each and FP and FN are 80 each, then applying the formula above gives FPR = 80/(80 + 20) = 80%, meaning 80% of the actual non-spam mails are incorrectly flagged as spam. This is the cost of running the model.
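The same calculation in code (using the FP and TN counts from the example):

```python
# FPR = FP / (FP + TN) with the spam-model numbers from the example
fp, tn = 80, 20
fpr = fp / (fp + tn)
print(fpr)  # 0.8, i.e. 80%
```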
In short, the best case is TPR = 100% (or 1) and FPR = 0% (or 0).
ROC
A ROC curve is simply a graph of TPR against FPR, as shown in the diagram. The closer the curve gets to the top-left corner (TPR = 1, FPR = 0), the more accurate the model's classifications are.
For different threshold values we can create the confusion matrix, calculate the TPR and FPR from it, and plot those points on the graph. From that graph we can select the optimum threshold: the one whose point lies closest to the top-left corner (TPR near 1, FPR near 0).
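Scikit-learn's `roc_curve` does this threshold sweep for us. A sketch with hypothetical labels and model scores (the rule of picking the threshold that maximizes TPR − FPR is one common choice, known as Youden's J statistic):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted scores from a classifier
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])

# roc_curve sweeps thresholds and returns matching FPR/TPR arrays
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the curve summarizes performance across all thresholds
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.875

# One common rule: pick the threshold maximizing TPR - FPR (Youden's J)
best_threshold = thresholds[np.argmax(tpr - fpr)]
print(best_threshold)
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) reproduces the ROC diagram discussed above.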
Thanks !!!
#Machine learning #Roc curve #DataScience