Confusion matrix & f1-score

Confusion matrix

|               | actual (yes) | actual (no) |
|---------------|--------------|-------------|
| predict (yes) | TP           | FP          |
| predict (no)  | FN           | TN          |
  • True Positive (TP): what we predict Positive is really Positive.
  • True Negative (TN): what we predict Negative is really Negative.
  • False Negative (FN): what we predict Negative is actually Positive.
  • False Positive (FP): what we predict Positive is actually Negative.
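
As a quick check of these definitions, here is a minimal counting sketch. The toy y_true / y_pred arrays are assumptions made up for illustration:

import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # actual labels (toy data)
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # predicted labels (toy data)

TP = np.sum((y_pred == 1) & (y_true == 1))  # predicted yes, actually yes
FP = np.sum((y_pred == 1) & (y_true == 0))  # predicted yes, actually no
FN = np.sum((y_pred == 0) & (y_true == 1))  # predicted no, actually yes
TN = np.sum((y_pred == 0) & (y_true == 0))  # predicted no, actually no
print(TP, FP, FN, TN)  # 3 1 1 3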

Figure: "This guy is pregnant?" (a classic illustration of Type I and Type II errors).

How to remember?

  • True/False indicates what we predicted is right/wrong.
  • Positive/Negative is what we predicted (yes or no).

Type I / Type II errors

  • FP = Type I error = rejection of a true null hypothesis = an actual negative is predicted wrongly = what we predict positive is actually negative.
  • FN = Type II error = non-rejection of a false null hypothesis = an actual positive is predicted wrongly = what we predict negative is actually positive.

Why is the CM important?

It gives a general view of our model ("is it really good?") thanks to precision and recall!

Precision & Recall

|               | actual (yes) | actual (no) |             |
|---------------|--------------|-------------|-------------|
| predict (yes) | TP           | FP          | → Precision |
| predict (no)  | FN           | TN          |             |
|               | ↓ Recall     |             |             |
  • Precision: How many of our positive predictions are really true? (Check the accuracy of our positive predictions).

    $$\mathrm{precision} = \dfrac{\mathrm{true\, positive}}{\mathrm{positively\, predicted\, results}} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.$$
  • Recall: How many of the actual positives do our predictions capture? (Do we miss any actual positives?)

    $$\mathrm{recall} = \dfrac{\mathrm{true\, positive}}{\mathrm{actually\, positive\, results}} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
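
A minimal sketch of both metrics with scikit-learn, reusing the toy y_true / y_pred labels assumed above:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # toy labels (assumed)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75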

Figure: an example of using a confusion matrix for recognizing the digit 5. Figure taken from this book.

When to use?

  • Precision matters most when a "wrongly predicted yes" (FP) is costly (e.g. "Is this email spam?" -- predicted yes but actually no, and we lose important emails!).
  • Recall (Sensitivity) matters most when a "wrongly predicted no" (FN) is costly (e.g. in banking, "Is this transaction fraudulent?" -- predicted no but actually yes, and we lose money!).

Precision / Recall curve

Using the classifier's decision scores, we can call precision_recall_curve() to compute precision and recall for all possible thresholds.

Figure: an example of a Precision/Recall curve over many thresholds. Figure taken from this book.

Trade-off: higher precision means lower recall, and vice versa.

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# precision/recall for every possible decision threshold
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

# precisions/recalls have one extra element, so drop the last value when plotting vs thresholds
plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
plt.xlabel("Threshold")
plt.legend()
plt.show()
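
If we want to act on this trade-off, one possible sketch (assuming the precisions, thresholds and y_scores arrays from the snippet above, and an arbitrary 90% precision target) is to pick the lowest threshold that reaches that precision:

import numpy as np

# first index where precision reaches at least 90% (target value is an assumption)
idx_90_precision = np.argmax(precisions >= 0.90)
threshold_90_precision = thresholds[idx_90_precision]  # corresponding decision threshold
y_pred_90 = (y_scores >= threshold_90_precision)       # predictions at that threshold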

F1-Score

High precision but low recall, or vice versa? The F1-score gives us a balance between precision and recall.

$$f_1 = \left( \frac{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}{2} \right)^{-1} = 2 \times \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

The F1-score depends on which class we label as "positive": "Is this email spam?" is very different from "Is this email not spam?"

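A minimal sketch with scikit-learn, reusing the toy labels assumed above; the pos_label argument controls which class counts as "positive":

from sklearn.metrics import f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # toy labels (assumed)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# 2 * 0.75 * 0.75 / (0.75 + 0.75) = 0.75; pass pos_label=0 to treat class 0 as "positive" instead
print(f1_score(y_true, y_pred))
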
When to use F1-Score?

  • When you need a balance between precision and recall.
  • When we have a "skewed class" problem (uneven class distribution, too many "yes" and very few "no", for example).
  • If one of precision and recall improves but the other drops a lot, the F1-score stays very small!

How to interpret the f1-score value?

Normally, $f_1 \in (0,1]$: the higher the value, the better our model.

  • The best case ($f_1 = 1$): both precision and recall reach $100\%$.
  • If either precision or recall is very small (close to 0), $f_1$ is also very small and our model is not good!

What if we prefer one of precision and recall over the other? We consider $f_{\beta}$ [ref]:

$$f_{\beta} = (1 + \beta^2)\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision} + \text{recall}}$$

$f_1$ is a special case of $f_{\beta}$ when $\beta = 1$:

  • When precision is more important than recall, we choose $\beta < 1$ (usually $\beta = 0.5$).
  • When recall is more important than precision, we choose $\beta > 1$ (usually $\beta = 2$).
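
A short sketch with scikit-learn's fbeta_score, again reusing the toy labels assumed above:

from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # toy labels (assumed)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(fbeta_score(y_true, y_pred, beta=0.5))  # beta < 1: weights precision more than recall
print(fbeta_score(y_true, y_pred, beta=2))    # beta > 1: weights recall more than precision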

Accuracy / Specificity

  • Accuracy: How many of all our predictions are correct?

    $$\mathrm{accuracy} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$
  • Specificity: How many of the actual negatives do we predict correctly?

    $$\mathrm{specificity} = \dfrac{\mathrm{TN}}{\mathrm{FP} + \mathrm{TN}}$$
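
A minimal sketch computing both, reusing the toy labels assumed above; specificity is derived from the confusion matrix by hand:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # toy labels (assumed)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(accuracy_score(y_true, y_pred))  # (TP + TN) / total = 6 / 8 = 0.75

# compute specificity directly from the confusion matrix cells
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (fp + tn))  # TN / (FP + TN) = 3 / 4 = 0.75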

When to use?

  • Accuracy is used when we have symmetric (balanced) datasets.
  • Specificity is used when we care about TN values and don't want false alarms from FP values (e.g. drug tests).

The ROC Curve

  • ROC = Receiver operating characteristic.
  • A common tool used with binary classifiers.
  • Different from the precision/recall curve, the ROC curve plots the true positive rate (recall) against the false positive rate (1 - specificity).

Figure: an example of an ROC curve, plotting FPR vs TPR for all possible thresholds. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner). Figure taken from this book.

Trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces.

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline

# FPR and TPR for every possible decision threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# create plot
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal = purely random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

The AUC

  • AUC = Area under the curve.
  • A perfect classifier has AUC = 1 (its curve covers the whole unit square, hugging the top-left corner).
  • The purely random classifier (dotted line) will have AUC = 0.5.
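
A one-line sketch with scikit-learn, assuming the y_test and y_pred_prob arrays from the ROC example above:

from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_test, y_pred_prob))  # 1.0 = perfect classifier, 0.5 = random guessing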

Confusion Matrix & F1-Score with Scikit-learn

from sklearn.metrics import confusion_matrix
import numpy as np

# one row/column per class; labels fixes the order of the classes (assuming classes are 0..n-1)
n_classes = len(np.unique(y_true))
confusion_matrix(y_true, y_pred, labels=range(n_classes))

Precision / Recall / f1-score / support

from sklearn.metrics import classification_report
# precision, recall, f1-score and support for each class
print(classification_report(y_test, y_pred))

References

  1. Classification: Precision and Recall - Google Developers, Machine Learning Crash Course.
  2. Classification: Check Your Understanding (Accuracy, Precision, Recall) - Google Developers, Machine Learning Crash Course.
  3. F-measure versus Accuracy - NLP blog.
  4. Accuracy, Precision, Recall or F1? - Koo Ping Shung, Towards Data Science.
  5. Dealing with Imbalanced data: undersampling, oversampling and proper cross-validation - Marco Altini.
  6. Accuracy, Recall, Precision, F-Score & Specificity, which to optimize on? - Salma Ghoneim, Towards Data Science.
