6.1. Classification
Below is a detailed list of metrics commonly used to evaluate the accuracy and performance of classification and regression models in machine learning, including neural networks. The metrics are categorized based on their applicability to classification or regression tasks, with explanations of their purpose and mathematical formulations where relevant.
Classification Metrics
Classification tasks involve predicting discrete class labels. The following metrics assess the accuracy and effectiveness of such models; a short computational sketch follows the table.
Metric | Purpose | Use Case |
---|---|---|
Accuracy \( \displaystyle \frac{TP + TN}{TP + TN + FP + FN} \) | Measures the proportion of correct predictions across all classes | Suitable for balanced datasets but misleading for imbalanced ones |
Precision \( \displaystyle \frac{TP}{TP + FP} \) | Evaluates the proportion of positive predictions that are actually correct | Important when false positives are costly (e.g., spam detection) |
Recall (Sensitivity) \( \displaystyle \frac{TP}{TP + FN} \) | Assesses the proportion of actual positives correctly identified | Critical when false negatives are costly (e.g., disease detection) |
F1-Score \( \displaystyle 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \) | Harmonic mean of precision and recall, balancing both metrics | Useful for imbalanced datasets where both precision and recall matter |
AUC-ROC Area under the curve plotting True Positive Rate (Recall) vs. False Positive Rate \( \displaystyle \left( \frac{FP}{FP + TN} \right) \) | Measures the model’s ability to distinguish between classes across all thresholds | Effective for binary classification and assessing model robustness |
AUC-PR Area under the curve plotting Precision vs. Recall | Focuses on precision and recall trade-off, especially for imbalanced datasets | Preferred when positive class is rare (e.g., fraud detection) |
Confusion Matrix | Provides a tabular summary of prediction outcomes (TP, TN, FP, FN) | Offers detailed insights into class-specific performance, especially for multi-class problems |
Hamming Loss \( \displaystyle \frac{1}{N} \sum_{i=1}^N \frac{1}{L} \sum_{j=1}^L \mathbf{1}(y_{ij} \neq \hat{y}_{ij}) \) | Calculates the fraction of incorrect labels to the total number of labels | Suitable for multi-label classification tasks |
Balanced Accuracy \( \displaystyle \frac{1}{C} \sum_{i=1}^C \frac{TP_i}{TP_i + FN_i} \) | Average of recall obtained on each class, useful for imbalanced datasets | Effective for multi-class problems with class imbalance |
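To make the formulas above concrete, here is a minimal sketch of how these metrics could be computed for a binary classifier using scikit-learn (assumed available); the arrays `y_true` and `y_prob` are illustrative placeholders rather than data from the text.

```python
# Minimal sketch: computing the tabulated classification metrics for a
# binary classifier with scikit-learn; the input arrays are made up.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score,
                             confusion_matrix, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                   # ground-truth labels
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])   # predicted P(y = 1)
y_pred = (y_prob >= 0.5).astype(int)                           # hard labels at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()      # confusion-matrix counts
print("Accuracy         :", accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print("Precision        :", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall           :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score         :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("AUC-ROC          :", roc_auc_score(y_true, y_prob))     # uses probabilities, not hard labels
```

Note that AUC-ROC (and AUC-PR) are computed from predicted probabilities or scores, whereas the other metrics above are computed from hard labels obtained at a fixed threshold.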
Loss Functions
The following loss functions are commonly used in classification tasks; a brief implementation sketch follows the table.
Metric | Purpose | Use Case |
---|---|---|
Cross-Entropy Loss \( \displaystyle -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \) | Measures the performance of a classification model whose output is a probability value between 0 and 1. It increases as the predicted probability diverges from the actual label. | Commonly used in classification tasks with probabilistic outputs. |
Binary Cross-Entropy \( \displaystyle -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \) | Used for binary classification tasks, measuring the difference between two probability distributions. | Commonly used in binary classification problems. |
Categorical Cross-Entropy \( \displaystyle -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) \) | Used when there are two or more label classes. It is a generalization of binary cross-entropy to multi-class problems. | Suitable for multi-class classification tasks with one-hot encoded labels. |
Sparse Categorical Cross-Entropy \( \displaystyle -\sum_{i=1}^{N} \log(\hat{y}_{i,y_i}) \) | Similar to categorical cross-entropy but used when labels are provided as integers rather than one-hot encoded vectors. | Useful for multi-class classification with integer labels. |
Balanced Cross-Entropy \( \displaystyle -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1 y_i \log(\hat{y}_i) + w_0 (1 - y_i) \log(1 - \hat{y}_i) \right] \) | Adjusts the standard cross-entropy loss to account for class imbalance by weighting classes inversely proportional to their frequency. | Useful in imbalanced classification tasks. |
Kullback-Leibler Divergence \( \displaystyle D_{KL}(P \| Q) = \sum_{i} P(i) \log\left(\frac{P(i)}{Q(i)}\right) \) | Measures how one probability distribution diverges from a second, expected probability distribution. It is often used in variational autoencoders and other probabilistic models. | Useful in scenarios involving probabilistic models and distributions. |
Hinge Loss \( \displaystyle \sum_{i=1}^{N} \max(0, 1 - y_i \cdot \hat{y}_i) \) | Used for "maximum-margin" classification, primarily for support vector machines (SVMs). It is designed to ensure that the correct class is not only predicted but also separated from the decision boundary by a margin. | Effective for SVMs and tasks requiring a margin between classes. |
Focal Loss \( \displaystyle -\frac{1}{N} \sum_{i=1}^{N} \alpha_t (1 - p_t)^\gamma \log(p_t) \) | A modified version of cross-entropy loss that addresses class imbalance by down-weighting easy examples and focusing training on hard negatives. | Beneficial in scenarios with significant class imbalance, such as object detection. |
Multi-Class Log Loss \( \displaystyle -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) \) | Extends binary log loss to multi-class classification problems, penalizing incorrect predictions based on predicted probabilities. | Suitable for multi-class classification tasks. |
Hamming Loss \( \displaystyle \frac{1}{N} \sum_{i=1}^N \frac{1}{L} \sum_{j=1}^L \mathbf{1}(y_{ij} \neq \hat{y}_{ij}) \) | Measures the fraction of incorrect labels to the total number of labels, useful for multi-label classification tasks. | Effective for multi-label classification scenarios. |
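As a concrete illustration, here is a minimal NumPy sketch of two of these losses, binary cross-entropy and focal loss, written directly from the formulas above; the sample arrays and the default values of `alpha` and `gamma` are assumptions for the example, not values prescribed by the text.

```python
# Minimal NumPy sketch of binary cross-entropy and focal loss.
# y_true holds labels in {0, 1}; y_prob holds predicted probabilities of class 1.
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1.0 - eps)              # guard against log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def focal_loss(y_true, y_prob, alpha=0.25, gamma=2.0, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, y_prob, 1 - y_prob)       # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)     # per-class weighting
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.3, 0.1])
print("BCE  :", binary_cross_entropy(y_true, y_prob))
print("Focal:", focal_loss(y_true, y_prob))
```

With `gamma = 0` and `alpha = 0.5` the focal loss reduces (up to a constant factor) to standard binary cross-entropy, which shows how the \( (1 - p_t)^\gamma \) factor down-weights well-classified examples.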
Additional Notes
Explanation of ROC Curve (AUC-ROC)
An ROC curve plots the True Positive Rate (TPR, or sensitivity/recall) against the False Positive Rate (FPR) at various classification thresholds. It helps visualize the trade-off between sensitivity and specificity for a classifier:
- True Positive Rate (TPR): The proportion of actual positives correctly identified (TP / (TP + FN)).
- False Positive Rate (FPR): The proportion of actual negatives incorrectly classified as positives (FP / (FP + TN)).
- The Area Under the Curve (AUC) quantifies the overall performance, with AUC = 1 indicating a perfect classifier and AUC = 0.5 indicating a random classifier.
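Below is a minimal sketch of how the ROC curve and its AUC might be traced in practice with scikit-learn (assumed available); `y_true` and `y_score` are illustrative placeholders.

```python
# Minimal sketch: ROC curve points and AUC for a binary classifier.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])                   # ground-truth labels
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])   # predicted scores / P(y = 1)

# Each threshold yields one (FPR, TPR) point on the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
auc = roc_auc_score(y_true, y_score)                 # area under the (FPR, TPR) curve

for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print("AUC =", auc)
```

Sweeping the threshold from high to low trades specificity for sensitivity; a curve that hugs the top-left corner (AUC close to 1) indicates a classifier that separates the classes well.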