A Guide to Computing the Average Confusion Matrix for k-Fold Cross-Validation

In machine learning, evaluating the performance of a classification model is essential. A commonly used technique is k-fold cross-validation, which splits the dataset into k subsets (folds) and runs k rounds of training and testing, each time holding out a different fold for testing. Averaging the confusion matrices across these rounds summarizes the model's overall performance and shows which classes it tends to confuse. This article walks through several ways to compute the average confusion matrix, in plain language and with code examples.

Method 1: manual calculation
The simplest way to compute the average confusion matrix is to sum the confusion matrices obtained from each fold element-wise and divide the result by k. All matrices must have the same shape, with the classes in the same order. The following snippet illustrates this:

import numpy as np
# Assuming you have obtained confusion matrices for each fold
confusion_matrices = [...]  # Replace [...] with your confusion matrices
# Initialize the sum matrix
sum_matrix = np.zeros_like(confusion_matrices[0], dtype=float)
# Sum up the matrices
for matrix in confusion_matrices:
    sum_matrix += matrix
# Calculate the average matrix
average_matrix = sum_matrix / len(confusion_matrices)
print("Average Confusion Matrix:")
print(average_matrix)
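
As a more compact alternative, NumPy can average the stacked matrices directly. The sketch below uses two made-up 2x2 fold matrices purely for illustration; they are not results from any real model.

```python
import numpy as np

# Hypothetical confusion matrices from two folds (illustrative values only)
confusion_matrices = [
    np.array([[5, 1], [2, 4]]),
    np.array([[3, 3], [0, 6]]),
]

# Stack into shape (k, n_classes, n_classes) and average over the fold axis
average_matrix = np.mean(np.stack(confusion_matrices), axis=0)
print("Average Confusion Matrix:")
print(average_matrix)
```

For these two inputs the averaged entries are 4 and 2 in the first row and 1 and 5 in the second. Note that the result is a float array, since an average of counts is generally not an integer.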

Method 2: sklearn.metrics.confusion_matrix
The scikit-learn library provides a convenient function for computing a confusion matrix. Calling it inside a k-fold cross-validation loop and accumulating the results gives the average matrix. Here is an example:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
# Assuming you have a classifier 'clf' and NumPy arrays 'X' (feature matrix) and 'y' (target vector)
# Create the k-fold splitter; n_splits=5 is only an example value
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Fix the label order so that every fold matrix has the same shape,
# even if some class is missing from a fold
labels = np.unique(y)
# Initialize the sum matrix
sum_matrix = None
# Perform k-fold cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train the classifier
    clf.fit(X_train, y_train)
    # Predict the labels
    y_pred = clf.predict(X_test)
    # Compute the confusion matrix for the current fold
    fold_matrix = confusion_matrix(y_test, y_pred, labels=labels)
    # Accumulate the matrices
    if sum_matrix is None:
        sum_matrix = fold_matrix
    else:
        sum_matrix += fold_matrix
# Calculate the average matrix
average_matrix = sum_matrix / kf.get_n_splits(X)
print("Average Confusion Matrix:")
print(average_matrix)
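
The same average can also be obtained from out-of-fold predictions. The sketch below assumes the iris dataset and a LogisticRegression classifier, both chosen only for illustration. cross_val_predict returns one prediction per sample, made by a model that did not see that sample during training, so the confusion matrix of those predictions corresponds to the sum of the per-fold matrices, and dividing it by k gives the average.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict

# Illustrative data and model (assumptions, not part of the original article)
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

k = 5  # example number of folds
kf = KFold(n_splits=k, shuffle=True, random_state=0)

# One out-of-fold prediction per sample
y_pred = cross_val_predict(clf, X, y, cv=kf)

# Pooled matrix over all folds, then the per-fold average
pooled_matrix = confusion_matrix(y, y_pred, labels=np.unique(y))
average_matrix = pooled_matrix / k
print("Average Confusion Matrix:")
print(average_matrix)
```

The pooled matrix is often more convenient to report than the average, because its entries are whole sample counts that add up to the size of the dataset.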

Method 3: cross-validation with StratifiedKFold
Another way to compute the average confusion matrix uses the StratifiedKFold class from scikit-learn. Stratification makes each fold preserve approximately the same class distribution as the full dataset, which matters most when the classes are imbalanced.
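
To see what stratification does, the sketch below splits a made-up imbalanced label vector (90 samples of class 0 and 10 of class 1, chosen only for illustration) and counts the classes in each test fold. The feature matrix is a placeholder, since only the labels drive the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features, not used by the splitter

skf = StratifiedKFold(n_splits=5)

# Count how many samples of each class land in every test fold
fold_counts = [
    np.bincount(y[test_index], minlength=2)
    for _, test_index in skf.split(X, y)
]
for i, counts in enumerate(fold_counts):
    print(f"fold {i}: class counts in test set = {counts}")
```

Each of the five test folds ends up with 18 samples of class 0 and 2 of class 1, the same 9:1 ratio as the full label vector.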

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
# Assuming you have a classifier 'clf' and NumPy arrays 'X' (feature matrix) and 'y' (target vector)
# Initialize the sum matrix
sum_matrix = None
# Number of folds; 5 is only an example value
k = 5
# Initialize the StratifiedKFold object with k value
skf = StratifiedKFold(n_splits=k)
# Perform k-fold cross-validation
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train the classifier
    clf.fit(X_train, y_train)
    # Predict the labels
    y_pred = clf.predict(X_test)
    # Compute the confusion matrix for the current fold
    fold_matrix = confusion_matrix(y_test, y_pred)
    # Accumulate the matrices
    if sum_matrix is None:
        sum_matrix = fold_matrix
    else:
        sum_matrix += fold_matrix
# Calculate the average matrix
average_matrix = sum_matrix / k
print("Average Confusion Matrix:")
print(average_matrix)
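
The loop above can be wrapped into a reusable function. The following is a minimal sketch, not a definitive implementation: the function name average_confusion_matrix, the iris dataset, and the DecisionTreeClassifier are all assumptions made for illustration. It passes an explicit labels list to confusion_matrix so that every fold matrix has the same shape, and it clones the classifier so that each fold starts from an unfitted model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def average_confusion_matrix(clf, X, y, k=5):
    """Average the per-fold confusion matrices over k stratified folds."""
    labels = np.unique(y)  # fixed label order, same matrix shape in every fold
    skf = StratifiedKFold(n_splits=k)
    sum_matrix = np.zeros((len(labels), len(labels)))
    for train_index, test_index in skf.split(X, y):
        model = clone(clf)  # fresh, unfitted copy for each fold
        model.fit(X[train_index], y[train_index])
        y_pred = model.predict(X[test_index])
        sum_matrix += confusion_matrix(y[test_index], y_pred, labels=labels)
    return sum_matrix / k


# Illustrative usage (dataset and classifier are assumptions)
X, y = load_iris(return_X_y=True)
average_matrix = average_confusion_matrix(
    DecisionTreeClassifier(random_state=0), X, y, k=5
)
print("Average Confusion Matrix:")
print(average_matrix)
```

Because iris has 50 samples per class and the folds are stratified, every row of the averaged matrix sums to 10, the number of samples of that class in each test fold.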