Discrepancy between keras-metrics and scikit-learn #45

Open
david-b-6 opened this issue Aug 30, 2019 · 3 comments

david-b-6 commented Aug 30, 2019

Hi all,

Wondering if you might be able to shed some light on what's going on here. Is this a bug? Thanks.

I'm using:
tensorflow gpu 1.13.1
keras 2.2.4 (latest, pip-installed from the GitHub repo)
keras-metrics 1.1.0
numpy 1.16.4
scikit-learn 0.21.2

Here's the situation...

I'm training a ResNet on a multiclass problem (seven classes total). I'm trying to track the precision, recall and F1 for each class at each epoch. If I compare the validation-set metrics from the last epoch with the values that scikit-learn calculates in its classification report after calling predict, they are vastly different.

For example, after 3 epochs the precision, recall and F1 of each class in the validation set are:

val_precision: 0.5000
val_precision_1: 0.3333
val_precision_2: 0.6000
val_precision_3: 0.3333
val_precision_4: 0.5641
val_precision_5: 0.8972
val_precision_6: 0.3500

val_recall: 0.0312
val_recall_1: 0.0196
val_recall_2: 0.0275
val_recall_3: 0.0909
val_recall_4: 0.1982
val_recall_5: 0.8075
val_recall_6: 0.5000

val_f1_score: 0.0588
val_f1_score_1: 0.0370
val_f1_score_2: 0.0526
val_f1_score_3: 0.1429
val_f1_score_4: 0.2933
val_f1_score_5: 0.8500
val_f1_score_6: 0.4118

But scikit-learn's confusion matrix and classification report show:

Confusion matrix
[[ 0 0 28 0 4 0 0]
[ 0 0 44 0 7 0 0]
[ 0 0 102 0 7 0 0]
[ 0 0 11 0 0 0 0]
[ 0 0 99 0 12 0 0]
[ 0 0 657 0 13 0 0]
[ 0 0 14 0 0 0 0]]

Classification Report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        32
           1       0.00      0.00      0.00        51
           2       0.11      0.94      0.19       109
           3       0.00      0.00      0.00        11
           4       0.28      0.11      0.16       111
           5       0.00      0.00      0.00       670
           6       0.00      0.00      0.00        14

    accuracy                           0.11       998
   macro avg       0.06      0.15      0.05       998
weighted avg       0.04      0.11      0.04       998

Here's my code:

import numpy as np
np.random.seed(1)

import tensorflow as tf
tf.set_random_seed(1)

import random as rn
rn.seed(1)

import keras
from keras import layers, models, optimizers
from keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix, classification_report
from keras_applications.resnet import ResNet50
from math import ceil
import keras_metrics as km



train_images = np.load('path to tensor')
train_labels = np.load('path to tensor')

validation_images = np.load('path to tensor')
validation_labels = np.load('path to tensor')

input_height = 150
input_width = 150
input_depth = 3

num_train_images = len(train_images)
num_validation_images = len(validation_images)


steps_per_epoch = ceil(num_train_images / 32)
validation_steps = ceil(num_validation_images / 32)


train_labels = keras.utils.to_categorical(train_labels, 7)
validation_labels = keras.utils.to_categorical(validation_labels, 7)


train_datagen = ImageDataGenerator(rescale=1./255,
                                   dtype='float32')


val_datagen = ImageDataGenerator(rescale=1./255,
                                 dtype='float32')

train_datagen.fit(train_images)
val_datagen.fit(validation_images)


train_generator = train_datagen.flow(train_images,
                                     train_labels,
                                     batch_size=32)

validation_generator = val_datagen.flow(validation_images,
                                        validation_labels,
                                        batch_size=32)

pretrained = ResNet50(weights='imagenet',
                     backend=keras.backend,
                     layers=keras.layers,
                     models=keras.models,
                     utils=keras.utils,
                     include_top=False,
                     input_shape=(input_height, input_width, input_depth))


model = models.Sequential()
model.add(pretrained)
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(7, activation='softmax'))


model.compile(optimizer=optimizers.RMSprop(lr=0.00001),
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy',
                       km.categorical_precision(label=0),
                       km.categorical_precision(label=1),
                       km.categorical_precision(label=2),
                       km.categorical_precision(label=3),
                       km.categorical_precision(label=4),
                       km.categorical_precision(label=5),
                       km.categorical_precision(label=6),
                       km.categorical_recall(label=0),
                       km.categorical_recall(label=1),
                       km.categorical_recall(label=2),
                       km.categorical_recall(label=3),
                       km.categorical_recall(label=4),
                       km.categorical_recall(label=5),
                       km.categorical_recall(label=6),
                       km.categorical_f1_score(label=0),
                       km.categorical_f1_score(label=1),
                       km.categorical_f1_score(label=2),
                       km.categorical_f1_score(label=3),
                       km.categorical_f1_score(label=4),
                       km.categorical_f1_score(label=5),
                       km.categorical_f1_score(label=6)])

with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    history = model.fit_generator(train_generator,
                              steps_per_epoch=steps_per_epoch,
                              epochs=3,
                              validation_data=validation_generator,
                              validation_steps=validation_steps,
                              shuffle=True,
                              verbose=1)

    predictions = model.predict(validation_images)

    predicted_classes = np.argmax(predictions, axis=1)

    validation_labels = np.argmax(validation_labels, axis=1)

    c_matrix = confusion_matrix(validation_labels, predicted_classes)
    print(c_matrix)

    report = classification_report(validation_labels, predicted_classes)
    print(report)

ybubnov commented Sep 2, 2019

@david-b-6, thank you for the issue. In the code above I don't see how you print the metrics from the keras-metrics package; there is only an evaluation through sklearn.
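
For reference, a minimal sketch (hypothetical, reusing the model and validation_generator from the code above) of how the keras-metrics values could be printed after training; evaluate_generator returns one value per entry in model.metrics_names:

scores = model.evaluate_generator(validation_generator, steps=validation_steps)
# Each returned score is paired with the name of the metric it belongs to
for name, value in zip(model.metrics_names, scores):
    print(name, value)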


ybubnov commented Sep 2, 2019

I've extended the unit tests to perform cross-validation against sklearn metrics: #46


ybubnov commented Sep 2, 2019

It seems I understand your confusion now, let me explain.

keras-metrics metrics are implemented as regular layers of the model, so they are part of the model's execution graph. Whenever you call fit on the model, all components of that graph are executed, including the metrics.

Given that, the keras-metrics results only make sense to compare with the sklearn results when you evaluate the model, not during fitting.

Don't be confused by the values printed during model fitting; they are just a by-product of executing the model's graph.
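
To illustrate, here is a rough sketch of the comparison I mean, reusing the arrays from the code above (assuming validation_labels is still the one-hot array and that the inputs get the same 1./255 rescaling the generators apply):

from sklearn.metrics import precision_recall_fscore_support

x_val = validation_images / 255.0  # same rescaling the ImageDataGenerator applies

# keras-metrics values, computed by the model's own graph at evaluation time
scores = model.evaluate(x_val, validation_labels, batch_size=32)
print(dict(zip(model.metrics_names, scores)))

# sklearn values on the very same predictions
y_pred = np.argmax(model.predict(x_val, batch_size=32), axis=1)
y_true = np.argmax(validation_labels, axis=1)
print(precision_recall_fscore_support(y_true, y_pred))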
