Discrepancy between keras-metrics and scikit-learn #45

Open
david-b-6 opened this issue Aug 30, 2019 · 3 comments

david-b-6 commented Aug 30, 2019

Hi all,

Wondering if you might be able to shed some light on what's going on here. Is this a bug? Thanks.

I'm using:
tensorflow gpu 1.13.1
keras 2.2.4 (latest, pip-installed from the GitHub repo)
keras-metrics 1.1.0
numpy 1.16.4
scikit-learn 0.21.2

Here's the situation...

I'm training a ResNet on a multiclass problem (seven classes total). I'm trying to track the precision, recall and F1 for each class at each epoch. If I compare the validation-set metrics from the last epoch with the values that scikit-learn calculates in its classification report after calling predict, they are vastly different.

For example, after 3 epochs the precision, recall and F1 of each class in the validation set are:

val_precision: 0.5000
val_precision_1: 0.3333
val_precision_2: 0.6000
val_precision_3: 0.3333
val_precision_4: 0.5641
val_precision_5: 0.8972
val_precision_6: 0.3500

val_recall: 0.0312
val_recall_1: 0.0196
val_recall_2: 0.0275
val_recall_3: 0.0909
val_recall_4: 0.1982
val_recall_5: 0.8075
val_recall_6: 0.5000

val_f1_score: 0.0588
val_f1_score_1: 0.0370
val_f1_score_2: 0.0526
val_f1_score_3: 0.1429
val_f1_score_4: 0.2933
val_f1_score_5: 0.8500
val_f1_score_6: 0.4118

But scikit-learn's confusion matrix and classification report show:

Confusion matrix
[[ 0 0 28 0 4 0 0]
[ 0 0 44 0 7 0 0]
[ 0 0 102 0 7 0 0]
[ 0 0 11 0 0 0 0]
[ 0 0 99 0 12 0 0]
[ 0 0 657 0 13 0 0]
[ 0 0 14 0 0 0 0]]

Classification Report
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        32
           1       0.00      0.00      0.00        51
           2       0.11      0.94      0.19       109
           3       0.00      0.00      0.00        11
           4       0.28      0.11      0.16       111
           5       0.00      0.00      0.00       670
           6       0.00      0.00      0.00        14

    accuracy                           0.11       998
   macro avg       0.06      0.15      0.05       998
weighted avg       0.04      0.11      0.04       998

Here's my code:

import numpy as np
np.random.seed(1)

import tensorflow as tf
tf.set_random_seed(1)

import random as rn
rn.seed(1)

import keras
from keras import layers, models, optimizers
from keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix, classification_report
from keras_applications.resnet import ResNet50
from math import ceil
import keras_metrics as km



train_images = np.load('path to tensor')
train_labels = np.load('path to tensor')

validation_images = np.load('path to tensor')
validation_labels = np.load('path to tensor')

input_height = 150
input_width = 150
input_depth = 3

num_train_images = len(train_images)
num_validation_images = len(validation_images)


steps_per_epoch = ceil(num_train_images / 32)
validation_steps = ceil(num_validation_images / 32)


train_labels = keras.utils.to_categorical(train_labels, 7)
validation_labels = keras.utils.to_categorical(validation_labels, 7)


train_datagen = ImageDataGenerator(rescale=1./255,
                                   dtype='float32')


val_datagen = ImageDataGenerator(rescale=1./255,
                                 dtype='float32')

train_datagen.fit(train_images)
val_datagen.fit(validation_images)


train_generator = train_datagen.flow(train_images,
                                     train_labels,
                                     batch_size=32)

validation_generator = val_datagen.flow(validation_images,
                                        validation_labels,
                                        batch_size=32)

pretrained = ResNet50(weights='imagenet',
                     backend=keras.backend,
                     layers=keras.layers,
                     models=keras.models,
                     utils=keras.utils,
                     include_top=False,
                     input_shape=(input_height, input_width, input_depth))


model = models.Sequential()
model.add(pretrained)
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(7, activation='softmax'))


model.compile(optimizer=optimizers.RMSprop(lr=0.00001),
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy',
                       km.categorical_precision(label=0),
                       km.categorical_precision(label=1),
                       km.categorical_precision(label=2),
                       km.categorical_precision(label=3),
                       km.categorical_precision(label=4),
                       km.categorical_precision(label=5),
                       km.categorical_precision(label=6),
                       km.categorical_recall(label=0),
                       km.categorical_recall(label=1),
                       km.categorical_recall(label=2),
                       km.categorical_recall(label=3),
                       km.categorical_recall(label=4),
                       km.categorical_recall(label=5),
                       km.categorical_recall(label=6),
                       km.categorical_f1_score(label=0),
                       km.categorical_f1_score(label=1),
                       km.categorical_f1_score(label=2),
                       km.categorical_f1_score(label=3),
                       km.categorical_f1_score(label=4),
                       km.categorical_f1_score(label=5),
                       km.categorical_f1_score(label=6)])

with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    history = model.fit_generator(train_generator,
                              steps_per_epoch=steps_per_epoch,
                              epochs=3,
                              validation_data=validation_generator,
                              validation_steps=validation_steps,
                              shuffle=True,
                              verbose=1)

    predictions = model.predict(validation_images)

    predicted_classes = np.argmax(predictions, axis=1)

    validation_labels = np.argmax(validation_labels, axis=1)

    c_matrix = confusion_matrix(validation_labels, predicted_classes)
    print(c_matrix)

    report = classification_report(validation_labels, predicted_classes)
    print(report)

ybubnov commented Sep 2, 2019

@david-b-6, thank you for the issue. In the code above I don't see how you print the metrics from the keras-metrics package; there is only an evaluation through sklearn.
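
For reference, a minimal sketch (hypothetical, reusing the model and validation_generator from the code above) of how the keras-metrics values could be printed after training; evaluate_generator returns one value per entry in model.metrics_names:

scores = model.evaluate_generator(validation_generator, steps=validation_steps)
# Each returned score is paired with the name of the metric it belongs to
for name, value in zip(model.metrics_names, scores):
    print(name, value)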


ybubnov commented Sep 2, 2019

I've extended the unit tests to perform cross-validation against sklearn metrics: #46


ybubnov commented Sep 2, 2019

It seems I understand your confusion now, let me explain.

keras-metrics metrics are implemented as regular layers of the model, so they are part of the model's execution graph. Whenever you call fit on the model, all components of that graph are executed, including the metrics.

Given that, the keras-metrics results only make sense to compare with the sklearn results when you evaluate the model, not during fitting.

Don't be confused by the values printed during model fitting; they are just a by-product of executing the model's graph.
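
To illustrate, here is a rough sketch of the comparison I mean, reusing the arrays from the code above (assuming validation_labels is still the one-hot array and that the inputs get the same 1./255 rescaling the generators apply):

from sklearn.metrics import precision_recall_fscore_support

x_val = validation_images / 255.0  # same rescaling the ImageDataGenerator applies

# keras-metrics values, computed by the model's own graph at evaluation time
scores = model.evaluate(x_val, validation_labels, batch_size=32)
print(dict(zip(model.metrics_names, scores)))

# sklearn values on the very same predictions
y_pred = np.argmax(model.predict(x_val, batch_size=32), axis=1)
y_true = np.argmax(validation_labels, axis=1)
print(precision_recall_fscore_support(y_true, y_pred))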
