ImageClassifier with ExponentialLRDecay: metrics not updated/calculated during validation #4807


Closed
gartangh opened this issue Feb 7, 2020 · 7 comments · Fixed by #5255
Labels: bug (something isn't working) · image (bugs related to image datatype tasks) · P1 (priority of the issue for triage purposes: needs to be fixed soon)


@gartangh

gartangh commented Feb 7, 2020

System information

  • OS version: Windows 10 Pro 18363
  • .NET Version: Core 2.1
  • Platform: x64
  • ML.NET version: 0.15-preview

Issue

  • What did you do?
    I am trying to train an image classifier that makes use of ExponentialLRDecay.
    I would like to see the metrics for training and validation for each epoch.

    var options = new ImageClassificationTrainer.Options()
    {
      LearningRateScheduler = new ExponentialLRDecay(),
      ValidationSetFraction = 0.1f,
      MetricsCallback = (metrics) => Console.WriteLine(metrics + $"   CrossEntropy: {metrics.Train.CrossEntropy}, LearningRate: {metrics.Train.LearningRate}")
    };
    var model = mlContext.MulticlassClassification.Trainers.ImageClassification(options);

    (I appended + $" CrossEntropy: {metrics.Train.CrossEntropy}, LearningRate: {metrics.Train.LearningRate}" because the cross-entropy and learning rate are not printed by default for the validation set.)

  • What happened?

    • The learning rate is not updated for the validation set (seen on every even row in the image).
    • The cross-entropy is not calculated for the validation set (seen on every even row in the image).
    • The learning rate is not first updated after the second epoch, as the default value of 2 for numEpochsPerDecay in ExponentialLRDecay() suggests, but already after the first (seen on the third row in the image). After that, the learning rate is correctly updated every 2 epochs. I'm not sure whether this is the expected behavior.

    (screenshot: per-epoch training/validation metrics output)

  • What did you expect?

    • I expected a decaying learning rate in the validation step, equal to the one in the training step.
    • I expected the cross-entropy to be calculated in the validation step. The model with the highest accuracy and lowest cross-entropy is the best, so if two models perform equally well in terms of accuracy, the one with the lower cross-entropy on the validation set should be picked.
    • Further, I expected the learning rate to start decaying after the 2nd epoch.
@antoniovs1029 antoniovs1029 added bug Something isn't working P1 Priority of the issue for triage purpose: Needs to be fixed soon. labels Feb 7, 2020
@gartangh

gartangh commented Feb 7, 2020

After looking into some blog posts, I found that my point about the learning rate decaying too soon might be invalid.
Typically, the learning rate indeed decays at epoch n-1 when epochs are counted from 0 and numEpochsPerDecay=n.
So with numEpochsPerDecay=2, epoch 0 has the original learning rate and epoch 1 is already decayed; with numEpochsPerDecay=10, epochs 0-8 have the original learning rate and epoch 9 is the first to be decayed.
This is not really intuitive to me, but it might be correct!
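To make that schedule concrete, here is a small self-contained sketch of a staircase exponential decay that reproduces the timing described above. The formula, initial learning rate, and decay rate are my assumptions for illustration, not ML.NET's actual ExponentialLRDecay implementation (which operates on global steps rather than whole epochs):

```csharp
using System;

class LrDecaySketch
{
    // Hypothetical staircase exponential decay, counting epochs from 0:
    //   lr(epoch) = initialLr * decayRate^floor((epoch + 1) / numEpochsPerDecay)
    // With this convention the first drop appears at epoch n - 1.
    public static double DecayedLr(double initialLr, double decayRate, int numEpochsPerDecay, int epoch)
    {
        int exponent = (epoch + 1) / numEpochsPerDecay; // integer division acts as floor
        return initialLr * Math.Pow(decayRate, exponent);
    }

    static void Main()
    {
        // Hypothetical values; ML.NET's defaults may differ.
        for (int epoch = 0; epoch < 5; epoch++)
            Console.WriteLine($"epoch {epoch}: lr = {DecayedLr(0.01, 0.5, 2, epoch)}");
        // epoch 0 keeps 0.01; epochs 1-2 get 0.005; epochs 3-4 get 0.0025.
    }
}
```

With numEpochsPerDecay=2 this yields the pattern observed above: the original rate only at epoch 0, then a decay every 2 epochs.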

@antoniovs1029 antoniovs1029 self-assigned this Feb 7, 2020
@harishsk harishsk added the image Bugs related image datatype tasks label Apr 29, 2020
@antoniovs1029 antoniovs1029 removed their assignment Jun 16, 2020
@mstfbl mstfbl self-assigned this Jun 18, 2020
@gartangh

gartangh commented Jun 18, 2020

I created a minimal working example based on the DeepLearning_ImageClassification_Training example, that reproduces the bug here.
I updated to version 1.5.0, changed the ImageClassificationTrainer.Options, and added a custom MetricsCallback.

  • The initial learning rate is still returned for the validation set. (In version 1.4.0, it did return the correct learning rate.)
  • The cross-entropy is still not calculated for the validation set.
  • The learning rate might decay as intended. (In version 1.4.0, it did decay after n epochs, instead of n-1, but that might not have been as intended.)

@mstfbl

mstfbl commented Jun 23, 2020

Hi @gartangh ,

Thank you so much for bringing this issue to our attention, and providing a good repro. I successfully replicated your repro with our local ML.NET build and confirmed the issue.

I figured out that the problem with cross-entropy during validation is that it simply isn't being updated in the validation metrics; the update is missing from the code snippet below:

if (statisticsCallback != null)
{
    metrics.Train.Epoch = epoch;
    metrics.Train.Accuracy /= metrics.Train.BatchProcessedCount;
    // Note: there is no corresponding update of metrics.Train.CrossEntropy here.
    metrics.Train.DatasetUsed = ImageClassificationMetrics.Dataset.Validation;
    statisticsCallback(metrics);
}

The reason the learning rate is not decreasing during validation is that learning rate schedulers, of which ExponentialLRDecay is one, are currently not used in validation. The null argument below for TrainAndEvaluateClassificationLayerCore is where a learning rate scheduler would go:

TrainAndEvaluateClassificationLayerCore(epoch, learningRate, featureFileStartOffset,
    metrics, labelTensorShape, featureTensorShape, batchSize,
    validationSetLabelReader, validationSetFeatureReader, labelBuffer, featuresBuffer,
    labelBufferSizeInBytes, featureBufferSizeInBytes, featureFileRecordSize, null, // <-- scheduler slot
    trainState, validationEvalRunner, featureBufferPtr, labelBufferPtr,
    (outputTensors, metrics) =>
    {
        outputTensors[0].ToScalar(ref accuracy);
        metrics.Train.Accuracy += accuracy;
        outputTensors[0].Dispose();
    });

I don't see a reason why we shouldn't support learning rate schedulers during validation, and as a result I believe the learning rate during validation should also decrease when ExponentialLRDecay is used, rather than remain fixed (currently it stays at 0.01). Is my intuition correct @antoniovs1029 @harishsk ?

I'll be making a PR to address this issue and add tests to verify the changes soon.

In addition, I see that you had to manually log the learning rate and cross-entropy to observe this bug; these metrics are not reported during validation by default:

public override string ToString()
{
    if (DatasetUsed == ImageClassificationMetrics.Dataset.Train)
        return $"Phase: Training, Dataset used: {DatasetUsed.ToString(),10}, Batch Processed Count: {BatchProcessedCount,3}, Learning Rate: {LearningRate,10} " +
            $"Epoch: {Epoch,3}, Accuracy: {Accuracy,10}, Cross-Entropy: {CrossEntropy,10}";
    else
        return $"Phase: Training, Dataset used: {DatasetUsed.ToString(),10}, Batch Processed Count: {BatchProcessedCount,3}, " +
            $"Epoch: {Epoch,3}, Accuracy: {Accuracy,10}";
}

I wonder if this gives a hint on whether or not learning rates should be updated during validation with ExponentialLRDecay. I'm in favor of printing these metrics during validation as well.

@gartangh

Hi @mstfbl ,

You're welcome. I'm happy that I could be of help.

My guess is that the main part of this issue could be solved by copying that part of the code from training to validation?

About the learning rate during validation: one option would be to remove the LearningRate field from ImageClassificationMetrics.Train during validation, as it is not actually used. It was just very confusing to me that for training, the decay was visible, while for validation, it remained fixed at the initial value.
The other option would be as you say, but then it must be ensured that the learning rate does not decay twice as fast because of the extra scheduler updates during validation.
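That double-decay concern can be illustrated with a toy step-based scheduler. This is purely hypothetical code for illustration, not ML.NET's ExponentialLRDecay implementation:

```csharp
using System;

// Toy step-based scheduler: if the same instance is also stepped during
// validation, the decay effectively runs twice as fast. Hypothetical sketch.
class StepDecay
{
    private readonly double _initialLr, _decayRate;
    private readonly int _stepsPerDecay;
    private int _step;

    public StepDecay(double initialLr, double decayRate, int stepsPerDecay)
    {
        _initialLr = initialLr;
        _decayRate = decayRate;
        _stepsPerDecay = stepsPerDecay;
    }

    // Returns the current learning rate and advances the step counter.
    public double Next() => _initialLr * Math.Pow(_decayRate, _step++ / _stepsPerDecay);
}

class Demo
{
    static void Main()
    {
        var trainOnly = new StepDecay(0.01, 0.5, 2); // stepped once per epoch
        var shared = new StepDecay(0.01, 0.5, 2);    // stepped twice per epoch
        for (int epoch = 0; epoch < 4; epoch++)
        {
            double a = trainOnly.Next();             // training step only
            double b = shared.Next(); shared.Next(); // training + validation step
            Console.WriteLine($"epoch {epoch}: train-only lr = {a}, shared lr = {b}");
        }
        // The shared scheduler ends up halving its learning rate every epoch
        // instead of every second epoch.
    }
}
```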

About your addition: if the cross-entropy is correctly updated, it should certainly be reported as well. It would be really awesome if the output signature were exactly the same, i.e. also reporting the learning rate.
Then, the ToString method could look like this:

public override string ToString()
{
    return $"Phase: Training, Dataset used: {DatasetUsed.ToString(),10}, Batch Processed Count: {BatchProcessedCount,3}, Learning Rate: {LearningRate,10} " +
        $"Epoch: {Epoch,3}, Accuracy: {Accuracy,10}, Cross-Entropy: {CrossEntropy,10}";
}

@mstfbl

mstfbl commented Jul 2, 2020

Hi @gartangh,

Thank you once again for notifying us about this issue. My now-merged PR #5255 added cross-entropy metric support for validation. I did not add learning rate decay support for validation, as no actual training is done during validation, so learning rate decay is not applicable there.

@gartangh

gartangh commented Jul 5, 2020

Hi @mstfbl

I just verified your changes by building the master branch locally.
This was the output:
(screenshot: metrics output after the fix)
This looks really good!
The cross-entropy and learning rate decay seem to work correctly.
Printing the learning rate during validation would probably be too confusing, so I understand why it is not used.

Thank you very much for your work.

@mstfbl

mstfbl commented Jul 6, 2020

Of course, anytime @gartangh ! Please feel free to make more issues/feature requests if you see any in the future, and I hope you continue to enjoy using ML.NET!

@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022