
Fix bug in distributed mnist training using the MirroredStrategy API #5183


Merged
2 commits merged into tensorflow:master on Aug 28, 2018

Conversation

parkjaeman
Contributor

I tried to run distributed TensorFlow training on MNIST, but it did not work, so I fixed the problem using the MirroredStrategy API.

@parkjaeman parkjaeman requested review from karmel and a team as code owners August 24, 2018 09:37
@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@parkjaeman
Contributor Author

I signed it!

@googlebot

CLAs look good, thanks!

@karmel karmel requested review from robieta and guptapriya and removed request for karmel August 24, 2018 16:16
@karmel
Contributor

karmel commented Aug 24, 2018

Thanks for the PR, @parkjaeman. @robieta, @guptapriya -- can you take a look?

Contributor

@robieta robieta left a comment


Two broad changes to make:

  1. Remove the multi_gpu flag and replace it with num_gpus. (See other models such as wide_deep and resnet.) This will also let you use utils.misc.distribution_utils.get_distribution_strategy() to get the distribution strategy (see the sketch after this list).

  2. Remove TowerOptimizer; it is not needed once replicate_model_fn is removed.
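
For reference, a minimal sketch of how that wiring might look in mnist.py (flags_obj, model_fn, and the exact get_distribution_strategy signature here are assumptions based on how the other official models are set up, not the final code of this PR):

```python
import tensorflow as tf

from official.utils.misc import distribution_utils

# Derive a DistributionStrategy from the --num_gpus flag instead of the
# old boolean multi_gpu flag; for multiple GPUs this yields MirroredStrategy.
distribution = distribution_utils.get_distribution_strategy(flags_obj.num_gpus)

# Handing the strategy to the Estimator via RunConfig replaces the old
# replicate_model_fn/TowerOptimizer wiring, so the plain optimizer can be
# used directly inside model_fn.
run_config = tf.estimator.RunConfig(train_distribute=distribution)

mnist_classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=flags_obj.model_dir,
    config=run_config)
```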

```diff
@@ -125,7 +125,10 @@ def model_fn(features, labels, mode, params):
   logits = model(image, training=True)
   loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
   accuracy = tf.metrics.accuracy(
+  if params.get('multi_gpu'):
+    accuracy = (tf.no_op(), tf.constant(0))
```
Contributor


This should not be necessary. If you are having issues make sure your version of tf-nightly is up to date.

Contributor Author

@parkjaeman parkjaeman Aug 25, 2018


@robieta, I applied your comments to this PR:

  • Remove multi_gpu
  • Remove TowerOptimizer
  • Replace MirroredStrategy with distribution_utils.get_distribution_strategy()

And I verified that mnist runs without error when the '--num_gpus' parameter is added.
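
For example, a two-GPU run would look something like this (assuming the standard official-models flag parsing; the flag value is illustrative):

```
python mnist.py --num_gpus=2
```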

Commits:
- Remove multi-gpu
- Remove TowerOptimizer
- Change from MirroredStrategy to distribution_utils.get_distribution_strategy
Contributor

@robieta robieta left a comment


LGTM. Thanks for looking into this.

@robieta robieta merged commit 6a0dda1 into tensorflow:master Aug 28, 2018