GLUE evaluation automation script #848


Merged: 17 commits, Mar 29, 2023

Conversation

susnato
Contributor

@susnato susnato commented Mar 14, 2023

What does this PR do?

Fixes #764

@chenmoneygithub

@susnato
Contributor Author

susnato commented Mar 14, 2023

Hi @chenmoneygithub, I uploaded the script, please check it. I tested it on some models and tasks, and they are working. My Colab GPU time limit is over for today; tomorrow I will train those models (except BERT and RoBERTa) for 2 epochs and share the link.

@susnato susnato changed the title from Glue nightly to GLUE evaluation automation script on Mar 14, 2023
@mattdangerw
Member

mattdangerw commented Mar 14, 2023

Sorry, I probably missed some of the discussion leading up to this, but what is the point of this script over glue.py? For long-term maintenance it seems better to either adapt glue.py to work with multiple models, or keep these evaluations as one-off modifications to that script that we don't land.

I can definitely see why we might want additional functionality around hyper-parameter search or easier deployment to a GCP instance, but as is, this seems too close to glue.py to merit a whole separate script. I will wait for @chenmoneygithub to weigh in.

@chenmoneygithub
Contributor

@susnato Thanks for the PR!

@mattdangerw The purpose is to make a lightweight benchmark that can be run with a single command, similar to this. We need it because sentiment analysis is too simple.

@susnato I think what we want is a lightweight benchmark script that reports accuracy and elapsed time on a selected GLUE task; let's just use MRPC. As Matt pointed out, the current script follows the examples/glue_benchmark style, which is too broad to serve as a lightweight benchmark. Ideally, with your PR checked in, we should be able to run this command to get the metrics we are interested in:

python3 keras_nlp/benchmarks/glue_mrpc.py \
  --model="BertClassifier" \
  --preset="bert_small_en_uncased" \
  --{other flags}

For the current PR, I think you can delete the code for tasks other than glue/mrpc, and we can rename the file to glue_mrpc.py. Also, to retrieve the model, you can copy this function: link

Thanks for your contribution! 🍷

@mattdangerw
Member

mattdangerw commented Mar 15, 2023

Got it! That makes sense to me! How about, instead of hard-coding a specific GLUE dataset, we add the task as a flag too?

python3 keras_nlp/benchmarks/glue.py \
  --model="BertClassifier" \
  --preset="bert_small_en_uncased" \
  --task_name="mrpc"

Totally fine with me if mrpc is the only supported flag option for now, but that way, long term, this can grow into our one-stop shop for GLUE performance benchmarking.

Member

@mattdangerw mattdangerw left a comment


Two high-level thoughts; I will leave the more detailed review to @chenmoneygithub.

Thanks very much!!

@@ -0,0 +1,193 @@
# Copyright 2023 The KerasNLP Authors
Member

Let's rename this to glue.py and take in mrpc as a task name.

Totally OK if we just error out for any other GLUE task right now, but that will allow us to grow this script into one that tests GLUE in its entirety.

Contributor Author

OK, then should I just remove the old glue.py? And I am sorry, I didn't fully understand "take in mrpc as a task name": should I include all tasks with mrpc as the default, or will it only support mrpc?

Contributor

@mattdangerw I think in the long term we will probably only use stsb (regression) and mrpc (classification). I am fine with naming it glue.py, but we should add a disclaimer in the header saying "Benchmark with [selected tasks]".

@susnato I believe what Matt is suggesting is:

  • rename glue_mrpc.py to glue.py
  • add a task flag that defaults to mrpc, and error out if any other task is passed
  • keep the code minimal for the mrpc task, and add comments noting that other tasks are not supported yet

This leaves room for scalability; a minimal sketch of the flag handling is shown below.
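For illustration only, here is a minimal sketch of what that task-flag handling could look like with absl flags; the flag name and error message are assumptions, not the PR's final code:

import sys

from absl import app, flags

FLAGS = flags.FLAGS
flags.DEFINE_string(
    "task", "mrpc", "The GLUE task to benchmark; only 'mrpc' is supported for now."
)


def main(_):
    # Error out for any GLUE task other than mrpc, as suggested above.
    if FLAGS.task != "mrpc":
        raise ValueError(
            f"GLUE task '{FLAGS.task}' is not supported yet; only 'mrpc' is available."
        )
    # ... load glue/mrpc, build the classifier, train, and report metrics ...


if __name__ == "__main__":
    app.run(main)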

backbone.trainable = False
else:
backbone.trainable = True
# If the model has pooled_output
Member

I would avoid this special-casing and actually use the XXClassifier classes. This will be important for things like BART, which is going to look substantially different (feed the input sequence to both the encoder and the decoder, and use the last token representation in the decoder block).

We could also have this script use the compilation defaults in the XXClassifier classes by default, which would be a great way to test them out and make sure they are reasonable defaults.
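As a rough sketch of that idea (the preset name, num_classes, and the train_ds/validation_ds variables below are illustrative, not taken from the PR):

import keras_nlp

# Build the task model directly; it bundles its own preprocessor.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_small_en_uncased", num_classes=2
)
# No explicit compile() call: the classifier ships with default loss,
# optimizer, and metrics, which is exactly what this script would exercise.
classifier.fit(train_ds, validation_data=validation_ds, epochs=2)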

Contributor Author

Thanks again for your comments; in the next commit I will change it to XXClassifier.

@susnato
Contributor Author

susnato commented Mar 15, 2023

Hi @mattdangerw, @chenmoneygithub, thanks for your comments!
I updated the script to load only mrpc and renamed it glue_mrpc.py. To retrieve the model, I adopted the function you suggested (with one small addition: it also loads the Preprocessor class for the respective model). Elapsed time is now reported as well.

Colab link showing that the script works (the version used in the Colab does not show elapsed time, but the recently committed version does): https://colab.research.google.com/drive/1eGMOXUF826ckuY5Tx03F2i8CsSSPYnbZ?usp=sharing

EDIT: I was about to send this comment when @mattdangerw suggested some changes in the meantime; waiting for your thoughts, @chenmoneygithub.

@chenmoneygithub
Contributor

@susnato Thanks! After you push the new commit, I will do a full review pass.

@susnato
Contributor Author

susnato commented Mar 17, 2023

Hi @chenmoneygithub, sorry for the delay. I have updated and renamed the script; please check it.

Contributor

@chenmoneygithub chenmoneygithub left a comment


Thanks for refactoring it! Overall it looks good! I dropped some comments.

@@ -11,10 +11,14 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import csv

# DISCLAIMER:This script only supports GLUE/mrpc (for now). #
Contributor

You can use the standard header format and include a one-sentence description of this file. For example:

""" GLUE benchmark script to test model performance.

To run the script, use this command:

{Put command here}

Disclaimer: This script only supports GLUE/mrpc (for now).
"""

"Learning rate",
)

flags.DEFINE_string("model", None, "The Model you want to train and evaluate.")
Contributor

You can use the same flag description as in benchmarks/sentiment_analysis.py: "The name of the classifier such as BertClassifier."

None,
"The name of TPU to connect to. If None, no TPU will be used. If you only "
"have one TPU, use `local`",
"The model preset(eg. For bert it is 'bert_base_en', 'bert_tiny_en_uncased')",
)
Contributor

We need another flag, mixed_precision_policy, see: link
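A hedged sketch of what such a flag might look like (the default value shown is an assumption):

from absl import flags
from tensorflow import keras

FLAGS = flags.FLAGS
flags.DEFINE_string(
    "mixed_precision_policy",
    "mixed_float16",
    "The global precision policy to use, e.g. 'mixed_float16' or 'float32'.",
)

# Later, before building the model:
keras.mixed_precision.set_global_policy(FLAGS.mixed_precision_policy)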

@@ -23,86 +27,39 @@

import keras_nlp

FLAGS = flags.FLAGS
seed = 42
os.environ["PYTHONHASHSEED"] = str(seed)
Contributor

Is this required?

for _preset in symbol.presets:
if preset and _preset != preset:
continue
if "Backbone" in name:
Contributor

I think we can use the same method as here to simplify the code. Later on, we can move this method to utils/ since we are reusing it.

Reading this function, I think you are trying to enable users to pass --model="Bert", which is actually a nice idea! But for now, let's still force users to pass "BertClassifier" as the flag so that we have a standardized workflow.
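For example, a minimal sketch of resolving the flag value to a class; the helper name and error handling are assumptions, not the reviewer's linked code:

import keras_nlp


def get_classifier_class(model_name):
    """Look up a classifier class such as BertClassifier by name."""
    if not hasattr(keras_nlp.models, model_name):
        raise ValueError(f"Unknown model: {model_name}")
    return getattr(keras_nlp.models, model_name)


classifier_cls = get_classifier_class("BertClassifier")  # e.g. FLAGS.model
classifier = classifier_cls.from_preset("bert_small_en_uncased", num_classes=2)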

print("GPU available : ", tf.test.is_gpu_available())

print("=" * 120)
print(
Contributor

Let's use logging instead of print so that this script is easier to run in the cloud.

from absl import logging, then logging.info()
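A small sketch of that switch; the GPU check below uses tf.config.list_physical_devices in place of the deprecated tf.test.is_gpu_available from the diff, which is an editorial substitution:

import tensorflow as tf
from absl import logging

logging.set_verbosity(logging.INFO)
logging.info("GPU available: %s", bool(tf.config.list_physical_devices("GPU")))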


# Load datasets
train_ds, test_ds, validation_ds = load_data()
train_ds = preprocess_data(dataset=train_ds, preprocessor=preprocessor)
Contributor

You don't need to apply the preprocessor explicitly; it is handled automatically by our model.
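A rough sketch of feeding the raw glue/mrpc sentence pairs straight to the task model; the batch size is an assumption, and classifier is a KerasNLP task model such as BertClassifier built elsewhere in the script:

import tensorflow_datasets as tfds

train_ds = tfds.load("glue/mrpc", split="train")
# Map the TFDS features into ((sentence1, sentence2), label) pairs.
train_ds = train_ds.map(
    lambda x: ((x["sentence1"], x["sentence2"]), x["label"])
).batch(32)

# No preprocess_data() call is needed; classifier.fit() tokenizes internally.
classifier.fit(train_ds, epochs=2)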

validation_ds = preprocess_data(
dataset=validation_ds, preprocessor=preprocessor
)
print("GLUE/MRPC Dataset Loaded!")
Contributor

We can delete this print; I believe TFDS already prints some output.

end_learning_rate=0.0,
)
optimizer = tf.keras.optimizers.experimental.AdamW(
lr, weight_decay=0.01, global_clipnorm=1.0
Contributor

I would be cautious about global_clipnorm. The GLUE script I wrote is a template, so I set the value as a reference; we can delete it from this file since it is more of an automated script.
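A sketch of the simplified optimizer setup without global_clipnorm; the learning-rate values and step count are placeholders, not the PR's numbers:

import tensorflow as tf

num_train_steps = 1000  # placeholder; the script would derive this from dataset size * epochs
lr = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-5,
    decay_steps=num_train_steps,
    end_learning_rate=0.0,
)
optimizer = tf.keras.optimizers.experimental.AdamW(lr, weight_decay=0.01)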


print("Training Finished!")
print(
f"Training took :: {(et-st):.4f} seconds, or {((et-st)/60):.2f} minutes, or {((et-st)/3600):.2f} hours!"
Contributor

We can keep just the seconds output, and let's call it wall_time, which is the official name.

Contributor

We should also report the validation accuracy and examples_per_second, which is epochs * number_of_records / wall_time.
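A hedged sketch of that reporting; the epoch count is illustrative, the 3,668 figure is the size of the glue/mrpc training split, and classifier and the datasets are assumed to come from earlier in the script:

import time

from absl import logging

epochs = 2  # illustrative
st = time.time()
classifier.fit(train_ds, validation_data=validation_ds, epochs=epochs)
wall_time = time.time() - st

# examples_per_second = epochs * number_of_records / wall_time
examples_per_second = epochs * 3668 / wall_time
val_metrics = classifier.evaluate(validation_ds, return_dict=True)

logging.info("wall_time: %.4f s", wall_time)
logging.info("examples_per_second: %.2f", examples_per_second)
logging.info("validation metrics: %s", val_metrics)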

@susnato
Contributor Author

susnato commented Mar 23, 2023

Hi @chenmoneygithub, I have pushed the changes; please check them. The Colab link showing these changes is https://colab.research.google.com/drive/1eGMOXUF826ckuY5Tx03F2i8CsSSPYnbZ?usp=sharing

@susnato susnato requested a review from chenmoneygithub March 23, 2023 14:20
@chenmoneygithub
Contributor

@susnato The code looks good to me, but we don't mean to modify examples/glue_benchmark/glue.py in place. Let's add a new file, keras_nlp/benchmarks/glue.py, and copy the code there, keeping the current examples/glue_benchmark/glue.py, which serves a different purpose. Thanks!

@susnato
Contributor Author

susnato commented Mar 28, 2023

Hi @chenmoneygithub, thanks for pointing that out!
I just pushed the new changes; please review them.

Contributor

@chenmoneygithub chenmoneygithub left a comment


Thanks, LGTM! I pushed a commit with some minor style fixes.

@chenmoneygithub
Contributor

Okay... there is a flag conflict issue because we are exporting this module via namex (Francois' tool). Ideally, we should be able to exclude a folder by not adding an __init__.py, but this functionality is not available yet. We will need to discuss how to get this submitted for now. @mattdangerw

@chenmoneygithub chenmoneygithub merged commit a5b6c45 into keras-team:master Mar 29, 2023
Development

Successfully merging this pull request may close these issues.

Evaluate newer models with GLUE