GLUE evaluation automation script #848
Conversation
Hi @chenmoneygithub, I uploaded the script, please check it. I tested it for some models and tasks, and they are working. My time limit for Colab GPU is over for today; tomorrow I will train those models (except BERT and RoBERTa) for 2 epochs and share the link.
Sorry, I probably missed some of the discussion leading up to this, but what is the point of this script over the existing one? I would definitely see why we might want additional functionality around hyper-parameter search, or easier deployment to a GCP instance, but as is this seems too close to what we already have.
@susnato Thanks for the PR! @mattdangerw The purpose is to make a lightweight benchmark that can be run with a single command, similar to this. We need it because sentiment analysis is too simple. @susnato I think what we want is a lightweight benchmark script which reports accuracy/elapsed time on a selected GLUE task; let's just use MRPC. As Matt pointed out, the current script follows the `examples/glue_benchmark` fashion, which is too broad to be used as a lightweight benchmark. Ideally, with your PR checked in, we should be able to run a command like the following to get the metrics we are interested in:
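Something along these lines, assuming the `model` and `preset` flag names discussed later in this thread:

```
python examples/glue_benchmark/glue.py \
    --model BertClassifier \
    --preset bert_tiny_en_uncased
```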
For the current PR, I think you can delete the code for tasks other than GLUE/MRPC, and we can rename the file accordingly. Thanks for your contribution! 🍷
Got it! That makes sense to me! How about, instead of a specific GLUE dataset, we add that as a flag too?
Totally fine with me if we only support `mrpc`.
Two high-level thoughts; I will leave the more detailed review for @chenmoneygithub.
Thanks very much!!
examples/glue_benchmark/glue_mrpc.py (Outdated)

```
@@ -0,0 +1,193 @@
# Copyright 2023 The KerasNLP Authors
```
Let's rename this to `glue.py` and take in `mrpc` as a task name.
Totally OK if we just error out for any other GLUE task right now, but that will allow us to grow this script into one for testing GLUE in its entirety.
Ok, then should I just remove the old `glue.py`? And I am sorry, I didn't fully understand "take in mrpc as a task name": should I include all tasks with `mrpc` being the default, or will it only be for `mrpc`?
@mattdangerw I think in the long term we will probably only use `stsb` (regression) and `mrpc` (classification). I am fine with naming it `glue.py`, but we should add a disclaimer at the header saying "Benchmark with [selected tasks]".
@susnato I believe what Matt is suggesting is:
- rename `glue_mrpc.py` to `glue.py`
- add a `task` flag, defaulting to `mrpc`; if any other task is passed, error out
- keep the code minimal for the `mrpc` task, and add comments noting that other tasks are not supported yet (see the sketch below)

This leaves room for scalability.
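A minimal sketch of that flag handling, assuming `absl.flags` as used elsewhere in the script (the flag description wording is an assumption):

```python
from absl import app, flags

FLAGS = flags.FLAGS

# Hypothetical task flag: default to mrpc, reject everything else for now.
flags.DEFINE_string(
    "task", "mrpc", "The GLUE task to benchmark; only `mrpc` is supported for now."
)


def main(_):
    if FLAGS.task != "mrpc":
        # Other GLUE tasks (cola, sst2, stsb, ...) are not supported yet.
        raise ValueError(
            f"GLUE task {FLAGS.task!r} is not supported yet; only 'mrpc' is available."
        )


if __name__ == "__main__":
    app.run(main)
```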
examples/glue_benchmark/glue_mrpc.py (Outdated)

```python
    backbone.trainable = False
else:
    backbone.trainable = True
# If the model has pooled_output
```
I would avoid this special casing and actually use the `XXClassifier` classes. This will be important for things like BART, which is going to look substantially different (feed the input sequence twice, to the encoder and the decoder, and use the last token representation in the decoder block).
We could also have this script by default use the compilation defaults in the `XXClassifier` classes, which would be a great way to test them out and to make sure they are reasonable defaults.
Thanks again for your comments; in the next commit I will change it to `XXClassifier`.
Hi @mattdangerw @chenmoneygithub, thanks for your comments! Here is a Colab link showing that this script works (the version used in the Colab does not show the elapsed time, but the recently committed version does): https://colab.research.google.com/drive/1eGMOXUF826ckuY5Tx03F2i8CsSSPYnbZ?usp=sharing
EDIT: I was about to send this comment when @mattdangerw suggested some changes in the meantime; waiting for your thoughts, @chenmoneygithub
@susnato Thanks! After you push the new commit, I will do a full review pass.
Hi @chenmoneygithub, sorry for the delay; I have updated and renamed the script, please check it.
Thanks for refactoring it! Overall looks good! Dropped some comments.
examples/glue_benchmark/glue.py (Outdated)

```
@@ -11,10 +11,14 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import csv

# DISCLAIMER:This script only supports GLUE/mrpc (for now). #
```
You can use the standard header format and include a one-sentence description of this file. For example:

```python
"""GLUE benchmark script to test model performance.

To run the script, use this command:
{Put command here}

Disclaimer: This script only supports GLUE/mrpc (for now).
"""
```
examples/glue_benchmark/glue.py (Outdated)

```python
    "Learning rate",
)

flags.DEFINE_string("model", None, "The Model you want to train and evaluate.")
```
You can use the same flag description as in benchmarks/sentiment_analysis.py: "The name of the classifier such as BertClassifier."
examples/glue_benchmark/glue.py (Outdated)

```python
    None,
    "The name of TPU to connect to. If None, no TPU will be used. If you only "
    "have one TPU, use `local`",
    "The model preset(eg. For bert it is 'bert_base_en', 'bert_tiny_en_uncased')",
)
```
We need another flag, `mixed_precision_policy` (see link).
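A hedged sketch of what that flag could look like (the default value and where it is applied are assumptions):

```python
import tensorflow as tf
from absl import flags

flags.DEFINE_string(
    "mixed_precision_policy",
    "mixed_float16",
    "The global Keras mixed-precision policy, e.g. 'mixed_float16' or 'float32'.",
)

# Applied once at startup, before any model is built:
# tf.keras.mixed_precision.set_global_policy(FLAGS.mixed_precision_policy)
```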
examples/glue_benchmark/glue.py (Outdated)

```
@@ -23,86 +27,39 @@

import keras_nlp

FLAGS = flags.FLAGS
seed = 42
os.environ["PYTHONHASHSEED"] = str(seed)
```
Is this required?
examples/glue_benchmark/glue.py (Outdated)

```python
for _preset in symbol.presets:
    if preset and _preset != preset:
        continue
    if "Backbone" in name:
```
I think we can use the same method as here to simplify the code. Later on we can move this method to `utils/` since we are reusing it.
Reading this function, I think you are trying to enable users to pass `--model="Bert"`, which is actually a nice idea! But for now let's still force users to pass "BertClassifier" as the flag, so that we have a standardized workflow (see the sketch below).
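A hedged sketch of that standardized lookup (the `preset` flag name is an assumption based on this thread):

```python
import keras_nlp

# FLAGS.model holds the full class name, e.g. "BertClassifier".
model_class = getattr(keras_nlp.models, FLAGS.model)
classifier = model_class.from_preset(FLAGS.preset)
```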
examples/glue_benchmark/glue.py (Outdated)

```python
print("GPU available : ", tf.test.is_gpu_available())

print("=" * 120)
print(
```
Let's use `logging` instead of `print` so that this script is easier to run on cloud: `from absl import logging`, then `logging.info()`.
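For example, the GPU check above might become something like this (a sketch; note that `tf.test.is_gpu_available()` is deprecated in favor of `tf.config.list_physical_devices`):

```python
import tensorflow as tf
from absl import logging

logging.info("GPU available: %s", bool(tf.config.list_physical_devices("GPU")))
```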
examples/glue_benchmark/glue.py (Outdated)

```python
# Load datasets
train_ds, test_ds, validation_ds = load_data()
train_ds = preprocess_data(dataset=train_ds, preprocessor=preprocessor)
```
You don't need to apply the preprocessor explicitly; it is handled automatically by our model.
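For instance, a sketch under the assumption that `train_ds` is the raw (untokenized) MRPC split loaded by the script, with an illustrative preset name:

```python
import keras_nlp

# from_preset() attaches a matching preprocessor to the task model, so the
# raw string dataset can be passed straight to fit() without manual tokenization.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=2
)
classifier.fit(train_ds, epochs=2)
```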
examples/glue_benchmark/glue.py
Outdated
validation_ds = preprocess_data( | ||
dataset=validation_ds, preprocessor=preprocessor | ||
) | ||
print("GLUE/MRPC Dataset Loaded!") |
We can delete this print; I believe TFDS already has some prints.
examples/glue_benchmark/glue.py
Outdated
end_learning_rate=0.0, | ||
) | ||
optimizer = tf.keras.optimizers.experimental.AdamW( | ||
lr, weight_decay=0.01, global_clipnorm=1.0 |
I would be cautious with `global_clipnorm`. The GLUE script I wrote is a template, so I set the value as a reference; we can delete it from this file since this is more of an automated script.
examples/glue_benchmark/glue.py
Outdated
|
||
print("Training Finished!") | ||
print( | ||
f"Training took :: {(et-st):.4f} seconds, or {((et-st)/60):.2f} minutes, or {((et-st)/3600):.2f} hours!" |
We can just keep the seconds output, and let's call it `wall_time`, which is the official name.
We should also report the validation accuracy and `examples_per_second`, which is `epochs * number_of_records / wall_time` (see the sketch below).
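A sketch of those reported metrics; `st`/`et` follow the script's timing code above, while `num_train_records`, the `epochs` flag, and the history key are assumptions:

```python
from absl import logging

wall_time = et - st  # Training wall time in seconds.
examples_per_second = (FLAGS.epochs * num_train_records) / wall_time
validation_accuracy = history.history["val_accuracy"][-1]  # Assumed metric name.

logging.info("wall_time: %.4f s", wall_time)
logging.info("examples_per_second: %.2f", examples_per_second)
logging.info("validation_accuracy: %.4f", validation_accuracy)
```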
Hi @chenmoneygithub, I have pushed the changes, please check them. The Colab link showing these changes is https://colab.research.google.com/drive/1eGMOXUF826ckuY5Tx03F2i8CsSSPYnbZ?usp=sharing
@susnato The code looks good to me, but we don't mean to modify
Hi @chenmoneygithub, thanks for pointing that out!
Thanks, LGTM! I pushed a commit with some minor style fixes.
Okay... there is a flag conflict issue because we are exporting this module via
What does this PR do?
Fixes #764
@chenmoneygithub