GLUE evaluation automation script #848
Conversation
Hi @chenmoneygithub, I uploaded the script, please check it. I tested it for some models and tasks, and they are working. My time limit for Colab GPU is over for today; tomorrow I will train those models (except BERT and RoBERTa) for 2 epochs and share the link.
Sorry, I probably missed some of the discussion leading up to this, but what is the point of this script over the existing one? I would definitely see why we might want additional functionality around hyper-parameter search, or easier deployment to a GCP instance, but as is this seems too close to what we already have.
@susnato Thanks for the PR! @mattdangerw The purpose is to make a lightweight benchmark that can be run with a single command, similar to this. We need it because sentiment analysis is too simple. @susnato I think what we want is a lightweight benchmark script which reports accuracy/elapsed time on a selected GLUE task; let's just use MRPC. As Matt pointed out, the current script follows the `examples/glue_benchmark` fashion, which is too broad to be used as a lightweight benchmark. Ideally, with your PR checked in, we should be able to run a command like the following to get the metrics we are interested in:
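Something along these lines, assuming the `model` and `preset` flag names discussed later in this thread:

```
python examples/glue_benchmark/glue.py \
    --model BertClassifier \
    --preset bert_tiny_en_uncased
```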
For the current PR, I think you can delete the code for tasks other than GLUE/MRPC, and we can rename the file accordingly. Thanks for your contribution! 🍷
Got it! That makes sense to me! How about, instead of a specific GLUE dataset, we add that as a flag too?
Totally fine with me if we only support `mrpc`.
Two high-level thoughts; I will leave the more detailed review for @chenmoneygithub.
Thanks very much!!
examples/glue_benchmark/glue_mrpc.py (Outdated)

```
@@ -0,0 +1,193 @@
# Copyright 2023 The KerasNLP Authors
```
Let's rename this to `glue.py` and take in `mrpc` as a task name.
Totally OK if we just error out for any other GLUE task right now, but that will allow us to grow this script into one for testing GLUE in its entirety.
Ok, then should I just remove the old `glue.py`? And I am sorry, I didn't fully understand "take in mrpc as a task name": should I include all tasks with `mrpc` being the default, or will it only be for `mrpc`?
@mattdangerw I think in the long term we will probably only use `stsb` (regression) and `mrpc` (classification). I am fine with naming it `glue.py`, but we should add a disclaimer at the header saying "Benchmark with [selected tasks]".
@susnato I believe what Matt is suggesting is:
- rename `glue_mrpc.py` to `glue.py`
- add a `task` flag, defaulting to `mrpc`; if any other task is passed, error out
- keep the code minimal for the `mrpc` task, and add comments noting that other tasks are not supported yet (see the sketch below)

This leaves room for scalability.
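A minimal sketch of that flag handling, assuming `absl.flags` as used elsewhere in the script (the flag description wording is an assumption):

```python
from absl import app, flags

FLAGS = flags.FLAGS

# Hypothetical task flag: default to mrpc, reject everything else for now.
flags.DEFINE_string(
    "task", "mrpc", "The GLUE task to benchmark; only `mrpc` is supported for now."
)


def main(_):
    if FLAGS.task != "mrpc":
        # Other GLUE tasks (cola, sst2, stsb, ...) are not supported yet.
        raise ValueError(
            f"GLUE task {FLAGS.task!r} is not supported yet; only 'mrpc' is available."
        )


if __name__ == "__main__":
    app.run(main)
```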
examples/glue_benchmark/glue_mrpc.py (Outdated)

```python
    backbone.trainable = False
else:
    backbone.trainable = True
# If the model has pooled_output
```
I would avoid this special casing and actually use the `XXClassifier` classes. This will be important for things like BART, which is going to look substantially different (feed the input sequence twice, to the encoder and the decoder, and use the last token representation in the decoder block).
We could also have this script by default use the compilation defaults in the `XXClassifier` classes, which would be a great way to test them out and to make sure they are reasonable defaults.
Thanks again for your comments; in the next commit I will change it to `XXClassifier`.
Hi @mattdangerw @chenmoneygithub, thanks for your comments! Here is a Colab link showing that this script works (the version used in the Colab does not show the elapsed time, but the recently committed version does): https://colab.research.google.com/drive/1eGMOXUF826ckuY5Tx03F2i8CsSSPYnbZ?usp=sharing
EDIT: I was about to send this comment when @mattdangerw suggested some changes in the meantime; waiting for your thoughts, @chenmoneygithub
@susnato Thanks! After you push the new commit, I will do a full review pass.
Hi @chenmoneygithub, sorry for the delay; I have updated and renamed the script, please check it.
Thanks for refactoring it! Overall looks good! Dropped some comments.
examples/glue_benchmark/glue.py (Outdated)

```
@@ -11,10 +11,14 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import csv

# DISCLAIMER:This script only supports GLUE/mrpc (for now). #
```
You can use the standard header format and include a one-sentence description of this file. For example:

```python
"""GLUE benchmark script to test model performance.

To run the script, use this command:
{Put command here}

Disclaimer: This script only supports GLUE/mrpc (for now).
"""
```
examples/glue_benchmark/glue.py (Outdated)

```python
    "Learning rate",
)

flags.DEFINE_string("model", None, "The Model you want to train and evaluate.")
```
You can use the same flag description as in benchmarks/sentiment_analysis.py: "The name of the classifier such as BertClassifier."
examples/glue_benchmark/glue.py (Outdated)

```python
    None,
    "The name of TPU to connect to. If None, no TPU will be used. If you only "
    "have one TPU, use `local`",
    "The model preset(eg. For bert it is 'bert_base_en', 'bert_tiny_en_uncased')",
)
```
We need another flag, `mixed_precision_policy` (see link).
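A hedged sketch of what that flag could look like (the default value and where it is applied are assumptions):

```python
import tensorflow as tf
from absl import flags

flags.DEFINE_string(
    "mixed_precision_policy",
    "mixed_float16",
    "The global Keras mixed-precision policy, e.g. 'mixed_float16' or 'float32'.",
)

# Applied once at startup, before any model is built:
# tf.keras.mixed_precision.set_global_policy(FLAGS.mixed_precision_policy)
```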
examples/glue_benchmark/glue.py (Outdated)

```
@@ -23,86 +27,39 @@

import keras_nlp

FLAGS = flags.FLAGS
seed = 42
os.environ["PYTHONHASHSEED"] = str(seed)
```
Is this required?
examples/glue_benchmark/glue.py (Outdated)

```python
for _preset in symbol.presets:
    if preset and _preset != preset:
        continue
    if "Backbone" in name:
```
I think we can use the same method as here to simplify the code. Later on we can move this method to `utils/` since we are reusing it.
Reading this function, I think you are trying to enable users to pass `--model="Bert"`, which is actually a nice idea! But for now let's still force users to pass "BertClassifier" as the flag, so that we have a standardized workflow (see the sketch below).
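A hedged sketch of that standardized lookup (the `preset` flag name is an assumption based on this thread):

```python
import keras_nlp

# FLAGS.model holds the full class name, e.g. "BertClassifier".
model_class = getattr(keras_nlp.models, FLAGS.model)
classifier = model_class.from_preset(FLAGS.preset)
```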
examples/glue_benchmark/glue.py (Outdated)

```python
print("GPU available : ", tf.test.is_gpu_available())

print("=" * 120)
print(
```
Let's use `logging` instead of `print` so that this script is easier to run on cloud: `from absl import logging`, then `logging.info()`.
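For example, the GPU check above might become something like this (a sketch; note that `tf.test.is_gpu_available()` is deprecated in favor of `tf.config.list_physical_devices`):

```python
import tensorflow as tf
from absl import logging

logging.info("GPU available: %s", bool(tf.config.list_physical_devices("GPU")))
```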
examples/glue_benchmark/glue.py (Outdated)

```python
# Load datasets
train_ds, test_ds, validation_ds = load_data()
train_ds = preprocess_data(dataset=train_ds, preprocessor=preprocessor)
```
You don't need to apply the preprocessor explicitly; it is handled automatically by our model.
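For instance, a sketch under the assumption that `train_ds` is the raw (untokenized) MRPC split loaded by the script, with an illustrative preset name:

```python
import keras_nlp

# from_preset() attaches a matching preprocessor to the task model, so the
# raw string dataset can be passed straight to fit() without manual tokenization.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=2
)
classifier.fit(train_ds, epochs=2)
```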
examples/glue_benchmark/glue.py
Outdated
validation_ds = preprocess_data( | ||
dataset=validation_ds, preprocessor=preprocessor | ||
) | ||
print("GLUE/MRPC Dataset Loaded!") |
We can delete this print; I believe TFDS already has some prints.
examples/glue_benchmark/glue.py
Outdated
end_learning_rate=0.0, | ||
) | ||
optimizer = tf.keras.optimizers.experimental.AdamW( | ||
lr, weight_decay=0.01, global_clipnorm=1.0 |
I would be cautious with `global_clipnorm`. The GLUE script I wrote is a template, so I set the value as a reference; we can delete it from this file since this is more of an automated script.
examples/glue_benchmark/glue.py
Outdated
|
||
print("Training Finished!") | ||
print( | ||
f"Training took :: {(et-st):.4f} seconds, or {((et-st)/60):.2f} minutes, or {((et-st)/3600):.2f} hours!" |
We can just keep the seconds output, and let's call it `wall_time`, which is the official name.
We should also report the validation accuracy and `examples_per_second`, which is `epochs * number_of_records / wall_time` (see the sketch below).
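A sketch of those reported metrics; `st`/`et` follow the script's timing code above, while `num_train_records`, the `epochs` flag, and the history key are assumptions:

```python
from absl import logging

wall_time = et - st  # Training wall time in seconds.
examples_per_second = (FLAGS.epochs * num_train_records) / wall_time
validation_accuracy = history.history["val_accuracy"][-1]  # Assumed metric name.

logging.info("wall_time: %.4f s", wall_time)
logging.info("examples_per_second: %.2f", examples_per_second)
logging.info("validation_accuracy: %.4f", validation_accuracy)
```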
Hi @chenmoneygithub, I have pushed the changes, please check them. The Colab link showing these changes is https://colab.research.google.com/drive/1eGMOXUF826ckuY5Tx03F2i8CsSSPYnbZ?usp=sharing
@susnato The code looks good to me, but we don't mean to modify
Hi @chenmoneygithub, thanks for pointing that out!
Thanks, LGTM! I pushed a commit with some minor style fixes.
Okay... there is a flag conflict issue because we are exporting this module via
What does this PR do?
Fixes #764
@chenmoneygithub