Improve docstring of Albert #862
Conversation
Left some comments.
Overall this needs two things.
- Test out all the examples here; some are unrunnable. You can
  `pip install git+https://github.com/keras-team/keras-nlp.git --upgrade`
  in a colab to run our latest code changes.
- Pay close attention to formatting and style. Make sure to indent and leave line breaks for clarity to match the original PR. Note that copying and pasting from the GitHub UI will remove all empty lines.
@@ -33,7 +33,7 @@ def albert_kernel_initializer(stddev=0.02):

@keras_nlp_export("keras_nlp.models.AlbertBackbone")
class AlbertBackbone(Backbone):
-    """ALBERT encoder network.
+    """A ALBERT encoder network.
An ALBERT...
a classification task. For usage of this model with pre-trained weights, see
the `from_preset()` method.
This model attaches a classification head to a
`keras_nlp.model.AlbertBackbone`instance, mapping from the backbone outputs to logit output suitable for
space after backtick, make sure to keep line lengths under 80
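For example, that sentence could be re-wrapped along these lines (a sketch of the formatting only; the exact wording stays the author's):

```python
"""This model attaches a classification head to a
`keras_nlp.model.AlbertBackbone` instance, mapping from the backbone outputs
to logit output suitable for a classification task. For usage of this model
with pre-trained weights, see the `from_preset()` method.
"""
```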
@@ -107,19 +66,22 @@ class AlbertClassifier(Task):
    sequence_length=128,
)

-# Create a AlbertClassifier and fit your data.
+#Pretrained classifier.
Space after #
```python
features = ["The quick brown fox jumped.", "I forgot my homework."]
labels = [0, 3]
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
```
I'm pretty sure this will not work. You will need something like the below, as ALBERT is based on SentencePiece. Please make sure to test out your code!
```python
import io

import keras_nlp
import sentencepiece

# Train a tiny SentencePiece model in memory so the example is runnable.
bytes_io = io.BytesIO()
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=iter(["The quick brown fox jumped."]),
    model_writer=bytes_io,
    vocab_size=8,
    model_type="WORD",
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece="<pad>",
    unk_piece="<unk>",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    user_defined_symbols="[MASK]",
)
tokenizer = keras_nlp.tokenizers.AlbertTokenizer(proto=bytes_io.getvalue())
```
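For the classifier example quoted below, the trained tokenizer would then be wrapped in a preprocessor. A minimal sketch, assuming the `AlbertPreprocessor` constructor and the `sequence_length=128` value used elsewhere in this PR:

```python
# Build the `preprocessor` that the classifier example below passes in.
preprocessor = keras_nlp.models.AlbertPreprocessor(
    tokenizer=tokenizer,
    sequence_length=128,
)
```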
classifier = keras_nlp.models.AlbertClassifier.from_preset(
    "albert_base_en_uncased",
    num_classes=4,
    preprocessor=preprocessor,
)
# Re-compile (e.g., with a new learning rate).
I think you are missing the step from the BERT classifier example where we run `fit` before "re-compiling".
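For reference, a minimal sketch of that pattern, assuming the `features`, `labels`, and `classifier` from the quoted example above (the optimizer settings and batch size here are only illustrative):

```python
from tensorflow import keras

# Fit once with the default compilation.
classifier.fit(x=features, y=labels, batch_size=2)

# Re-compile (e.g., with a new learning rate).
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(5e-5),
    jit_compile=True,
)

# Fit again with the new settings.
classifier.fit(x=features, y=labels, batch_size=2)
```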
@@ -70,71 +57,65 @@ class AlbertPreprocessor(Preprocessor):
"waterfall" algorithm that allocates quota in a
left-to-right manner and fills up the buckets until we run
out of budget. It supports an arbitrary number of segments.
Call arguments:
Add newline, fix alignment
```python
preprocessor = keras_nlp.models.AlbertPreprocessor(tokenizer)
preprocessor("The quick brown fox jumped.")
```
Mapping with `tf.data.Dataset`.
Add empty newlines throughout this file to match the BERT version.
@@ -61,6 +61,13 @@ class AlbertTokenizer(SentencePieceTokenizer):

# Detokenization.
tokenizer.detokenize(tf.constant([[2, 14, 2231, 886, 2385, 3]]))
~10 lines up from here, you should be instantiating the tokenizer with from_preset
The line is `tokenizer = keras_nlp.models.AlbertTokenizer(proto="model.spm")`.
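A minimal sketch of the requested change, assuming the `albert_base_en_uncased` preset used elsewhere in this PR:

```python
import keras_nlp

# Instantiate the tokenizer from a preset instead of a local proto file.
tokenizer = keras_nlp.models.AlbertTokenizer.from_preset("albert_base_en_uncased")
tokenizer("The quick brown fox jumped.")
```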
@@ -61,6 +61,13 @@ class AlbertTokenizer(SentencePieceTokenizer):

# Detokenization.
tokenizer.detokenize(tf.constant([[2, 14, 2231, 886, 2385, 3]]))

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
This will not work! As mentioned above you will need the sentencepiece trainer.
loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
jit_compile=True
masked_lm = keras_nlp.models.AlbertMaskedLM.from_preset(
"albert_base_en_uncased",
Fix indentation.
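A sketch of how the two quoted fragments might read once indented, assuming a TensorFlow-backed Keras import (the four-space indent matches the convention in these docstrings):

```python
import keras_nlp
from tensorflow import keras

# Load the pretrained masked LM; arguments indented under the call.
masked_lm = keras_nlp.models.AlbertMaskedLM.from_preset(
    "albert_base_en_uncased",
)

# Re-compile with an explicit loss; arguments indented under the call.
masked_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,
)
```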
Thanks for the PR! There seem to be some unaddressed comments, please fix them, thanks!
This class implements a bi-directional Transformer-based encoder as
described in
-["ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"](https://arxiv.org/abs/1909.11942).
+["ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"]
Let's remove the linebreak here; it's okay to exceed the line-length limit if it's a hyperlink.
@@ -68,6 +68,15 @@ class AlbertMaskedLMPreprocessor(AlbertPreprocessor):
left-to-right manner and fills up the buckets until we run
out of budget. It supports an arbitrary number of segments.

Call arguments:
This comment has not been addressed, please fix it, thanks!
["The quick brown fox jumped.", "Call me Ishmael."] | ||
) | ||
preprocessor(sentences) | ||
preprocessor("The quick brown fox jumped.", "Call me Ishmael.") |
This one too: `preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])`
This has landed elsewhere. (We are lining up these changes for an upcoming release.) Thanks for the contribution!
Similar to #843
@mattdangerw @jbischof @chenmoneygithub