
Improve docstring of Albert #862

Closed
wants to merge 3 commits

Conversation

soma2000-lang
Contributor

@soma2000-lang soma2000-lang commented Mar 17, 2023

@soma2000-lang soma2000-lang changed the title Add Improve docstring of Albert Mar 17, 2023
@mattdangerw mattdangerw self-requested a review March 17, 2023 21:18
Member

@mattdangerw mattdangerw left a comment

Left some comments.

Overall this needs two things.

  • Test out all the examples here; some are unrunnable. You can `pip install git+https://github.com/keras-team/keras-nlp.git --upgrade` in a Colab to run our latest code changes.
  • Pay close attention to formatting and style. Make sure to indent and leave line breaks for clarity, to match the original PR. Note that copying and pasting from the GitHub UI will remove all empty lines.

@@ -33,7 +33,7 @@ def albert_kernel_initializer(stddev=0.02):

@keras_nlp_export("keras_nlp.models.AlbertBackbone")
class AlbertBackbone(Backbone):
"""ALBERT encoder network.
"""A ALBERT encoder network.
Member

An ALBERT...

a classification task. For usage of this model with pre-trained weights, see
the `from_preset()` method.
This model attaches a classification head to a
`keras_nlp.model.AlbertBackbone`instance, mapping from the backbone outputs to logit output suitable for
Member

Space after the backtick; make sure to keep line lengths under 80.
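
For example, the sentence could read something like this once wrapped (a sketch, reusing the wording from the diff above):

```
This model attaches a classification head to a
`keras_nlp.model.AlbertBackbone` instance, mapping from the backbone
outputs to logit output suitable for a classification task.
```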

@@ -107,19 +66,22 @@ class AlbertClassifier(Task):
sequence_length=128,
)

# Create a AlbertClassifier and fit your data.
#Pretrained classifier.
Member

Space after #

```python
features = ["The quick brown fox jumped.", "I forgot my homework."]
labels = [0, 3]
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
```
Member

I'm pretty sure this will not work. You will need something like the snippet below, as ALBERT is based on SentencePiece. Please make sure to test out your code!

```python
import io

import keras_nlp
import sentencepiece

bytes_io = io.BytesIO()
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=iter(["The quick brown fox jumped."]),
    model_writer=bytes_io,
    vocab_size=8,
    model_type="WORD",
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece="<pad>",
    unk_piece="<unk>",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    user_defined_symbols="[MASK]",
)
tokenizer = keras_nlp.tokenizers.AlbertTokenizer(proto=bytes_io.getvalue())
```
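
If it helps, a minimal sketch of how the trained proto could then be exercised (assuming the `tokenizer` from the snippet above; the `sequence_length` value is just illustrative):

```python
# Hypothetical usage of the toy SentencePiece tokenizer trained above.
preprocessor = keras_nlp.models.AlbertPreprocessor(
    tokenizer,
    sequence_length=128,
)
preprocessor("The quick brown fox jumped.")
```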

classifier = keras_nlp.models.AlbertClassifier.from_preset(
    "albert_base_en_uncased",
    num_classes=4,
    preprocessor=preprocessor,
)
# Re-compile (e.g., with a new learning rate).
Member

I think you are missing the part of the BERT classifier example where we run fit before re-compiling.
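
For reference, the pattern in the BERT classifier docstring looks roughly like this (a hedged sketch adapted to ALBERT; `features` and `labels` are the toy data defined earlier in this thread, and the learning rate is only illustrative):

```python
import keras_nlp
from tensorflow import keras

# Pretrained classifier: fit once with the default compilation...
classifier = keras_nlp.models.AlbertClassifier.from_preset(
    "albert_base_en_uncased",
    num_classes=4,
)
classifier.fit(x=features, y=labels, batch_size=2)

# ...then re-compile (e.g., with a new learning rate) and fit again.
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(5e-5),
    jit_compile=True,
)
classifier.fit(x=features, y=labels, batch_size=2)
```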

@@ -70,71 +57,65 @@ class AlbertPreprocessor(Preprocessor):
"waterfall" algorithm that allocates quota in a
left-to-right manner and fills up the buckets until we run
out of budget. It supports an arbitrary number of segments.
Call arguments:
Member

Add newline, fix alignment
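
Something along these lines (a sketch; the wording is approximated from the BERT preprocessor docstring rather than copied from this PR):

```
Call arguments:
    x: A tensor of single string sequences, or a tuple of multiple
        tensor sequences to be packed together. Inputs may be batched or
        unbatched.
    y: Any label data. Will be passed through unaltered.
    sample_weight: Any label weight data. Will be passed through unaltered.
```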

```python
preprocessor = keras_nlp.models.AlbertPreprocessor(tokenizer)
preprocessor("The quick brown fox jumped.")
```
Mapping with `tf.data.Dataset`.
Member

Add empty newlines throughout this file to match the BERT version.

@@ -61,6 +61,13 @@ class AlbertTokenizer(SentencePieceTokenizer):

# Detokenization.
tokenizer.detokenize(tf.constant([[2, 14, 2231, 886, 2385, 3]]))

Member

~10 lines up from here, you should be instantiating the tokenizer with from_preset

Contributor

The line is `tokenizer = keras_nlp.models.AlbertTokenizer(proto="model.spm")`.
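
In other words, something like the following would match the rest of the thread (a sketch; the preset name follows the one used in the classifier example above):

```python
# Instantiate the tokenizer from a preset instead of a local proto file.
tokenizer = keras_nlp.models.AlbertTokenizer.from_preset(
    "albert_base_en_uncased",
)
tokenizer("The quick brown fox jumped.")
```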

@@ -61,6 +61,13 @@ class AlbertTokenizer(SentencePieceTokenizer):

# Detokenization.
tokenizer.detokenize(tf.constant([[2, 14, 2231, 886, 2385, 3]]))

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
Member

This will not work! As mentioned above, you will need the SentencePiece trainer.

loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
jit_compile=True
masked_lm = keras_nlp.models.AlbertMaskedLM.from_preset(
"albert_base_en_uncased",
Member

Fix indentation.
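
A hedged sketch of how that block might look once the indentation is fixed (keyword arguments taken from the diff above; `keras` is assumed to be imported as in the other examples):

```python
masked_lm = keras_nlp.models.AlbertMaskedLM.from_preset(
    "albert_base_en_uncased",
)
masked_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,
)
```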

Contributor

@chenmoneygithub chenmoneygithub left a comment

Thanks for the PR! There seem to be some unaddressed comments; please fix them, thanks!


This class implements a bi-directional Transformer-based encoder as
described in
["ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"](https://arxiv.org/abs/1909.11942).
["ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"]
Contributor

Let's remove the line break here; it's okay to exceed the limit if it's a hyperlink.

@@ -68,6 +68,15 @@ class AlbertMaskedLMPreprocessor(AlbertPreprocessor):
left-to-right manner and fills up the buckets until we run
out of budget. It supports an arbitrary number of segments.

Call arguments:
Contributor

This comment has not been addressed, please fix it, thanks!

["The quick brown fox jumped.", "Call me Ishmael."]
)
preprocessor(sentences)
preprocessor("The quick brown fox jumped.", "Call me Ishmael.")
Contributor

This one too: `preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])`.

@mattdangerw
Member

This has landed elsewhere. (We are lining up these changes for an upcoming release.)

Thanks for the contribution!
