
Improve docstring of Albert #862

Closed
wants to merge 3 commits

Conversation

soma2000-lang
Contributor

@soma2000-lang soma2000-lang commented Mar 17, 2023

@soma2000-lang soma2000-lang changed the title Add Improve docstring of Albert Mar 17, 2023
@mattdangerw mattdangerw self-requested a review March 17, 2023 21:18
Member

@mattdangerw mattdangerw left a comment

Left some comments.

Overall this needs two things.

  • Test out all the examples here; some are unrunnable. You can `pip install git+https://github.com/keras-team/keras-nlp.git --upgrade` in a Colab to run our latest code changes.
  • Pay close attention to formatting and style. Make sure to indent and leave line breaks for clarity, to match the original PR. Note that copying and pasting from the GitHub UI will remove all empty lines.

@@ -33,7 +33,7 @@ def albert_kernel_initializer(stddev=0.02):

@keras_nlp_export("keras_nlp.models.AlbertBackbone")
class AlbertBackbone(Backbone):
"""ALBERT encoder network.
"""A ALBERT encoder network.
Member

An ALBERT...

a classification task. For usage of this model with pre-trained weights, see
the `from_preset()` method.
This model attaches a classification head to a
`keras_nlp.model.AlbertBackbone`instance, mapping from the backbone outputs to logit output suitable for
Member

Space after the backtick; make sure to keep line lengths under 80.
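
For example, the sentence could read something like this once wrapped (a sketch, reusing the wording from the diff above):

```
This model attaches a classification head to a
`keras_nlp.model.AlbertBackbone` instance, mapping from the backbone
outputs to logit output suitable for a classification task.
```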

@@ -107,19 +66,22 @@ class AlbertClassifier(Task):
sequence_length=128,
)

# Create a AlbertClassifier and fit your data.
#Pretrained classifier.
Member

Space after #

```python
features = ["The quick brown fox jumped.", "I forgot my homework."]
labels = [0, 3]
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
```
Member

I'm pretty sure this will not work. You will need something like the snippet below, as ALBERT is based on SentencePiece. Please make sure to test out your code!

```python
import io

import keras_nlp
import sentencepiece

bytes_io = io.BytesIO()
sentencepiece.SentencePieceTrainer.train(
    sentence_iterator=iter(["The quick brown fox jumped."]),
    model_writer=bytes_io,
    vocab_size=8,
    model_type="WORD",
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    pad_piece="<pad>",
    unk_piece="<unk>",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    user_defined_symbols="[MASK]",
)
tokenizer = keras_nlp.tokenizers.AlbertTokenizer(proto=bytes_io.getvalue())
```
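
If it helps, a minimal sketch of how the trained proto could then be exercised (assuming the `tokenizer` from the snippet above; the `sequence_length` value is just illustrative):

```python
# Hypothetical usage of the toy SentencePiece tokenizer trained above.
preprocessor = keras_nlp.models.AlbertPreprocessor(
    tokenizer,
    sequence_length=128,
)
preprocessor("The quick brown fox jumped.")
```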

classifier = keras_nlp.models.AlbertClassifier.from_preset(
    "albert_base_en_uncased",
    num_classes=4,
    preprocessor=preprocessor,
)
# Re-compile (e.g., with a new learning rate).
Member

I think you are missing the part of the BERT classifier example where we run fit before re-compiling.
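
For reference, the pattern in the BERT classifier docstring looks roughly like this (a hedged sketch adapted to ALBERT; `features` and `labels` are the toy data defined earlier in this thread, and the learning rate is only illustrative):

```python
import keras_nlp
from tensorflow import keras

# Pretrained classifier: fit once with the default compilation...
classifier = keras_nlp.models.AlbertClassifier.from_preset(
    "albert_base_en_uncased",
    num_classes=4,
)
classifier.fit(x=features, y=labels, batch_size=2)

# ...then re-compile (e.g., with a new learning rate) and fit again.
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(5e-5),
    jit_compile=True,
)
classifier.fit(x=features, y=labels, batch_size=2)
```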

@@ -70,71 +57,65 @@ class AlbertPreprocessor(Preprocessor):
"waterfall" algorithm that allocates quota in a
left-to-right manner and fills up the buckets until we run
out of budget. It supports an arbitrary number of segments.
Call arguments:
Member

Add newline, fix alignment
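
Something along these lines (a sketch; the wording is approximated from the BERT preprocessor docstring rather than copied from this PR):

```
Call arguments:
    x: A tensor of single string sequences, or a tuple of multiple
        tensor sequences to be packed together. Inputs may be batched or
        unbatched.
    y: Any label data. Will be passed through unaltered.
    sample_weight: Any label weight data. Will be passed through unaltered.
```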

```python
preprocessor = keras_nlp.models.AlbertPreprocessor(tokenizer)
preprocessor("The quick brown fox jumped.")
```
Mapping with `tf.data.Dataset`.
Member

Add empty newlines throughout this file to match the BERT version.

@@ -61,6 +61,13 @@ class AlbertTokenizer(SentencePieceTokenizer):

# Detokenization.
tokenizer.detokenize(tf.constant([[2, 14, 2231, 886, 2385, 3]]))

Member

~10 lines up from here, you should be instantiating the tokenizer with from_preset

Contributor

The line is `tokenizer = keras_nlp.models.AlbertTokenizer(proto="model.spm")`.
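
In other words, something like the following would match the rest of the thread (a sketch; the preset name follows the one used in the classifier example above):

```python
# Instantiate the tokenizer from a preset instead of a local proto file.
tokenizer = keras_nlp.models.AlbertTokenizer.from_preset(
    "albert_base_en_uncased",
)
tokenizer("The quick brown fox jumped.")
```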

@@ -61,6 +61,13 @@ class AlbertTokenizer(SentencePieceTokenizer):

# Detokenization.
tokenizer.detokenize(tf.constant([[2, 14, 2231, 886, 2385, 3]]))

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
Member

This will not work! As mentioned above, you will need the SentencePiece trainer.

loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
jit_compile=True
masked_lm = keras_nlp.models.AlbertMaskedLM.from_preset(
"albert_base_en_uncased",
Member

Fix indentation.
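
A hedged sketch of how that block might look once the indentation is fixed (keyword arguments taken from the diff above; `keras` is assumed to be imported as in the other examples):

```python
masked_lm = keras_nlp.models.AlbertMaskedLM.from_preset(
    "albert_base_en_uncased",
)
masked_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,
)
```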

Contributor

@chenmoneygithub chenmoneygithub left a comment

Thanks for the PR! There seem to be some unaddressed comments; please fix them, thanks!


This class implements a bi-directional Transformer-based encoder as
described in
["ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"](https://arxiv.org/abs/1909.11942).
["ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"]
Contributor

Let's remove the line break here; it's okay to exceed the limit if it's a hyperlink.

@@ -68,6 +68,15 @@ class AlbertMaskedLMPreprocessor(AlbertPreprocessor):
left-to-right manner and fills up the buckets until we run
out of budget. It supports an arbitrary number of segments.

Call arguments:
Contributor

This comment has not been addressed, please fix it, thanks!

["The quick brown fox jumped.", "Call me Ishmael."]
)
preprocessor(sentences)
preprocessor("The quick brown fox jumped.", "Call me Ishmael.")
Contributor

This one too: `preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])`.

@mattdangerw
Member

This has landed elsewhere. (We are lining up these changes for an upcoming release.)

Thanks for the contribution!
