
[GPT2CausalLM] Unable to generate text if the GPT2CausalLMPreprocessor is not pretrained #1330


Description

@alessiosavi

Describe the bug
Unable to generate a sequence of SEQ_LEN words from a custom Tokenizer + GPT2CausalLMPreprocessor.
How can I train a custom GPT model from scratch?
By custom I mean:

  1. Generate/adapt the vocabulary and merges files for a custom tokenizer + preprocessor.
  2. Use a randomly initialized GPT backbone architecture like the one provided in the example.

To Reproduce
https://colab.research.google.com/drive/1035pjfqUoKUIo-KD2kCosLZw2N1F9wzJ?usp=sharing

Expected behavior
Loading a custom Tokenizer + GPT2CausalLMPreprocessor into the GPT2CausalLM should generate a sequence of SEQ_LEN words.

Additional context
The vocab and merges files are loaded from the GroNLP/gpt2-small-italian repository on Hugging Face.

Would you like to help us fix it?
Of course

Here is a preview version of the Colab notebook:

# -*- coding: utf-8 -*-
"""Untitled1.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1035pjfqUoKUIo-KD2kCosLZw2N1F9wzJ
"""

!pip install pip --upgrade

!pip install git+https://github.com/keras-team/keras-nlp.git keras-core tensorflow --upgrade

import keras_core as keras
import tensorflow as tf
import keras_nlp
print(keras.__version__, tf.__version__, keras_nlp.__version__)
# >> Using TensorFlow backend
# >> 0.1.7 2.15.0 0.7.0

!wget "https://huggingface.co/GroNLP/gpt2-small-italian/raw/main/merges.txt"# -o "merges.txt"
!wget "https://huggingface.co/GroNLP/gpt2-small-italian/raw/main/vocab.json"# -o "vocab.json"
!ls

prompt = "Questo e' un esempio"
SEQ_LEN = 32

"""This is a non working example of a custom tokenizer"""

custom_tokenizer = keras_nlp.models.GPT2Tokenizer(
    vocabulary="vocab.json.1",
    merges="merges.txt.1",
)
custom_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor(
    tokenizer=custom_tokenizer,
    sequence_length=SEQ_LEN,
)

# A small, randomly initialized (untrained) GPT-2 backbone.
custom_backbone = keras_nlp.models.GPT2Backbone(
    vocabulary_size=custom_preprocessor.tokenizer.vocabulary_size(),
    num_layers=3,
    num_heads=3,
    hidden_dim=32,
    intermediate_dim=64,
    max_sequence_length=SEQ_LEN,
)
# Wire the backbone and the custom preprocessor into a causal LM task model.
custom_gpt2_lm = keras_nlp.models.GPT2CausalLM(
    backbone=custom_backbone,
    preprocessor=custom_preprocessor,
)
custom_gpt2_lm.generate(prompt)
# >> Questo e' un esempio

"""In the above output, we have "rewritten" the input. We have not generated a sequence of `SEQ_LEN` words."""

# Inspect the tail of the custom merges and vocabulary.
custom_tokenizer.merges[-10:], list(custom_tokenizer.vocabulary.items())[-10:]

"""This is a working example"""

# Pretrained preprocessor/tokenizer loaded from the "gpt2_base_en" preset.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=SEQ_LEN,
)

# The same small, randomly initialized backbone architecture as above.
backbone = keras_nlp.models.GPT2Backbone(
    vocabulary_size=preprocessor.tokenizer.vocabulary_size(),
    num_layers=3,
    num_heads=3,
    hidden_dim=32,
    intermediate_dim=64,
    max_sequence_length=SEQ_LEN,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM(
    backbone=backbone,
    preprocessor=preprocessor,
)
print(gpt2_lm.generate(prompt))
# >> Questo e' un esempioGalleryblack tracked Diाupdatediege puppy oxygenía distur 347 fixtures allowanceIP goof Chocobo Elect NASAREF made Sunni compute


"""In the above output, we have we have generated a sequence of `SEQ_LEN` words"""

# Inspect the tail of the preset merges and vocabulary for comparison.
preprocessor.tokenizer.merges[-10:], list(preprocessor.tokenizer.vocabulary.items())[-10:]
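"""A further hedged check (my addition): `GPT2CausalLM.generate()` stops at the tokenizer's `end_token_id`, so it is worth comparing where `<|endoftext|>` sits in the two vocabularies."""

for name, tok in [("custom", custom_tokenizer), ("preset", preprocessor.tokenizer)]:
    print(name, "<|endoftext|> id:", tok.token_to_id("<|endoftext|>"))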

EDIT:
I think I've solved it. The problem is related to the vocabulary. Does someone know where I can find the specs for the `vocabulary` and `merges` parameters without digging into the code?
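For reference (my understanding, not from the keras-nlp docs): both files follow the original GPT-2 BPE format. `vocab.json` is a JSON object mapping each token string to an integer id, and `merges.txt` lists one merge rule per line as two space-separated tokens (a leading `#version` header, present in some dumps, may need to be stripped). A minimal sanity-check sketch under that assumption:

import json

# Assumes the standard GPT-2 BPE file format described above.
with open("vocab.json") as f:
    vocab = json.load(f)
with open("merges.txt") as f:
    merges = [line.rstrip("\n") for line in f if not line.startswith("#")]

# Every vocabulary entry should map a token string to an integer id.
assert all(isinstance(i, int) for i in vocab.values())
# Every merge rule should be exactly two space-separated tokens.
assert all(len(rule.split(" ")) == 2 for rule in merges if rule)
# GPT2Tokenizer requires the end-of-text special token in the vocabulary.
assert "<|endoftext|>" in vocab
print(len(vocab), "tokens,", len(merges), "merge rules")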
