
[GPT2CausalLM] Unable to generate text if the GPT2CausalLMPreprocessor is not pretrained #1330


Description

@alessiosavi

Describe the bug
Unable to generate a sequence of SEQ_LEN words from a custom Tokenizer + GPT2CausalLMPreprocessor.
How can I train a custom GPT model from scratch?
By custom I mean:

  1. Generate/adapt the vocabulary and merges files for a custom tokenizer + preprocessor.
  2. Use a randomly initialized GPT backbone architecture like the one provided in the example.

To Reproduce
https://colab.research.google.com/drive/1035pjfqUoKUIo-KD2kCosLZw2N1F9wzJ?usp=sharing

Expected behavior
Loading a custom Tokenizer + GPT2CausalLMPreprocessor into the GPT2CausalLM should generate a sequence of SEQ_LEN words.

Additional context
The vocab and merges files are loaded from the GroNLP/gpt2-small-italian repository on Hugging Face.

Would you like to help us fix it?
Of course

Here is a preview version of the Colab notebook:

# -*- coding: utf-8 -*-
"""Untitled1.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1035pjfqUoKUIo-KD2kCosLZw2N1F9wzJ
"""

!pip install pip --upgrade

!pip install git+https://github.com/keras-team/keras-nlp.git keras-core tensorflow --upgrade

import keras_core as keras
import tensorflow as tf
import keras_nlp
print(keras.__version__, tf.__version__, keras_nlp.__version__)
# >> Using TensorFlow backend
# >> 0.1.7 2.15.0 0.7.0

!wget "https://huggingface.co/GroNLP/gpt2-small-italian/raw/main/merges.txt"# -o "merges.txt"
!wget "https://huggingface.co/GroNLP/gpt2-small-italian/raw/main/vocab.json"# -o "vocab.json"
!ls

prompt = "Questo e' un esempio"
SEQ_LEN = 32

"""This is a non working example of a custom tokenizer"""

custom_tokenizer = keras_nlp.models.GPT2Tokenizer(
    vocabulary="vocab.json.1",
    merges="merges.txt.1",
)
custom_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor(
    tokenizer=custom_tokenizer,
    sequence_length=SEQ_LEN,
)

# A small, randomly initialized (untrained) GPT-2 backbone.
custom_backbone = keras_nlp.models.GPT2Backbone(
    vocabulary_size=custom_preprocessor.tokenizer.vocabulary_size(),
    num_layers=3,
    num_heads=3,
    hidden_dim=32,
    intermediate_dim=64,
    max_sequence_length=SEQ_LEN,
)
# Wire the backbone and the custom preprocessor into a causal LM task model.
custom_gpt2_lm = keras_nlp.models.GPT2CausalLM(
    backbone=custom_backbone,
    preprocessor=custom_preprocessor,
)
custom_gpt2_lm.generate(prompt)
# >> Questo e' un esempio

"""In the above output, we have "rewritten" the input. We have not generated a sequence of `SEQ_LEN` words."""

# Inspect the tail of the custom merges and vocabulary.
custom_tokenizer.merges[-10:], list(custom_tokenizer.vocabulary.items())[-10:]

"""This is a working example"""

# Pretrained preprocessor/tokenizer loaded from the "gpt2_base_en" preset.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=SEQ_LEN,
)

# The same small, randomly initialized backbone architecture as above.
backbone = keras_nlp.models.GPT2Backbone(
    vocabulary_size=preprocessor.tokenizer.vocabulary_size(),
    num_layers=3,
    num_heads=3,
    hidden_dim=32,
    intermediate_dim=64,
    max_sequence_length=SEQ_LEN,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM(
    backbone=backbone,
    preprocessor=preprocessor,
)
print(gpt2_lm.generate(prompt))
# >> Questo e' un esempioGalleryblack tracked Diाupdatediege puppy oxygenía distur 347 fixtures allowanceIP goof Chocobo Elect NASAREF made Sunni compute


"""In the above output, we have we have generated a sequence of `SEQ_LEN` words"""

# Inspect the tail of the preset merges and vocabulary for comparison.
preprocessor.tokenizer.merges[-10:], list(preprocessor.tokenizer.vocabulary.items())[-10:]
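"""A further hedged check (my addition): `GPT2CausalLM.generate()` stops at the tokenizer's `end_token_id`, so it is worth comparing where `<|endoftext|>` sits in the two vocabularies."""

for name, tok in [("custom", custom_tokenizer), ("preset", preprocessor.tokenizer)]:
    print(name, "<|endoftext|> id:", tok.token_to_id("<|endoftext|>"))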

EDIT:
I think I've solved it. The problem is related to the vocabulary. Does someone know where I can find the specs for the `vocabulary` and `merges` parameters without digging into the code?
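For reference (my understanding, not from the keras-nlp docs): both files follow the original GPT-2 BPE format. `vocab.json` is a JSON object mapping each token string to an integer id, and `merges.txt` lists one merge rule per line as two space-separated tokens (a leading `#version` header, present in some dumps, may need to be stripped). A minimal sanity-check sketch under that assumption:

import json

# Assumes the standard GPT-2 BPE file format described above.
with open("vocab.json") as f:
    vocab = json.load(f)
with open("merges.txt") as f:
    merges = [line.rstrip("\n") for line in f if not line.startswith("#")]

# Every vocabulary entry should map a token string to an integer id.
assert all(isinstance(i, int) for i in vocab.values())
# Every merge rule should be exactly two space-separated tokens.
assert all(len(rule.split(" ")) == 2 for rule in merges if rule)
# GPT2Tokenizer requires the end-of-text special token in the vocabulary.
assert "<|endoftext|>" in vocab
print(len(vocab), "tokens,", len(merges), "merge rules")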
