Describe the bug
Unable to generate a sequence of SEQ_LEN words from a custom Tokenizer + CausalLMPreprocessor.
How can I train a custom GPT model from scratch?
With custom I mean:
- Generate/adapt the vocabulary and merges files for a custom tokenizer + preprocessor (see the sketch below).
- Use a randomly initialized GPT backbone architecture like the one provided in the example.
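For reference, one way to produce these two files (a minimal sketch, assuming the HuggingFace tokenizers package and a placeholder corpus file, not something KerasNLP itself provides):

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a placeholder corpus and write out
# the GPT-2 style vocab.json + merges.txt files.
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["my_corpus.txt"],           # placeholder corpus path
    vocab_size=30000,                  # placeholder size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],  # GPT-2 end-of-text token
)
bpe.save_model(".")  # writes vocab.json and merges.txt to the current directory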
To Reproduce
https://colab.research.google.com/drive/1035pjfqUoKUIo-KD2kCosLZw2N1F9wzJ?usp=sharing
Expected behavior
Loading a custom Tokenizer + CausalLMPreprocessor into the CausalLM should generate a sequence of SEQ_LEN words.
Additional context
The vocab and merges files are loaded from here: GroNLP/gpt2-small-italian on Hugging Face.
Would you like to help us fix it?
Of course
Here is a preview version of the Colab notebook:
# -*- coding: utf-8 -*-
"""Untitled1.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1035pjfqUoKUIo-KD2kCosLZw2N1F9wzJ
"""
!pip install pip --upgrade
!pip install git+https://github.com/keras-team/keras-nlp.git keras-core tensorflow --upgrade
import keras_core as keras
import tensorflow as tf
import keras_nlp
print(keras.__version__,tf.__version__,keras_nlp.__version__)
# >> Using TensorFlow backend
# >> 0.1.7 2.15.0 0.7.0
!wget "https://huggingface.co/GroNLP/gpt2-small-italian/raw/main/merges.txt" -O "merges.txt"
!wget "https://huggingface.co/GroNLP/gpt2-small-italian/raw/main/vocab.json" -O "vocab.json"
!ls
prompt = "Questo e' un esempio"
SEQ_LEN = 32
"""This is a non working example of a custom tokenizer"""
custom_tokenizer = keras_nlp.models.GPT2Tokenizer(
vocabulary="vocab.json",
merges="merges.txt",
)
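"""Sanity check on the custom tokenizer (an assumption on my side: `get_vocabulary()` and calling the tokenizer directly behave as on the other KerasNLP tokenizers): the `<|endoftext|>` token should be in the vocabulary and the prompt should map to a non-trivial id sequence."""
print("<|endoftext|>" in custom_tokenizer.get_vocabulary())
print(custom_tokenizer(prompt))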
custom_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor(
tokenizer=custom_tokenizer,
sequence_length=SEQ_LEN,
)
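"""Quick look at what the preprocessor emits (assuming it returns the usual `(features, labels, sample_weights)` tuple of a CausalLMPreprocessor): token ids and padding mask should both be padded to `SEQ_LEN`."""
x, y, sw = custom_preprocessor(prompt)
print(x["token_ids"].shape, x["padding_mask"].shape)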
custom_backbone = keras_nlp.models.GPT2Backbone(
vocabulary_size=custom_preprocessor.tokenizer.vocabulary_size(),
num_layers=3,
num_heads=3,
hidden_dim=32,
intermediate_dim=64,
max_sequence_length=SEQ_LEN,
)
custom_gpt2_lm = keras_nlp.models.GPT2CausalLM(
backbone=custom_backbone,
preprocessor=custom_preprocessor,
)
custom_gpt2_lm.generate(prompt)
# >> Questo e' un esempio
"""In the above output, we have "rewritten" the input. We have not generated a sequence of `SEQ_LEN` words."""
custom_tokenizer.merges[-10:],list(custom_tokenizer.vocabulary.items())[-10:]
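"""Two more hedged checks: `end_token_id` is how the tokenizer exposes `<|endoftext|>`, and `generate()` accepts an explicit `max_length` argument in recent KerasNLP releases, so we can rule out the default generation length being the culprit."""
print(custom_tokenizer.end_token_id)
print(custom_gpt2_lm.generate(prompt, max_length=SEQ_LEN))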
"""This is a working example"""
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
"gpt2_base_en",
sequence_length=SEQ_LEN,
)
backbone = keras_nlp.models.GPT2Backbone(
vocabulary_size=preprocessor.tokenizer.vocabulary_size(),
num_layers=3,
num_heads=3,
hidden_dim=32,
intermediate_dim=64,
max_sequence_length=SEQ_LEN,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM(
backbone=backbone,
preprocessor=preprocessor,
)
print(gpt2_lm.generate(prompt))
# >> Questo e' un esempioGalleryblack tracked Diाupdatediege puppy oxygenía distur 347 fixtures allowanceIP goof Chocobo Elect NASAREF made Sunni compute
"""In the above output, we have we have generated a sequence of `SEQ_LEN` words"""
preprocessor.tokenizer.merges[-10:],list(preprocessor.tokenizer.vocabulary.items())[-10:]
EDIT:
I think I've solved it. The problem is related to the vocabulary. Does anyone know where I can find the specs for the vocabulary and merges parameters without digging into the code?
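For reference, here is what I am currently assuming about the expected format (this is the standard GPT-2 BPE layout, not something I found stated in the KerasNLP docs):

import json

# vocab.json: a single JSON object mapping token string -> integer id
with open("vocab.json") as f:
    vocab = json.load(f)
print(list(vocab.items())[:3])

# merges.txt: one BPE merge per line, the two sub-tokens separated by a space
# (the first line may be a "#version: ..." header)
with open("merges.txt") as f:
    print(f.read().splitlines()[:3])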