espnet2.speechlm.model.speechlm.speechlm_job.SpeechLMPreprocessor
class espnet2.speechlm.model.speechlm.speechlm_job.SpeechLMPreprocessor(multimodal_io, vocab, vocab_intervals, audio_input: str = 'continuous_audio', audio_output: str = 'discrete_audio', loss_region: str = 'assistant', batchfy_method: str = 'bucket')
Bases: object
Preprocessor for SpeechLM data handling.
Converts raw data into a model-ready format, handling tokenization, padding, and loss mask generation for multimodal sequences.
collate_fn(data_lst)
Batch multiple samples for training.
Processes each sample, pads sequences to the same length, and organizes continuous features by modality. Returns a dict ready for the model's forward pass.
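The padding step can be illustrated with a minimal sketch. This is not the ESPnet implementation: the helper name `pad_batch`, the pad ID of 0, and the `[frames][streams]` nested-list layout are all assumptions chosen for illustration.

```python
def pad_batch(seqs, pad_id=0):
    """Right-pad multi-stream sequences to a common length.

    seqs: list of sequences, each a list of frames, each frame a list
    with one token per stream (hypothetical layout).
    Returns (padded, lengths).
    """
    n_streams = len(seqs[0][0])
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        # Copy the real frames, then append all-pad frames up to max_len.
        frames = [list(frame) for frame in s]
        frames += [[pad_id] * n_streams for _ in range(max_len - len(s))]
        padded.append(frames)
    lengths = [len(s) for s in seqs]
    return padded, lengths


a = [[5, 0], [6, 7]]  # 2 frames, 2 streams
b = [[8, 9]]          # 1 frame, 2 streams
batch, lengths = pad_batch([a, b])
```

Returning the original lengths alongside the padded batch lets downstream code tell real frames from padding.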
diagnose(data)
Print human-readable representation of processed data for debugging.
Shows tokens, loss masks, and continuous feature info frame by frame.
find_length(key, data_dict)
Quickly compute sequence length without full preprocessing.
Counts tokens for BOS, role/modality markers, content, and EOS/EOT. Used for efficient batch construction.
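A rough sketch of this kind of counting follows. The exact number of marker frames per message is not stated here, so the sketch assumes one frame each for BOS, the role marker, the modality marker, and the closing EOS/EOT; `estimate_length` and its arguments are hypothetical names.

```python
def estimate_length(messages, content_lengths):
    """Estimate the sequence length without tokenizing any content.

    messages: list of (role, modality) pairs (hypothetical structure).
    content_lengths: number of content frames for each message.
    Assumes one frame per special marker.
    """
    length = 1  # BOS
    for (_role, _modality), n_content in zip(messages, content_lengths):
        length += 1          # role marker
        length += 1          # modality marker
        length += n_content  # content frames
        length += 1          # EOS/EOT closing the message
    return length


messages = [("user", "text"), ("assistant", "discrete_audio")]
total = estimate_length(messages, [12, 50])
```

Because only lengths are needed, a batch sampler can sort or bucket samples by this estimate before any expensive preprocessing runs.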
preprocessing(key, data_dict)
Convert a single raw data dict into training-ready format.
Applies the chat template, tokenizes content, adds special tokens, and creates loss masks. Returns a dict with sequences and features.
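How a role-based loss mask might be derived can be sketched as below, assuming the default `loss_region='assistant'` means that only frames belonging to assistant turns contribute to the loss. The function `build_loss_mask` and the `(role, n_frames)` segment format are assumptions, not the library's API.

```python
def build_loss_mask(segments, loss_region="assistant"):
    """Build a per-frame 0/1 loss mask from role-tagged segments.

    segments: list of (role, n_frames) pairs (hypothetical format).
    Frames whose role matches loss_region get 1, all others 0.
    """
    mask = []
    for role, n_frames in segments:
        keep = 1 if role == loss_region else 0
        mask.extend([keep] * n_frames)
    return mask


segments = [("system", 2), ("user", 3), ("assistant", 4)]
mask = build_loss_mask(segments)
```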
special_mask(value)
Create loss mask for special tokens (1 frame, multi-stream).
Only the first stream carries the actual value; the other streams are zero.
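The described layout can be sketched in plain Python. The stream count is passed explicitly here as an assumption, and `special_mask_sketch` is a hypothetical stand-in rather than the actual method.

```python
def special_mask_sketch(value, n_streams):
    """Return a 1-frame mask: `value` in stream 0, zeros elsewhere."""
    return [[value] + [0] * (n_streams - 1)]


mask_frame = special_mask_sketch(1, n_streams=4)
```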
special_token(token)
Convert special token string to multi-stream token array.
Places the token ID in the first stream and padding tokens in the other streams.
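A minimal sketch of this conversion, assuming the vocabulary is a plain list of token strings and the padding token is named `<|pad|>`. Both are illustrative assumptions, as are the token names and the helper `special_token_sketch`.

```python
def special_token_sketch(token, vocab, n_streams, pad_token="<|pad|>"):
    """Return a 1-frame token array for a special token.

    Stream 0 holds the token's ID; the remaining streams hold the pad ID.
    vocab is assumed to be a list of token strings.
    """
    token_id = vocab.index(token)
    pad_id = vocab.index(pad_token)
    return [[token_id] + [pad_id] * (n_streams - 1)]


toy_vocab = ["<|pad|>", "<|bos|>", "<|eos|>"]  # hypothetical tokens
bos_frame = special_token_sketch("<|bos|>", toy_vocab, n_streams=3)
```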
