espnet2.speechlm.model.speechlm.speechlm_job.SpeechLMPreprocessor

class espnet2.speechlm.model.speechlm.speechlm_job.SpeechLMPreprocessor(multimodal_io, vocab, vocab_intervals, audio_input: str = 'continuous_audio', audio_output: str = 'discrete_audio', loss_region: str = 'assistant', batchfy_method: str = 'bucket')

Bases: object

Preprocessor for SpeechLM data handling.

Converts raw data into model-ready format with tokenization, padding, and loss mask generation for multimodal sequences.

collate_fn(data_lst)

Batch multiple samples for training.

Processes each sample, pads the sequences to a common length, and organizes continuous features by modality. Returns a dict ready for the model's forward pass.
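The padding step described above can be illustrated with a minimal, pure-Python sketch. It is not the ESPnet implementation: the helper name `pad_batch`, the pad value `0`, and the output keys `seqs` and `seq_lengths` are all assumptions made for illustration, and the real method also handles multi-stream tokens and continuous features.

```python
def pad_batch(seqs, pad_value=0):
    """Right-pad variable-length token lists to the longest one (hypothetical helper)."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    # Append pad_value until every sequence reaches max_len.
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
    # Key names are illustrative, not the keys used by SpeechLMPreprocessor.
    return {"seqs": padded, "seq_lengths": lengths}


batch = pad_batch([[5, 6, 7], [8]])
```

Keeping the original lengths next to the padded sequences lets downstream code ignore the padded positions.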

diagnose(data)

Print a human-readable representation of processed data for debugging.

Shows tokens, loss masks, and continuous feature info frame by frame.

find_length(key, data_dict)

Quickly compute sequence length without full preprocessing.

Counts tokens for BOS, role/modality markers, content, and EOS/EOT. Used for efficient batch construction.
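The counting idea can be sketched as follows. The layout is an assumption, not taken from the source: one BOS per sequence, then one role marker, one modality marker, and one EOS/EOT token per message. The function name `estimate_length` and the `(role, modality, n_content_tokens)` tuple format are hypothetical.

```python
def estimate_length(messages):
    """Estimate sequence length from per-message content sizes (hypothetical helper).

    messages: list of (role, modality, n_content_tokens) tuples.
    """
    total = 1  # BOS, assumed to appear once per sequence
    for _role, _modality, n_content in messages:
        total += 2          # role marker + modality marker (assumed one each)
        total += n_content  # content tokens or frames
        total += 1          # EOS/EOT (assumed one per message)
    return total


n = estimate_length([("user", "text", 4), ("assistant", "audio", 10)])
```

Because only counts are added up, no tokenization is needed. This is what would make such a length estimate cheap enough to use when building batches.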

preprocessing(key, data_dict)

Convert a single raw data dict into training-ready format.

Applies the chat template, tokenizes content, adds special tokens, and creates loss masks. Returns a dict with sequences and features.
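How a loss mask might follow from `loss_region` is sketched below. It assumes that the default `loss_region='assistant'` means only frames belonging to assistant messages contribute to the loss. The helper `build_loss_mask` and its `(role, n_frames)` input format are hypothetical, and special tokens and multiple streams are left out.

```python
def build_loss_mask(segments, loss_region="assistant"):
    """Return one 0/1 flag per frame (hypothetical helper).

    segments: list of (role, n_frames) tuples in sequence order.
    """
    mask = []
    for role, n_frames in segments:
        # Assumption: a frame is trained on only if its role matches loss_region.
        keep = 1 if role == loss_region else 0
        mask.extend([keep] * n_frames)
    return mask


mask = build_loss_mask([("user", 3), ("assistant", 2)])
```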

special_mask(value)

Create loss mask for special tokens (1 frame, multi-stream).

Only the first stream carries the actual value; the other streams are zero.
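A minimal sketch of that shape, using nested lists in place of tensors. The function name `special_mask_sketch` and the explicit `num_streams` argument are assumptions; the real method presumably takes the stream count from the preprocessor's configuration.

```python
def special_mask_sketch(value, num_streams):
    """Build a 1-frame, multi-stream mask (hypothetical helper).

    The first stream holds `value`; every other stream is zero.
    """
    return [[value] + [0] * (num_streams - 1)]


frame_mask = special_mask_sketch(1, num_streams=4)
```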

special_token(token)

Convert special token string to multi-stream token array.

Places the token ID in the first stream and padding tokens in the other streams.
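The same layout applies to token IDs, as sketched below. It assumes the vocabulary is a plain list of strings and that a padding token exists in it. The token strings `<pad>` and `<bos>`, the helper name, and the `num_streams` argument are illustrative assumptions, not values from the source.

```python
def special_token_sketch(token, vocab, num_streams, pad_token="<pad>"):
    """Map a special token string to a 1-frame, multi-stream ID array (hypothetical helper)."""
    token_id = vocab.index(token)    # ID goes in the first stream
    pad_id = vocab.index(pad_token)  # remaining streams are filled with padding
    return [[token_id] + [pad_id] * (num_streams - 1)]


vocab = ["<pad>", "<bos>", "<eos>"]  # toy vocabulary for illustration only
frame = special_token_sketch("<bos>", vocab, num_streams=3)
```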