When using the `main` CLI program to transcribe audio whose transcript contains non-ASCII characters, and writing detailed JSON with `--output-json-full` while the DTW feature is enabled, the `transcription.tokens[*].text` field sometimes contains garbled characters (shown as `�`). However, the main `transcription.text` field appears correct.
Note that the single character `岸` is shown correctly in the `text` field but is split into 2 tokens. The UTF-8 encoding of `岸` is `E5 B2 B8`, which corresponds to the `text` fields of the 2 tokens: `E5 B2` and `B8`.
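To illustrate the byte split described above, here is a minimal Python sketch. It only reproduces the encoding arithmetic for `岸`. The 2+1 byte split mirrors the two tokens reported in this issue, and the variable names are illustrative, not taken from whisper.cpp.

```python
# The character from the issue and its UTF-8 bytes.
char = "岸"
encoded = char.encode("utf-8")
hex_bytes = encoded.hex(" ").upper()  # space-separated hex (Python 3.8+)

# Split the bytes the same way the two tokens were reported: E5 B2 | B8.
first, second = encoded[:2], encoded[2:]

# Decoding each fragment on its own cannot yield a valid character,
# so a lenient decoder substitutes U+FFFD (the replacement character).
first_text = first.decode("utf-8", errors="replace")
second_text = second.decode("utf-8", errors="replace")

# Decoding the concatenated bytes recovers the original character.
joined_text = (first + second).decode("utf-8")

print(hex_bytes, repr(first_text), repr(second_text), joined_text)
```

Neither fragment is valid UTF-8 on its own, which is consistent with `�` appearing in the per-token `text` fields while the full `text` field is correct.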
IMO the root cause is that the tokenizer sometimes splits a single non-ASCII character across multiple tokens.
I found a workaround for decoding this in Python: simply concatenate the consecutive tokens that are garbled. I'm not sure whether this will/should be fixed on the whisper.cpp side.
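The workaround code itself is not shown here, so the following is only a sketch of the general idea, not the author's script. It assumes the raw bytes of each token are available (for example by reading the output as bytes rather than as already-decoded text). If the fragments have already been replaced by `�`, the original bytes are lost and this approach does not apply. The function name `merge_split_tokens` is hypothetical.

```python
import codecs


def merge_split_tokens(token_bytes):
    """Group consecutive byte fragments so that each group is complete UTF-8.

    token_bytes: iterable of bytes objects, one per token (an assumption;
    the JSON described in this issue exposes text, not raw bytes).
    Returns a list of decoded strings, with split characters rejoined.
    """
    # An incremental decoder buffers an incomplete multi-byte sequence
    # instead of failing, until the remaining bytes arrive.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    merged = []
    pending = ""
    for fragment in token_bytes:
        pending += decoder.decode(fragment)
        buffered, _flags = decoder.getstate()
        if not buffered:
            # No partial character is waiting, so the group is complete.
            merged.append(pending)
            pending = ""
    # Flush whatever is left; a truncated tail becomes U+FFFD.
    pending += decoder.decode(b"", final=True)
    if pending:
        merged.append(pending)
    return merged


# Hypothetical token stream: "a", the two fragments of 岸, then "b".
example = merge_split_tokens([b"a", b"\xe5\xb2", b"\xb8", b"b"])
print(example)
```

One consequence of merging is that the rejoined character no longer maps to a single token, so any per-token timestamps would need to be combined as well (for instance, taking the start of the first fragment and the end of the last).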