Incorrect log output when --dtw is enabled #3111

Open
rotemdan opened this issue May 2, 2025 · 0 comments

rotemdan commented May 2, 2025

Without --dtw:

whisper-cli test2.wav
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:          CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: no GPU found
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 4 / 8 | WHISPER : COREML = 0 | OPENVINO = 0 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: processing 'test2.wav' (5043374 samples, 315.2 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:02.520]   (upbeat music)
[00:00:02.520 --> 00:00:03.360]   Guess what?
[00:00:03.360 --> 00:00:09.440]   So, remember how you created a fire pool for me.
[00:00:09.440 --> 00:00:11.560]   - Kind of remember that.
[00:00:11.560 --> 00:00:12.720]   - Yeah, from the...
[00:00:12.720 --> 00:00:13.560]   - Wasn't it real?
[00:00:13.560 --> 00:00:14.840]   - Up on the top of the mezzanine down.
[00:00:14.840 --> 00:00:16.720]   No, no, you made a fire pool.
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:28.640 --> 00:00:30.960]   - Yeah, they look really, really important.
[00:00:30.960 --> 00:00:33.560]   All right, cool, I'm glad we agree.
[00:00:33.560 --> 00:00:38.220]   (upbeat music)
[00:00:38.220 --> 00:00:44.080]   Private internet access is the VPN service
[00:00:44.080 --> 00:00:46.180]   that encrypts all of your internet traffic
[00:00:46.180 --> 00:00:48.640]   and uses a safe, protected IP.
[00:00:48.640 --> 00:00:51.160]   And it's got tons of other useful features as well.
[00:00:51.160 --> 00:00:53.920]   Check it out now at the link below.
[00:00:53.920 --> 00:00:56.120]   All right, so, plan A.
[00:00:56.120 --> 00:00:58.600]   How long have you spent developing this plan?
[00:00:58.600 --> 00:01:00.960]   - Well, extend this, we'll have this thing here.
[00:01:00.960 --> 00:01:03.240]   We'll have people holding onto this.
[00:01:03.240 --> 00:01:04.080]   - Which people?
[00:01:04.080 --> 00:01:05.600]   - We can find people on the street
[00:01:05.600 --> 00:01:10.400]   and we'll also clamp it onto the railing up there.
[00:01:10.400 --> 00:01:13.360]   - Can I reveal something that you probably don't know?
[00:01:13.360 --> 00:01:15.320]   - There's pieces of information you don't have.
[00:01:15.320 --> 00:01:18.400]   - Okay, so when we put in these stairs,
[00:01:18.400 --> 00:01:22.760]   we'd ever actually intended for anyone to walk down them.
[00:01:23.920 --> 00:01:25.420]   These are the way up.
[00:01:25.420 --> 00:01:31.440]   Whoa, I knew that.
[00:01:31.440 --> 00:01:34.880]   Firepool installation, baby.
[00:01:34.880 --> 00:01:40.600]   Wow, this is a little nerve-wracking, mind me.
[00:01:40.600 --> 00:01:44.360]   So, instead of having this long pole go this way,
[00:01:44.360 --> 00:01:47.160]   like this way, we can attach it
[00:01:47.160 --> 00:01:50.880]   to several of these posts instead.
[00:01:50.880 --> 00:01:53.040]   Okay, so let's saw the Yolo plan.
[00:01:54.000 --> 00:01:55.400]   What do you mean the Yolo plan?
[00:01:55.400 --> 00:01:58.120]   - I want something simpler and faster than this.
[00:01:58.120 --> 00:01:59.360]   - Are you serious?
[00:01:59.360 --> 00:02:02.960]   - Maybe fewer people.
[00:02:02.960 --> 00:02:06.520]   One person will attach weights to it.
[00:02:06.520 --> 00:02:08.760]   We'll tape weights to this and rope.
[00:02:08.760 --> 00:02:09.600]   We have rope.
[00:02:09.600 --> 00:02:10.600]   - All right.
[00:02:10.600 --> 00:02:11.520]   - We're not doing this.
[00:02:11.520 --> 00:02:12.360]   - I'm not doing this.
[00:02:12.360 --> 00:02:13.880]   - Approached. - No, no.
[00:02:13.880 --> 00:02:15.800]   - You're not serious about the Yolo, right?
[00:02:15.800 --> 00:02:17.000]   - Yeah, yeah, Yolo plan.
[00:02:17.000 --> 00:02:18.680]   - No, you're not, no, no, you're not.
[00:02:18.680 --> 00:02:19.520]   - Let's give him stuff out of the way.
[00:02:19.520 --> 00:02:21.560]   - We're not Yoloing, I'm gonna tell the bond.
[00:02:21.560 --> 00:02:24.160]   - I don't wanna lose my job because he died.
[00:02:24.160 --> 00:02:27.760]   - The fire pole is going in today.
[00:02:27.760 --> 00:02:29.840]   - What happened to the safety concerns?
[00:02:29.840 --> 00:02:31.000]   Do we have insurance for this?
[00:02:31.000 --> 00:02:33.600]   - I think if anyone came in and inspected our implementation,
[00:02:33.600 --> 00:02:35.640]   they'd be really happy with it.
[00:02:35.640 --> 00:02:37.600]   - Oh yeah? - Yeah.
[00:02:37.600 --> 00:02:39.280]   - We'll have a cross brace.
[00:02:39.280 --> 00:02:40.680]   We'll attach it to the post.
[00:02:40.680 --> 00:02:42.840]   - Yeah, this all sounds incredibly stupid.
[00:02:42.840 --> 00:02:46.360]   - I feel like I should be wearing a hard hat.
[00:02:46.360 --> 00:02:50.200]   - Yeah, do we have a hard hat?
[00:02:51.200 --> 00:02:52.600]   (beep)

[... truncated ...]

With --dtw (segment lines are printed repeatedly and out of chronological order):

whisper-cli test2.wav --dtw base.en
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 1
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:          CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: no GPU found
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: alignment heads masks size = 160 B
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =  109.79 MB

system_info: n_threads = 4 / 8 | WHISPER : COREML = 0 | OPENVINO = 0 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: processing 'test2.wav' (5043374 samples, 315.2 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:14.840 --> 00:00:16.720]   No, no, you made a fire pool.
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:13.560 --> 00:00:14.840]   - Up on the top of the mezzanine down.
[00:00:14.840 --> 00:00:16.720]   No, no, you made a fire pool.
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:12.720 --> 00:00:13.560]   - Wasn't it real?
[00:00:13.560 --> 00:00:14.840]   - Up on the top of the mezzanine down.
[00:00:14.840 --> 00:00:16.720]   No, no, you made a fire pool.
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:11.560 --> 00:00:12.720]   - Yeah, from the...
[00:00:12.720 --> 00:00:13.560]   - Wasn't it real?
[00:00:13.560 --> 00:00:14.840]   - Up on the top of the mezzanine down.
[00:00:14.840 --> 00:00:16.720]   No, no, you made a fire pool.
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:09.440 --> 00:00:11.560]   - Kind of remember that.
[00:00:11.560 --> 00:00:12.720]   - Yeah, from the...
[00:00:12.720 --> 00:00:13.560]   - Wasn't it real?
[00:00:13.560 --> 00:00:14.840]   - Up on the top of the mezzanine down.
[00:00:14.840 --> 00:00:16.720]   No, no, you made a fire pool.
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:03.360 --> 00:00:09.440]   So, remember how you created a fire pool for me.
[00:00:09.440 --> 00:00:11.560]   - Kind of remember that.
[00:00:11.560 --> 00:00:12.720]   - Yeah, from the...
[00:00:12.720 --> 00:00:13.560]   - Wasn't it real?
[00:00:13.560 --> 00:00:14.840]   - Up on the top of the mezzanine down.
[00:00:14.840 --> 00:00:16.720]   No, no, you made a fire pool.
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:02.520 --> 00:00:03.360]   Guess what?
[00:00:03.360 --> 00:00:09.440]   So, remember how you created a fire pool for me.
[00:00:09.440 --> 00:00:11.560]   - Kind of remember that.
[00:00:11.560 --> 00:00:12.720]   - Yeah, from the...
[00:00:12.720 --> 00:00:13.560]   - Wasn't it real?
[00:00:13.560 --> 00:00:14.840]   - Up on the top of the mezzanine down.
[00:00:14.840 --> 00:00:16.720]   No, no, you made a fire pool.
[00:00:16.720 --> 00:00:17.720]   - Is it real?
[00:00:17.720 --> 00:00:19.200]   Is it an illusion?
[00:00:19.200 --> 00:00:21.400]   Only the editors will know.
[00:00:21.400 --> 00:00:22.600]   - By the end of the day,
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?


whisper_print_timings:     load time =   183.36 ms
whisper_print_timings:     fallbacks =   2 p /   9 h
whisper_print_timings:      mel time =   192.15 ms
whisper_print_timings:   sample time =  5801.55 ms /  9386 runs (    0.62 ms per run)
whisper_print_timings:   encode time = 13922.25 ms /    12 runs ( 1160.19 ms per run)
whisper_print_timings:   decode time =  1053.59 ms /   197 runs (    5.35 ms per run)
whisper_print_timings:   batchd time = 18152.11 ms /  9123 runs (    1.99 ms per run)
whisper_print_timings:   prompt time =  4975.55 ms /  3767 runs (    1.32 ms per run)
whisper_print_timings:    total time = 44862.12 ms

The issue did not occur in older versions (the one I had was from October 2024).

This seems to be a regression of some sort. It happens both on the latest code and on the latest stable build (1.7.5).

The t_dtw timestamps themselves and the standard timestamps in the JSON object do seem to be accurate (Echogarden converts the t_dtw values to timestamps when this option is enabled, so I can test them directly). The problem may be limited to the printed log output.
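To make the symptom easier to check across builds, here is a minimal sketch of a script that scans captured `whisper-cli` console output for segment lines that repeat or jump backwards in time. It assumes the segment line format shown in the logs above (`[hh:mm:ss.mmm --> hh:mm:ss.mmm]   text`). The function names (`parse_segments`, `find_anomalies`) are hypothetical helpers written for this illustration and are not part of whisper.cpp.

```python
import re

# Matches a segment line as printed in the logs above, e.g.
# "[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?"
SEGMENT_RE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\s*(.*)$"
)


def to_ms(h, m, s, ms):
    """Convert hh, mm, ss, mmm strings to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_segments(log_text):
    """Extract (start_ms, end_ms, text) tuples, ignoring non-segment lines."""
    segments = []
    for line in log_text.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match:
            g = match.groups()
            segments.append((to_ms(*g[0:4]), to_ms(*g[4:8]), g[8]))
    return segments


def find_anomalies(segments):
    """Return (index, reason) for each segment that repeats or goes back in time."""
    anomalies = []
    seen = set()
    prev_start = None
    for i, seg in enumerate(segments):
        if seg in seen:
            anomalies.append((i, "duplicate"))
        elif prev_start is not None and seg[0] < prev_start:
            anomalies.append((i, "out of order"))
        seen.add(seg)
        prev_start = seg[0]
    return anomalies


if __name__ == "__main__":
    # First four segment lines of the --dtw run quoted above.
    sample = """\
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:25.360 --> 00:00:26.640]   that I can slide down.
[00:00:26.640 --> 00:00:28.640]   - You know, I have things to do, right?
[00:00:22.600 --> 00:00:25.360]   I want you to have a real fire pool installed
"""
    for index, reason in find_anomalies(parse_segments(sample)):
        print(index, reason)
```

Run against the output without `--dtw`, this should report nothing; against the `--dtw` output it flags the repeated and backwards-stepping lines, which could serve as a quick regression check when bisecting between the October 2024 build and 1.7.5.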
