vad : add initial Voice Activity Detection (VAD) support #3065
Conversation
Are there plans to add vad support for
I think it would be nice to get an initial version merged first as this PR is quite large as it is. I can then start looking at adding support to the server, and hopefully during that time people can start trying this out and see what works and does not work. I'm adding the remaining options to whisper-cli now and after that this should be ready for review.
I am doing some initial testing using long audio and I am wondering if we can somehow align the output timestamps with the original audio? Right now, I think that the audio that is cut out is not taken into account, so the final timestamps are not aligned with the input audio and it is a bit difficult to evaluate the results.
Ah yes, currently only the samples that are detected to contain speech are passed to whisper_full. With the latest commit the output is now more aligned with the original audio input.

gb0 without VAD:

./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/gb0.wav
...
[00:00:00.000 --> 00:00:03.240] Good morning, this Tuesday is Election Day.
[00:00:03.240 --> 00:00:06.000] After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.640] the time has come for Americans to make important decisions
[00:00:08.640 --> 00:00:10.140] about our nation's future.
[00:00:10.140 --> 00:00:13.740] I encourage all Americans to go to the polls and vote.
[00:00:13.740 --> 00:00:16.140] Election season brings out the spirit of competition
[00:00:16.140 --> 00:00:18.080] between our political parties.
[00:00:18.080 --> 00:00:20.280] And that competition is an essential part
[00:00:20.280 --> 00:00:21.780] of a healthy democracy.
[00:00:21.780 --> 00:00:23.520] But as the campaigns come to a close,
[00:00:23.520 --> 00:00:25.980] Republicans, Democrats, and independents
[00:00:25.980 --> 00:00:29.120] can find common ground on at least one point.
[00:00:29.120 --> 00:00:31.560] Our system of representative democracy
[00:00:31.560 --> 00:00:34.440] is one of America's greatest strengths.
[00:00:34.440 --> 00:00:36.240] The United States was founded on the belief
[00:00:36.240 --> 00:00:38.240] that all men are created equal.
[00:00:38.240 --> 00:00:41.440] Every election day, millions of Americans of all races,
[00:00:41.440 --> 00:00:43.440] religions, and backgrounds step into voting
[00:00:43.440 --> 00:00:45.280] booths throughout the nation.
[00:00:45.280 --> 00:00:47.780] Whether they are rich or poor, old or young,
[00:00:47.780 --> 00:00:50.680] each of them has an equal share in choosing the path
[00:00:50.680 --> 00:00:52.440] that our country will take.
[00:00:52.440 --> 00:00:54.920] And every ballot they cast is a reminder
[00:00:54.920 --> 00:00:58.280] that our founding principles are alive and well.
[00:00:58.280 --> 00:00:59.760] Voting is one of the great privileges
[00:00:59.760 --> 00:01:01.760] of American citizenship.
[00:01:01.760 --> 00:01:04.520] And it has always required brave defenders.
[00:01:04.520 --> 00:01:06.000] As you head to the polls next week,
[00:01:06.000 --> 00:01:08.400] remember the sacrifices that have been made
[00:01:08.400 --> 00:01:11.040] by generations of Americans in uniform
[00:01:11.040 --> 00:01:13.000] to preserve our way of life.
[00:01:13.000 --> 00:01:14.840] From Bunker Hill to Baghdad,
[00:01:14.840 --> 00:01:16.740] the men and women of American armed forces
[00:01:16.740 --> 00:01:19.940] have been devoted guardians of our democracy.
[00:01:19.940 --> 00:01:21.840] All of us owe them and their families
[00:01:21.840 --> 00:01:25.240] a special debt of gratitude on Election Day.
[00:01:25.240 --> 00:01:27.520] Americans should also remember the important example
[00:01:27.520 --> 00:01:30.080] that our election set throughout the world.
[00:01:30.080 --> 00:01:32.080] Young democracies from Georgia and Ukraine
[00:01:32.080 --> 00:01:34.560] to Afghanistan and Iraq can look to the United States
[00:01:34.560 --> 00:01:37.520] for proof that self-government can endure.
[00:01:37.520 --> 00:01:40.400] And nations that still live under tyranny and oppression
[00:01:40.400 --> 00:01:44.080] can find hope and inspiration in our commitment to liberty.
[00:01:44.080 --> 00:01:45.200] For more than two centuries,
[00:01:45.200 --> 00:01:47.120] Americans have demonstrated the ability
[00:01:47.120 --> 00:01:49.600] of free people to choose their own leaders.
[00:01:49.600 --> 00:01:51.880] Our nation has flourished because of its commitment
[00:01:51.880 --> 00:01:54.640] to trusting the wisdom of our citizenry.
[00:01:54.640 --> 00:01:57.200] In this year's election, we will see this tradition
[00:01:57.200 --> 00:02:00.280] continue, and we will be reminded once again
[00:02:00.280 --> 00:02:02.640] that we are blessed to live in a free nation
[00:02:02.640 --> 00:02:05.520] guided by the will of the people.
[00:02:05.520 --> 00:02:06.720] Thank you for listening.

gb0 with VAD:

./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/gb0.wav --vad --vad-threshold 0.5 --vad-model models/for-tests-silero-v5.1.2-ggml.bin
...
[00:00:00.000 --> 00:00:03.280] Good morning, this Tuesday is Election Day.
[00:00:03.280 --> 00:00:06.000] After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.600] the time has come for Americans to make important decisions
[00:00:08.600 --> 00:00:10.200] about our nation's future.
[00:00:10.200 --> 00:00:13.790] Encourage all Americans to go to the polls and vote.
[00:00:13.790 --> 00:00:16.120] Election season brings out the spirit of competition
[00:00:16.120 --> 00:00:18.060] between our political parties.
[00:00:18.060 --> 00:00:20.230] And that competition is an essential part
[00:00:20.230 --> 00:00:21.820] of a healthy democracy.
[00:00:21.820 --> 00:00:23.550] But as the campaigns come to a close,
[00:00:23.550 --> 00:00:25.960] Republicans, Democrats, and independents
[00:00:25.960 --> 00:00:29.180] can find common ground on at least one point.
[00:00:29.180 --> 00:00:31.530] Our system of representative democracy
[00:00:31.530 --> 00:00:34.470] is one of America's greatest strengths.
[00:00:34.470 --> 00:00:36.250] The United States was founded on the belief
[00:00:36.250 --> 00:00:38.310] that all men are created equal.
[00:00:38.310 --> 00:00:40.740] Every election day, millions of Americans
[00:00:40.740 --> 00:00:42.630] of all races, religions, and backgrounds
[00:00:42.630 --> 00:00:45.340] step into voting booths throughout the nation.
[00:00:45.340 --> 00:00:48.530] Whether they are rich or poor, old or young, each of them
[00:00:48.530 --> 00:00:50.660] has an equal share in choosing the path
[00:00:50.660 --> 00:00:52.480] that our country will take.
[00:00:52.480 --> 00:00:54.910] And every ballot they cast is a reminder
[00:00:54.910 --> 00:00:58.330] that our founding principles are alive and well.
[00:00:58.330 --> 00:00:59.760] Voting is one of the great privileges
[00:00:59.760 --> 00:01:01.810] of American citizenship.
[00:01:01.810 --> 00:01:04.550] And it is always required brave defenders.
[00:01:04.550 --> 00:01:06.050] As you head to the polls next week,
[00:01:06.050 --> 00:01:08.380] remember the sacrifices that have been made
[00:01:08.380 --> 00:01:11.580] by generations of Americans in uniform to preserve
[00:01:11.580 --> 00:01:13.010] our way of life.
[00:01:13.010 --> 00:01:15.450] From Bunker Hill to Baghdad, the men and women
[00:01:15.450 --> 00:01:17.030] of American armed forces have been
[00:01:17.030 --> 00:01:19.990] devoted guardians of our democracy.
[00:01:19.990 --> 00:01:21.790] All of us owe them and their families
[00:01:21.790 --> 00:01:25.260] a special debt of gratitude on election day.
[00:01:25.260 --> 00:01:27.520] Americans should also remember the important example
[00:01:27.520 --> 00:01:30.090] that our elections set throughout the world.
[00:01:30.090 --> 00:01:32.070] Young democracies from Georgia and Ukraine
[00:01:32.070 --> 00:01:34.520] to Afghanistan and Iraq can look to the United States
[00:01:34.520 --> 00:01:37.450] for proof that self-government can endure.
[00:01:37.450 --> 00:01:40.400] And nations that still live under tyranny and oppression
[00:01:40.400 --> 00:01:44.080] can find hope and inspiration in our commitment to liberty.
[00:01:44.080 --> 00:01:45.690] For more than two centuries, Americans
[00:01:45.690 --> 00:01:47.730] have demonstrated the ability of free people
[00:01:47.730 --> 00:01:49.600] to choose their own leaders.
[00:01:49.600 --> 00:01:51.830] Our nation has flourished because of its commitment
[00:01:51.830 --> 00:01:54.630] to trusting the wisdom of our citizenry.
[00:01:54.630 --> 00:01:58.460] In this year's election, we will see this tradition continue.
[00:01:58.460 --> 00:02:00.220] And we will be reminded once again
[00:02:00.220 --> 00:02:02.590] that we are blessed to live in a free nation
[00:02:02.590 --> 00:02:05.490] guided by the will of the people.
[00:02:05.490 --> 00:02:06.650] Thank you for listening.
This commit adds support for Voice Activity Detection (VAD). When enabled, this feature processes the audio input and detects speech segments. This information is then used to reduce the number of samples that need to be processed by whisper_full.

This initial support is based on the Silero VAD model, which needs to be converted to GGML format:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

There is a test that tests the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
  ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
  ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```

Resolves: ggml-org#3003
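For illustration, a minimal Python sketch of the idea (the function and segment representation are hypothetical, not the whisper.cpp API): the VAD yields speech sample ranges, and only the samples inside those ranges are passed on for transcription.

```python
def filter_speech(samples, segments):
    """Keep only the samples covered by (start, end) speech segments.

    `segments` holds half-open sample ranges as detected by the VAD;
    everything outside them (silence) is dropped before transcription.
    This is an illustrative sketch, not the actual implementation.
    """
    kept = []
    for start, end in segments:
        kept.extend(samples[start:end])
    return kept

audio = [0.0] * 16000                    # one second of 16 kHz "audio"
speech = [(1000, 4000), (9000, 12000)]   # two detected speech ranges
filtered = filter_speech(audio, speech)
print(len(filtered))                     # 6000 of 16000 samples remain
```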
Example of format:
```console
$ ./build/bin/whisper-cli --help
usage: ./build/bin/whisper-cli [options] file0 file1 ...
supported audio formats: flac, mp3, ogg, wav

options:
  -h,        --help                          [default] show this help message and exit
...

Voice Activity Detection (VAD) options:
  -v,        --vad                           [false  ] enable Voice Activity Detection (VAD)
  -vm FNAME, --vad-model FNAME               [       ] VAD model path
  -vt N,     --vad-threshold N               [0.50   ] VAD threshold for speech recognition
  -vs N,     --vad_window_size_samples N     [512    ] VAD window size
  -vspd N,   --vad_min_speech_duration_ms N  [250    ] VAD min speech duration
  -vsd N,    --vad_min_silence_duration_ms N [100    ] VAD min silence duration
  -vmsd N,   --vad_max_speech_duration_s N   [FLT_MAX] VAD max speech duration
  -vp N,     --vad_speech_pad_ms N           [30     ] VAD speech padding
  -vo N,     --vad_samples_overlap N         [0.10   ] VAD samples overlap size
```
The main reason for the separate VAD options section is that the VAD options are longer and made the rest look a little ugly.
This commit adds a job to the CI pipeline to test the VAD model. This will only test the VAD model in isolation, that is, it does not test whisper_full.
This commit adds a mapping of the original audio timestamps to the timestamps of the segments in the VAD (Voice Activity Detection) process. The motivation for this change is that when we process the original audio signal and only pass the speech segments to whisper_full, the timestamps that whisper returns when calling functions like whisper_full_get_segment_t0 are the timestamps for the "VAD" segments and not the original audio. The values are not identical to the timestamps processed without VAD enabled, but they are close, and hopefully close enough.
Free filtered samples after VAD processing.
This commit extracts the VAD processing from whisper_full_with_state into a separate function to make the code more readable.
This commit modifies the VAD code to only use the CPU backend for VAD processing. There is currently an issue with the GPU backend which I need to investigate further. It is also not clear to me if running the VAD processing on a GPU is actually beneficial.
This commit fixes a mistake in the usage of strcmp where I missed the actual comparison evaluation. This commit also changes the sampling method from WHISPER_SAMPLING_GREEDY to WHISPER_SAMPLING_BEAM_SEARCH to match the default CLI behavior.
I've discovered an issue when running the VAD processing on a GPU (CUDA) backend where it produces different probabilities than the CPU backend does. I'm looking into this now. I'm also not sure if there is a benefit to using a GPU for the VAD processing. The tensors are quite small and we process samples sequentially in chunks of 512 (padded with reflective padding of 64 samples left and right for a total of 640 samples) as it is right now. The overhead of the memory transfer, kernel launch, synchronization, etc. might cost more than we actually gain from using a GPU. I'm also struggling to get this to work for CUDA, and before spending more time trying to sort this out I'd like to hear what others think: should I pursue this, or should we limit the VAD processing to CPU only, at least for now?
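The windowing described above can be sketched in a few lines of Python (the numbers come from the comment; the helper is illustrative, not the actual implementation): each 512-sample chunk is mirrored 64 samples on each side, so 640 samples are evaluated per step.

```python
def reflect_pad(chunk, pad):
    """Mirror `pad` samples on each side of a chunk (edge sample not repeated).

    Illustrative sketch of reflective padding as described in the comment.
    """
    left  = chunk[pad:0:-1]        # chunk[1..pad] reversed
    right = chunk[-2:-pad - 2:-1]  # chunk[-pad-1..-2] reversed
    return left + chunk + right

window = list(range(512))     # one VAD window of 512 samples
padded = reflect_pad(window, 64)
print(len(padded))            # 640 samples are evaluated per chunk
```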
I check for repeating timecodes with 4 or more lines repeating. Usually they are around 30 seconds in duration. If less than 9 seconds, I automatically delete them. Then I adjust the timecodes to evenly distribute the 20 or 30 seconds (whatever it is) across each subtitle line that has a repeating timecode. Once done, I insert them back into the original VTT with the adjusted timecodes.

So with VAD I'd imagine you have to make note of the timecodes removed, calculate the duration for each chunk, and reinsert them. However, this is more complicated because it would have to shift the timecodes for the subs after each section that was inserted. I think that's not practical, but I suppose it would be doable. Myself, I just remove silence and/or hiss with ffmpeg when making an audiobook, then use whisper.cpp to transcribe that, avoiding any silent sections. Pretty much what VAD does, I'm guessing.

Example below; the (30) is the duration of the repeating timecodes:

Created file: 12-12-06.953-->12-12-36.960(31).vtt
Created file: 12-22-34.770-->12-23-07.097(33).vtt
Created file: 14-14-52.974-->14-15-22.981(31).vtt
Created file: 14-52-01.393-->14-52-01.893(1).vtt
Created file: 15-09-02.124-->15-09-07.131(6).vtt
Created file: 15-09-07.131-->15-09-13.131(6).vtt
Created file: 15-23-20.611-->15-23-44.642(25).vtt
Press ENTER to delete any timecodes with a duration less than 10 seconds
Deleted 02-03-54.118-->02-04-02.948(9).vtt
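The redistribution step described above can be sketched as follows (a hypothetical helper, not tied to any particular subtitle tool): one repeated timecode window is split evenly across the lines that shared it.

```python
def redistribute(start, end, n_lines):
    """Evenly split the window [start, end] (in seconds) across n_lines cues.

    Illustrative sketch: each subtitle line that shared one repeated
    timecode gets its own equal slice of the original window.
    """
    step = (end - start) / n_lines
    return [(start + i * step, start + (i + 1) * step) for i in range(n_lines)]

# Four subtitle lines that all carried the same 30-second timecode:
for t0, t1 in redistribute(0.0, 30.0, 4):
    print(f"{t0:>5.1f} --> {t1:>5.1f}")  # each line gets a distinct 7.5 s cue
```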
```cpp
ggml_backend_tensor_set(h_in, h_state.data(), 0, hidden_dim * sizeof(float));
ggml_backend_tensor_set(c_in, c_state.data(), 0, hidden_dim * sizeof(float));

if (!ggml_graph_compute_helper(sched, gf, vctx->n_threads)) {
    WHISPER_LOG_ERROR("%s: failed to compute VAD graph\n", __func__);
    break;
}

// Update the LSTM states
ggml_backend_tensor_get(h_out, h_state.data(), 0, hidden_dim * sizeof(float));
ggml_backend_tensor_get(c_out, c_state.data(), 0, hidden_dim * sizeof(float));
```
Here we get the resulting h_state and c_state and set them as inputs on the next iteration. This get/set should be possible to avoid by allocating the h_state and c_state in a separate buffer - similar to how we allocate the KV cache tensors in llama.cpp. This way their content will be persisted across the evaluation.
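To illustrate the pattern being suggested, here is a toy Python sketch (the update rule is made up; only the state handling matters): the recurrent state lives in buffers allocated once and updated in place across chunks, instead of being copied out of the graph and back in on every iteration.

```python
def vad_step(chunk, h_state, c_state):
    """Toy recurrent update that mutates the persistent state in place.

    Illustrative only: stands in for one evaluation of the VAD graph,
    not the real ggml code.
    """
    for i in range(len(h_state)):
        h_state[i] = 0.5 * h_state[i] + 0.5 * chunk[i % len(chunk)]
        c_state[i] = 0.9 * c_state[i] + 0.1 * h_state[i]
    return sum(h_state) / len(h_state)  # stand-in for a speech probability

hidden_dim = 4
h = [0.0] * hidden_dim  # allocated once; persists across all chunks
c = [0.0] * hidden_dim
probs = [vad_step(chunk, h, c) for chunk in ([1.0, 1.0], [0.0, 0.0])]
print(probs)  # state from the first chunk carries into the second
```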
```cpp
// Check if the timestamp falls within this segment.
if (t0 >= segment.vad_start && t0 <= segment.vad_end) {
    float proportion = 0.0f;
    if (segment.vad_end > segment.vad_start) {
        proportion = (t0 - segment.vad_start) / (segment.vad_end - segment.vad_start);
    }
    float orig_t0 = segment.orig_start + proportion * (segment.orig_end - segment.orig_start);
    return (int64_t)(orig_t0 * 100);
}
```
I am not sure if this is the best logic for restoring the timestamps because it is scaling the speech back to the original length of the segment.

For example, if we consider the following 30 seconds of audio (. - silence, x - speech):

.............................................................xxxxx
hello

This will produce the final transcribed segment as something like:

[00:00:00.000 --> 00:00:30.000] hello

While it would be better to produce the more accurate:

[00:00:25.000 --> 00:00:30.000] hello
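The difference can be put in numbers with a small Python sketch (values taken from the example above; both helpers are illustrative, not the whisper.cpp code): proportional scaling stretches 5 seconds of speech over the whole 30-second segment, while an offset-preserving mapping keeps it at 25-30 s.

```python
def scale_to_segment(t, vad_start, vad_end, orig_start, orig_end):
    """Proportional mapping: scale VAD-time t over the whole original segment."""
    p = (t - vad_start) / (vad_end - vad_start)
    return orig_start + p * (orig_end - orig_start)

def offset_mapping(t, vad_start, speech_orig_start):
    """Offset mapping: pin VAD-time t to where the speech was actually detected."""
    return speech_orig_start + (t - vad_start)

# "hello" occupies 0-5 s of the VAD output; the original segment is 0-30 s,
# with the speech detected starting at 25 s into the original audio.
print(scale_to_segment(0.0, 0.0, 5.0, 0.0, 30.0),
      scale_to_segment(5.0, 0.0, 5.0, 0.0, 30.0))   # 0.0 30.0 (stretched)
print(offset_mapping(0.0, 0.0, 25.0),
      offset_mapping(5.0, 0.0, 25.0))               # 25.0 30.0 (accurate)
```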
We should eventually figure out why the GPU inference fails, but we can do it later. For now, we should add a way to easily enable GPU VAD, and have it disabled by default.
This commit enables GPU support for VAD processing but defaults to false, as there is an issue with this that I've yet to figure out. I'll revisit this in the near future.
vad can't process this file.