Real-time streaming #141
Conversation
The performance gain here is absurd. Is there anything I can pitch in to help finalize this PR @ggerganov? I could not quite follow the issue with the last commit "...stitch encoder outputs together" 😅
The "stitching" is basically instead of running 10 seconds of audio through the encoder at one pass, run for example 5 x 2 second chunks and combine the results in the cross-attention layer to get effectively what we would have gotten with 10 seconds directly. This would allow to process audio more often and be more real-time. The PR missies an option to enable/disable the encoder truncation - I currently hardcoded the values. It's not difficult to finalise, but I want to see how I will use it in the streaming examples - probably will get a better idea for the API. |
- Force the entire audio chunk to be transcribed into a single segment
- Used to limit the number of tokens in a segment. Useful to combat word repetition when using partial encoder context
- Used to overwrite the audio context size of the Encoder. For example, setting `audio_ctx = 512` will make it run about 3 times faster, processing about 10s of audio instead of 30s. The transcription quality drops, but this can be used for real-time streaming purposes where performance is important
- Controls the max tokens per segment for the `stream` example
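These options appear to map onto fields of `whisper_full_params` in the C API (`single_segment`, `max_tokens`, `audio_ctx`; names as in recent whisper.cpp, possibly different in this PR's revision). A minimal sketch of enabling them, with the model path and values as placeholders:

```cpp
#include "whisper.h"

#include <vector>

int main() {
    // illustrative model path
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en.bin");
    if (!ctx) return 1;

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    params.single_segment = true; // whole audio chunk -> one segment
    params.max_tokens     = 32;   // cap tokens per segment to fight repetition
    params.audio_ctx      = 512;  // partial encoder context: ~10s window, ~3x faster

    // 2 seconds of silence at 16 kHz, standing in for a real audio chunk
    std::vector<float> pcmf32(2*16000, 0.0f);

    if (whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size()) != 0) {
        whisper_free(ctx);
        return 1;
    }

    whisper_free(ctx);
    return 0;
}
```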
@meakbiyik
Wow, this is great - thanks a lot @ggerganov! A quick follow-up question: would you recommend the 2x speed-up or reducing the audio context size? Or can I mix them - what was your experience? I do not quite understand why reducing the audio context should also reduce transcription accuracy, so I cannot be sure 😅 Also, interestingly, I have noticed that lowering the step size improves the transcription considerably, so much so that a low step size + the base model is better than a 2x step size + the small model. Is there anything going on behind the scenes that can explain this phenomenon? Does the option […]?
The 2x speed-up does not seem very useful in my experience, so I don't recommend using it. The step-size observation is strange - as long as your hardware is capable of processing the data in real time, the bigger model should always be better, regardless of the step size. Regarding the […]
Interesting, but why is there less data, particularly if the […]? On step size, I observed that the transcription is "refined" every time the model reruns on data it has already seen, and more refinements are better, which makes sense if the model has access to the current context of size […].
Yeah, actually you have a good point - for a fixed […]
Yes, correct. For example, if you have […]. The […]
Perfect, thanks a lot - all of this makes full sense! Will try to do that -kc thing quite soon. Buuut I have one final follow-up just to understand it better: what happens if length > audio_context? Does the model trim from the end? Or is there downsampling going on?
Currently, it will trim from the end:
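In effect (a paraphrase of the behavior, not the literal whisper.cpp source), with a reduced `audio_ctx` the encoder only iterates over the first `audio_ctx` positions of the mel spectrogram, so everything past that point in the window is simply never encoded. A runnable toy illustration:

```cpp
#include <cstdio>

int main() {
    const int n_audio_ctx_full = 1500; // full 30s context of the model
    const int audio_ctx        = 512;  // reduced context requested by the user (~10s)

    // The encoder walks only the first `n_ctx` positions of the mel
    // spectrogram; positions beyond that are never encoded, i.e. the
    // input is trimmed from the end.
    const int n_ctx = audio_ctx > 0 ? audio_ctx : n_audio_ctx_full;

    printf("encoded %d of %d positions; trimmed %d from the end\n",
           n_ctx, n_audio_ctx_full, n_audio_ctx_full - n_ctx);
    return 0;
}
```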
A-ha, lovely. Thanks a lot again!
According to #137, I set -ac = 750, but the result has lots of noise words: "[buzzer]" / "[static]" / "[AUDIO OUT]". How can I remove them?
Currently, the only way is to manually replace these strings yourself (for example, using a regex).
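For instance, a minimal post-processing sketch (illustrative only, not part of whisper.cpp) that strips any bracketed tag such as [buzzer] or [AUDIO OUT] from a transcribed segment:

```cpp
#include <cstdio>
#include <regex>
#include <string>

int main() {
    const std::string segment = "hello [buzzer] world [AUDIO OUT] again";

    // match any "[...]" tag plus any whitespace that follows it
    const std::regex noise(R"(\[[^\]]*\]\s*)");
    const std::string cleaned = std::regex_replace(segment, noise, "");

    printf("%s\n", cleaned.c_str()); // prints: hello world again
    return 0;
}
```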
[WIP]
With the idea in #137 it is possible to reduce the time spent in the encoder several-fold.
This is beneficial for the `stream` example, because it already processes the audio in short chunks. The decoding quality seems to drop, but I think not significantly.
With the current parameters, I am able to run the following commands in real-time on MacBook M1 Pro:
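As an illustration (not necessarily the exact commands used), a typical `stream` invocation from the repository docs has this shape, with the model path, thread count, and timings as placeholders:

```sh
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```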
This was not possible before.
Next thing to try is to run the `tiny` model in streaming mode in the browser using WASM with a step of 1 or 2 seconds. I think there is some chance it could actually work.