Feature request for transformers use-cases #673

zucchini-nlp opened this issue May 9, 2025 · 2 comments

zucchini-nlp commented May 9, 2025

🚀 The feature

Hi 👋

First of all, huge thanks to you and the team: the latest torchcodec release with audio support is fantastic! It's a long-awaited feature.

I maintain the multimodal models in transformers, and I'm thinking of using torchcodec to load multimodal data for MLLMs. I'm looking forward to a stable release. For now, I've been testing the latest release and noticed a few points that might be worth considering for future support.

  1. Mono-channel audio support: Some audio models (like Whisper from Hugging Face) only support mono-channel input. It would be helpful if audio loading allowed channel selection or could optionally convert stereo to mono.

  2. Fallback for video files with no audio: Loading audio from a video file that has no audio stream currently raises an error. A more flexible behavior would be to return None, similar to moviepy, where the caller can check: if clip.audio is not None.

  3. Loading from URL: Loading audio/video from URLs works for the URLs I have tested, but I couldn't find in the docs whether URL input is officially supported. I hope it will be officially supported in the stable release.

  4. Video decoder issues with the AVI format: When loading AVI files, the decoder fails to infer the duration and related metadata, which prevents sampling frames by seconds. Loading the same video saved as MP4 resolves the issue. You can try this video as an example.
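Until channel selection from point 1 is available in the decoder itself, a caller could downmix on their side. The following is only a minimal sketch, not torchcodec's implementation: it assumes decoded samples expose a `data` tensor shaped (num_channels, num_samples) and a `sample_rate`, and it averages channels, which is one common downmix convention among several. The helper names `to_mono` and `load_mono` are hypothetical.

```python
def to_mono(waveform):
    """Average all channels of a (num_channels, num_samples) tensor into one.

    Returns a tensor of shape (1, num_samples); mono input is returned unchanged.
    """
    if waveform.shape[0] == 1:
        return waveform
    # Averaging is an assumption of this sketch, not a documented torchcodec behavior.
    return waveform.mean(dim=0, keepdim=True)


def load_mono(source):
    """Decode all audio from `source` and downmix it to a single channel."""
    # Imported here so that to_mono() stays usable without torchcodec installed.
    from torchcodec.decoders import AudioDecoder

    samples = AudioDecoder(source).get_samples_played_in_range(start_seconds=0.0)
    # samples.data is assumed to be a float tensor shaped (num_channels, num_samples).
    return to_mono(samples.data), samples.sample_rate
```

Calling `load_mono("clip.mp4")` (a placeholder path) would then return a (1, num_samples) tensor plus the sample rate, which is the layout mono-only models typically expect before feature extraction.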

Let me know if you'd like me to file any of these separately or provide reproducible examples. Thanks again for the awesome work!

Motivation, pitch

No response

NicolasHug (Member) commented May 12, 2025

Hi @zucchini-nlp

Thanks a lot for the great feedback!

  1. I've opened "Allow user to choose num_channels in AudioDecoder" (#675) to keep track of this feature. I think we should be able to implement it for the next release.
  2. As you noted, TorchCodec raises an error instead of returning None. We go by the principle that TorchCodec should make it very obvious when something goes wrong. Instead of checking if audio is not None, users can catch the error with a try/except statement; the choice between the two is probably just a matter of taste.
  3. We have updated the docstrings of the AudioDecoder and VideoDecoder now, and both should indicate that URLs are supported. Let us know if you encounter issues with some specific URLs!
  4. Thank you for sharing the video! I've opened two follow-up issues on this: "Approximate mode fails on video" (#677) and "ZeroDivisionError when accessing metadata" (#676). To unblock you: if you just need to decode some frames, I think you should be able to do so with seek_mode="exact". If you need to access metadata, use seek_mode="approximate". There is definitely something wrong in the way TorchCodec handles this video, probably related to its metadata, so we'll investigate a bit more. I think re-encoding into MP4 works because the encoding is able to "fix" the metadata problem. In other words, the issue is tied to the metadata of these specific videos rather than to the format itself.
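Points 2 and 4 above can be sketched roughly as follows. This is an illustration under assumptions, not an official recipe: the thread does not say which exception type is raised for a missing audio stream, so catching RuntimeError and ValueError is a guess to adjust after checking the real error; the helper names and the `decoder_cls` parameter are hypothetical (the latter exists only to make the helpers easy to test); and whether the workaround succeeds on the problematic AVI file is not verified here.

```python
def load_audio_or_none(source, decoder_cls=None):
    """Return decoded audio samples, or None if `source` has no decodable audio."""
    if decoder_cls is None:
        # Lazy import so the helper can be exercised without torchcodec installed.
        from torchcodec.decoders import AudioDecoder as decoder_cls
    try:
        return decoder_cls(source).get_samples_played_in_range(start_seconds=0.0)
    except (RuntimeError, ValueError):
        # ASSUMPTION: the exception types are not stated in the thread.
        # Narrow or widen this tuple to match the error actually raised.
        return None


def decode_frames(source, indices, decoder_cls=None):
    """Decode frames by index using exact seek mode, as suggested for decoding."""
    if decoder_cls is None:
        from torchcodec.decoders import VideoDecoder as decoder_cls
    decoder = decoder_cls(source, seek_mode="exact")
    return decoder.get_frames_at(indices=indices)


def read_metadata(source, decoder_cls=None):
    """Read stream metadata using approximate seek mode, as suggested above."""
    if decoder_cls is None:
        from torchcodec.decoders import VideoDecoder as decoder_cls
    return decoder_cls(source, seek_mode="approximate").metadata
```

With a wrapper like `load_audio_or_none`, calling code can keep the moviepy-style check (if audio is not None) while the library itself continues to raise on failure.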

zucchini-nlp (Author) commented

Thanks a lot @NicolasHug! Looking forward to future releases 🤗
