Skip to content

Add AudioDecoder.get_samples_played_in_range() public method #555

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 13, 2025

Conversation

NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Mar 13, 2025

Towards #549

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 13, 2025
@NicolasHug NicolasHug mentioned this pull request Mar 12, 2025
7 tasks

def get_samples_played_in_range(
self, start_seconds: float, stop_seconds: Optional[float] = None
) -> AudioSamples:
Copy link
Member Author

@NicolasHug NicolasHug Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like start_seconds should default to the stream's beginning, but that can be done later.

Copy link
Contributor

@scotts scotts Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, we should also mirror that in the video API. We could try to play the same trick that range() does, where the semantics of the first parameter changes based on the number of parameters, but I'd (softly) rather not do that.

Copy link
Member Author

@NicolasHug NicolasHug Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'll update #150 eventually with more of these desirable default behaviors. That would be a good onboarding PR.

@@ -114,3 +114,28 @@ def __len__(self):

def __repr__(self):
return _frame_repr(self)


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below: the audio decoding API returns the new AudioSamples class rather than a pure Tensor. I think it has the following benefits:

  • users can keep track of the sample_rate within that struct without having to handle it separately
  • for some edge-cases, like with our mp3 test asset, the stream's beginning isn't 0. So we also returns pts_seconds, which may not always be equal to start_seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, and it mirrors the video API. In the video API, __getitem__() returns just the tensor, but the named methods return Frame or FrameBatch. I think we should probably do that for audio as well.

sample_rate: int

def __post_init__(self):
# This is called after __init__() when a Frame is created. We can run
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# This is called after __init__() when a Frame is created. We can run
# This is called after __init__() when an AudioSamples instance is created. We can run

@@ -854,7 +854,7 @@ VideoDecoder::AudioFramesOutput VideoDecoder::getFramesPlayedInRangeAudio(

if (startSeconds == stopSeconds) {
// For consistency with video
return AudioFramesOutput{torch::empty({0}), 0.0};
return AudioFramesOutput{torch::empty({0, 0}), 0.0};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by, the video APIs return something of shape (0, C, H, W) (where C H W are from the metadata). Here, (0, 0) is the best we can do, at least it preserves the number of dimensions.

output_pts_seconds = first_pts

num_samples = frames.shape[1]
last_pts = first_pts + num_samples / self.metadata.sample_rate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't last_pts be a float here? Is that your intention, or should we call round()?

Copy link
Member Author

@NicolasHug NicolasHug Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's a float. I didn't add the _seconds everywhere, but they're all in seconds. I can do it if you think it improves clarity.

We do call round() just below, which I think is enough? Or is there an edge-case I'm missing?

@NicolasHug NicolasHug merged commit 1fd20b2 into pytorch:main Mar 13, 2025
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants