
Call MeshRayCast cast_ray without mut #19176


Open
wants to merge 8 commits into main

Conversation


@powei-lin commented May 11, 2025

Objective

  • cast_ray always clears the storage first without checking anything, so I don't see why it needs to store those values:
        self.hits.clear();
        self.culled_list.clear();
        self.output.clear();
  • The benefit of making cast_ray work without &mut self is that the call can easily be parallelized. For example, I was trying to simulate lidar. The original implementation makes simulating thousands of ray casts a bottleneck on one CPU thread, because MeshRayCast is a SystemParam and calling cast_ray requires &mut self. After removing &mut self you can easily use it with rayon's par_iter (see the standalone sketch after this list).
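
A standalone sketch of the parallel use case (illustrative only: Caster, its cast_ray, and the f32 ray stand-in are placeholders, not Bevy's MeshRayCast API). The point is that a &self method can be called directly from rayon's par_iter, while a &mut self method forces the casts onto a single thread or behind a lock.

    // Standalone illustration, not Bevy code: why `&self` enables `par_iter`.
    use rayon::prelude::*;

    struct Caster;

    impl Caster {
        // Because this takes `&self`, many rayon worker threads can call it concurrently.
        fn cast_ray(&self, origin: f32) -> Option<f32> {
            // Placeholder for the real AABB/triangle intersection test.
            (origin >= 0.0).then_some(origin * 2.0)
        }
    }

    fn main() {
        let caster = Caster;
        let rays: Vec<f32> = (0..1200).map(|i| i as f32).collect();
        // `par_iter` only needs a shared `&caster`, so the rays are cast in parallel.
        let hits: Vec<f32> = rays.par_iter().filter_map(|r| caster.cast_ray(*r)).collect();
        println!("{} hits", hits.len());
    }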

Solution

  • Remove &mut self from cast_ray.

Testing

  • Did you test these changes? If so, how? It's a simple change; I needed to fix the API and the example. All the unit tests still pass.
  • Are there any parts that need more testing?
  • How can other people (reviewers) test your changes? Is there anything specific they need to know?
  • If relevant, what platforms did you test these changes on, and are there any important ones you can't test?

Contributor

Welcome, new contributor!

Please make sure you've read our contributing guide and we look forward to reviewing your pull request shortly ✨

@powei-lin powei-lin changed the title Remove mut Call MeshRayCast cast_ray without mut May 11, 2025
@powei-lin powei-lin marked this pull request as ready for review May 11, 2025 14:57
@hukasu
Contributor

hukasu commented May 12, 2025

It is to cache the capacity of the vectors; if every call to cast_ray needed to allocate a 500-item vector, it would be really slow.
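
A rough standalone illustration of this point (arbitrary sizes, not the Bevy code): clearing a buffer keeps its previously allocated capacity, while building a fresh Vec allocates on every call.

    // Standalone sketch: buffer reuse keeps capacity, a fresh `Vec` allocates each call.
    fn with_reuse(buf: &mut Vec<u64>, n: u64) -> u64 {
        buf.clear(); // capacity from earlier calls is kept
        buf.extend(0..n);
        buf.iter().sum()
    }

    fn with_fresh_alloc(n: u64) -> u64 {
        let buf: Vec<u64> = (0..n).collect(); // allocates every call
        buf.iter().sum()
    }

    fn main() {
        let mut buf = Vec::with_capacity(500);
        let a: u64 = (0..10_000u64).map(|_| with_reuse(&mut buf, 500)).sum();
        let b: u64 = (0..10_000u64).map(|_| with_fresh_alloc(500)).sum();
        assert_eq!(a, b); // same result, but the reuse version avoids 10_000 allocations
    }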

@powei-lin
Author

@hukasu It's actually slower if you make it require &mut self. Casting 1000 rays in parallel is much faster than saving the time of allocating vectors but only running on one thread. If you have a better suggestion I can change the implementation, but I hope calling cast_ray doesn't require &mut self.

@hukasu
Contributor

hukasu commented May 12, 2025

Did you already bench it to confirm that?

@janhohenheim janhohenheim added S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help C-Performance A change motivated by improving speed, memory usage or compile times A-Picking Pointing at and selecting objects of all sorts labels May 12, 2025
@powei-lin
Author

powei-lin commented May 12, 2025

I'm testing on an M4 Mac mini base model. It has 4 performance cores and 6 efficiency cores.

Each ray cast has ~110 AABB hits, and there are 1200 rays. I'm using cargo run -r with opt-level = 3.
The original implementation took 2_696_871 microseconds per frame.
The new implementation with par_iter took 466_937 microseconds per frame.

@atlv24
Contributor

atlv24 commented May 12, 2025

A thread_local cache could be the best of both worlds.
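
One possible reading of that suggestion, as a standalone sketch (the buffer type and cast_ray body are placeholders, not Bevy's): keep per-thread scratch buffers in a thread_local, so the method can stay &self-friendly while each worker thread still reuses its allocations.

    // Standalone sketch of a thread-local cache.
    use std::cell::RefCell;

    thread_local! {
        // Each thread (e.g. each rayon worker) gets its own reusable hit buffer.
        static HITS: RefCell<Vec<(u32, f32)>> = RefCell::new(Vec::with_capacity(512));
    }

    fn cast_ray(ray_id: u32) -> Option<(u32, f32)> {
        HITS.with(|hits| {
            let mut hits = hits.borrow_mut();
            hits.clear(); // reuse the capacity instead of allocating per call
            // Placeholder for the real intersection loop.
            hits.push((ray_id, ray_id as f32 * 0.5));
            hits.first().copied()
        })
    }

    fn main() {
        let nearest: Vec<_> = (0..1000u32).filter_map(cast_ray).collect();
        println!("{} rays returned a hit", nearest.len());
    }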

@janhohenheim janhohenheim added S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged S-Needs-Review Needs reviewer attention (from anyone!) to move forward and removed S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged labels May 12, 2025
Member

@Vrixyz left a comment


Nit: alternatively, cast_ray could take as parameters the offending structures needed for its call, so we can instantiate as many as we need (one per thread) and reuse them, rather than reallocating them each time, but in parallel.

As there has been a positive benchmark for the stress test, I'm approving.
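
A standalone sketch of that alternative (RayCastScratch, Caster, and the hit types are hypothetical, not the actual API): the caller owns the scratch storage, so rayon's map_init can hand one reusable buffer to each worker thread.

    // Standalone sketch: caller-provided scratch storage, one per rayon worker.
    use rayon::prelude::*;

    #[derive(Default)]
    struct RayCastScratch {
        hits: Vec<(u32, f32)>,
        culled: Vec<u32>,
    }

    struct Caster;

    impl Caster {
        // `&self` plus explicit scratch: no shared mutable state, no per-call allocation.
        fn cast_ray(&self, ray_id: u32, scratch: &mut RayCastScratch) -> Option<(u32, f32)> {
            scratch.hits.clear();
            scratch.culled.clear();
            // Placeholder for the real intersection work.
            scratch.hits.push((ray_id, 1.0));
            scratch.hits.first().copied()
        }
    }

    fn main() {
        let caster = Caster;
        let rays: Vec<u32> = (0..1200).collect();
        // `map_init` builds one scratch buffer per worker thread and reuses it across rays.
        let results: Vec<_> = rays
            .par_iter()
            .map_init(RayCastScratch::default, |scratch, ray| caster.cast_ray(*ray, scratch))
            .collect();
        println!("{} rays processed", results.len());
    }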

Labels
A-Picking: Pointing at and selecting objects of all sorts
C-Performance: A change motivated by improving speed, memory usage or compile times
S-Needs-Review: Needs reviewer attention (from anyone!) to move forward