Speed up Gauss3.gauss() by using the imglib2 "Parallelization" class #93


Merged: 3 commits merged into master on Nov 9, 2021

Conversation

@maarzt (Contributor) commented Nov 5, 2021

This PR changes how multi-threading is done in the class Gauss3 and the package net.imglib2.algorithm.convolution. The code now simply uses the multi-threading that's built into LoopBuilder.

This noticeably improves performance. The reason is that Gauss3.gauss(sigma, source, target) no longer creates its own ExecutorService. It instead uses imglib2's Parallelization class, which by default uses the common ForkJoinPool. This reduces the number of threads that the code creates and thereby improves performance.

For Labkit's pixel classification the speed difference is roughly 30%. (Labkit makes heavy use of imglib2 convolutions.)
A simple benchmark also shows the difference: blurring a small 100x100 pixel image with Gauss3.gauss now takes 0.22 ms on my machine, whereas the old code required 0.58 ms.
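
For illustration, a rough timing sketch along those lines (not the actual benchmark behind these numbers; the class name, the all-zero image contents, sigma = 2, the mirror extension, and the warm-up loop are arbitrary choices):

import net.imglib2.algorithm.gauss3.Gauss3;
import net.imglib2.img.Img;
import net.imglib2.img.array.ArrayImgs;
import net.imglib2.type.numeric.real.FloatType;
import net.imglib2.view.Views;

public class GaussTimingSketch
{
	public static void main( final String... args ) throws Exception
	{
		// 100x100 pixel image, as in the description above; the pixel values (all zero) don't matter for timing.
		final Img< FloatType > source = ArrayImgs.floats( 100, 100 );
		final Img< FloatType > target = ArrayImgs.floats( 100, 100 );

		// Warm up the JIT before measuring.
		for ( int i = 0; i < 100; i++ )
			Gauss3.gauss( 2.0, Views.extendMirrorSingle( source ), target );

		final long start = System.nanoTime();
		Gauss3.gauss( 2.0, Views.extendMirrorSingle( source ), target );
		System.out.println( ( System.nanoTime() - start ) / 1e6 + " ms" );
	}
}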

This simplifies the code, but also improves the performance,
because Gauss3.gauss(sigma, source, target) will no longer create
its own ExecutorService.

This also means that the multi-threading behaviour of Gauss3
can now conveniently be controlled using the Parallelization class.
For example:

Parallelization.runSingleThreaded(() -> {
	Gauss3.gauss(sigma, source, target);
});
@@ -126,10 +127,9 @@
*/
public static < S extends NumericType< S >, T extends NumericType< T > > void gauss( final double[] sigma, final RandomAccessible< S > source, final RandomAccessibleInterval< T > target ) throws IncompatibleTypeException
{
final int numthreads = Runtime.getRuntime().availableProcessors();
@tpietzsch (Member) commented on this line:

Please update the javadoc:
"Computation is multi-threaded with as many threads as processors available."
should be something like
"...Computation may be multi-threaded, according to the current {@link Parallelization} context. (By default, runs on the common ForkJoinPool ...)"

@maarzt (Contributor, Author) replied:

Good point. I updated the javadoc.

@tpietzsch (Member) commented:

Very nice!

Could you put a bit more work into Gauss3? I would

  • deprecate all static methods that specify a number of threads or an ExecutorService, and
  • suggest in the javadoc of those methods how to get the same behaviour with Parallelization settings instead (a sketch of what that could look like follows below).
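
A sketch of what such a deprecation might look like, not the actual change that ended up in the PR; the signature mirrors the existing ExecutorService overload, while the class name, the javadoc wording, and the suggested Parallelization.runSingleThreaded replacement are assumptions based on the usage shown in this PR:

import java.util.concurrent.ExecutorService;

import net.imglib2.RandomAccessible;
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.exception.IncompatibleTypeException;
import net.imglib2.type.numeric.NumericType;

public class Gauss3DeprecationSketch
{
	/**
	 * @deprecated Use {@code gauss( sigma, source, target )} and control multi-threading
	 *             via the {@code Parallelization} context instead, e.g.
	 *             {@code Parallelization.runSingleThreaded( () -> Gauss3.gauss( sigma, source, target ) )}
	 *             to run on a single thread. By default the computation runs on the common ForkJoinPool.
	 */
	@Deprecated
	public static < S extends NumericType< S >, T extends NumericType< T > > void gauss( final double[] sigma, final RandomAccessible< S > source, final RandomAccessibleInterval< T > target, final ExecutorService service ) throws IncompatibleTypeException
	{
		// ... would delegate to the non-deprecated overload, wrapped in the appropriate Parallelization context ...
	}
}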

@tpietzsch (Member) commented:

I played with it a bit, and the speedup seems to be not only due to the creation/shutdown of the ExecutorService.
There is overhead that also shows up for bigger images (e.g. 100x100x100), where I would otherwise expect the creation/shutdown overhead to become negligible relative to the actual computation.

It seems to be also due to using the common ForkJoinPool instead of a FixedThreadPool.
I can get basically the same performance by passing Parallelization.getExecutorService() to the old code.
Just passing a FixedThreadPool with the number of threads equal to the parallelism level of the ForkJoinPool does not work.
I can get closer with a FixedThreadPool that has more threads, but not as close as the common pool.
Very interesting.
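
A sketch of that kind of comparison (the class name, the 100x100x100 float image, sigma = 2, and the lack of warm-up or repeated runs are illustrative assumptions): it feeds the old ExecutorService overload once with Parallelization.getExecutorService() and once with a FixedThreadPool sized to the common pool's parallelism.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

import net.imglib2.algorithm.gauss3.Gauss3;
import net.imglib2.img.Img;
import net.imglib2.img.array.ArrayImgs;
import net.imglib2.parallel.Parallelization;
import net.imglib2.type.numeric.real.FloatType;
import net.imglib2.view.Views;

public class PoolComparisonSketch
{
	public static void main( final String... args ) throws Exception
	{
		final Img< FloatType > source = ArrayImgs.floats( 100, 100, 100 );
		final Img< FloatType > target = ArrayImgs.floats( 100, 100, 100 );
		final double[] sigma = { 2, 2, 2 };

		// Variant 1: the old ExecutorService overload, fed with the executor of the
		// current Parallelization context (the common ForkJoinPool by default).
		long start = System.nanoTime();
		Gauss3.gauss( sigma, Views.extendMirrorSingle( source ), target, Parallelization.getExecutorService() );
		System.out.println( "Parallelization.getExecutorService(): " + ( System.nanoTime() - start ) / 1e6 + " ms" );

		// Variant 2: the same overload with a FixedThreadPool matching the common pool's parallelism.
		final ExecutorService fixed = Executors.newFixedThreadPool( ForkJoinPool.getCommonPoolParallelism() );
		start = System.nanoTime();
		Gauss3.gauss( sigma, Views.extendMirrorSingle( source ), target, fixed );
		System.out.println( "FixedThreadPool: " + ( System.nanoTime() - start ) / 1e6 + " ms" );
		fixed.shutdown();
	}
}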

@maarzt (Contributor, Author) commented Nov 9, 2021

@tpietzsch Thanks for the review. I updated the javadoc as you suggested.

Strange that the ForkJoinPool is overall faster than the FixedThreadPool. I have no idea why.

@maarzt (Contributor, Author) commented Nov 9, 2021

Is there anything else to do, or is the PR ready to be merged?

tpietzsch merged commit aae1f4d into master on Nov 9, 2021
@tpietzsch (Member) commented:

> Strange that the ForkJoinPool is overall faster than the FixedThreadPool. I have no idea why.

If I understand correctly, the ForkJoinPool will start new threads to maintain its parallelism level if other threads are stalled. I think that must be it, but I don't really see where that would help the convolution.

tpietzsch deleted the gauss3-parallelization branch on November 9, 2021, 09:22
@maarzt (Contributor, Author) commented Nov 9, 2021

Cool, it's merged 👍 ✨

I have a slightly different understanding of the ForkJoinPool. With a usual thread pool it goes like this: all the tasks are submitted to the pool, and the thread that submitted them waits for them to finish. During this time that thread is essentially blocked.

This is different when the ForkJoinPool is used together with ForkJoinTasks. Here the thread submitting the tasks will not only wait for the tasks to finish but will also help processing them. That's the trick.
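
A minimal standalone illustration of that fork/join behaviour, using only plain JDK classes (the range sum, the class name, and the thresholds are arbitrary; this is not imglib2 code): each task forks half of its work, computes the other half itself, and join() lets a pool thread help with queued tasks instead of idly waiting.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinHelpingSketch
{
	// Sums the integers in [from, to) by recursively splitting into ForkJoinTasks.
	static class Sum extends RecursiveTask< Long >
	{
		private final long from, to;

		Sum( final long from, final long to )
		{
			this.from = from;
			this.to = to;
		}

		@Override
		protected Long compute()
		{
			if ( to - from <= 1_000_000 )
			{
				long sum = 0;
				for ( long i = from; i < to; i++ )
					sum += i;
				return sum;
			}
			final long mid = ( from + to ) / 2;
			final Sum left = new Sum( from, mid );
			left.fork(); // queue the left half for the pool
			final long right = new Sum( mid, to ).compute(); // compute the right half ourselves
			// join() is not a plain blocking wait: a ForkJoinPool worker that joins a
			// pending task may execute other queued tasks in the meantime ("helping").
			return left.join() + right;
		}
	}

	public static void main( final String... args )
	{
		System.out.println( ForkJoinPool.commonPool().invoke( new Sum( 0, 100_000_000 ) ) );
	}
}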

@tpietzsch (Member) commented Nov 9, 2021

> This is different when the ForkJoinPool is used together with ForkJoinTasks. Here the thread submitting the tasks will not only wait for the tasks to finish but will also help processing them. That's the trick.

I don't think that's the only thing. The waiting thread doesn't really consume any CPU, so it would just be a matter of adding more threads to the FixedThreadPool. That actually helps: on my computer, the common pool parallelism is 11 and numProcessors is 12. With a FixedThreadPool of 24 threads I get much closer to common-pool performance than with a FixedThreadPool of 12 threads, but the common pool is still a bit faster. It could be that this is due to other effects, like a different number of tasks, etc.

@maarzt (Contributor, Author) commented Nov 9, 2021

On my computer the story is completely different. The CPU has 4 cores with hyper-threading, so it can actively run 8 threads in parallel. There's not much of a performance difference between a FixedThreadPool with 4 or 8 threads, and the ForkJoinPool shows similar performance as well...
