Implement threaded parallel fetches. #263

Merged
merged 6 commits into from
Jul 23, 2013
Conversation

seancribbs

The implementation is based on a static worker pool that is started when the first multiget operation is performed. Workers are reused across multiget operations and feed their responses back to the requestor via a queue, similar to the ThreadPoolExecutor idea in Java. There are opportunities to make this pool configurable, but for the moment it simply uses the CPU count to decide how many workers to start.
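As a rough illustration of the design described above (a minimal sketch only — `MultiGetPool`, `multiget`, and the task tuple shape are hypothetical names for this example, not the PR's actual code), a static pool of worker threads can pull fetch tasks from a shared input queue and push each response onto a queue owned by the requesting call:

```python
import threading
from multiprocessing import cpu_count
from queue import Queue  # Python 3; the 2013-era code would use `from Queue import Queue`

POOL_SIZE = cpu_count()  # assumed default, mirroring the CPU-count heuristic


class MultiGetPool:
    """Sketch of a static, lazily started worker pool.

    Workers pull (fetch, key, outq) tasks from a shared input queue
    and put (key, result) pairs onto the requestor's private queue.
    """

    def __init__(self, size=POOL_SIZE):
        self._inq = Queue()
        self._threads = []
        self._size = size
        self._started = False

    def start(self):
        # Started on the first multiget operation, then reused.
        if not self._started:
            for _ in range(self._size):
                t = threading.Thread(target=self._worker)
                t.daemon = True
                t.start()
                self._threads.append(t)
            self._started = True

    def enqueue(self, task):
        self._inq.put(task)

    def _worker(self):
        while True:
            fetch, key, outq = self._inq.get()
            try:
                outq.put((key, fetch(key)))
            except Exception as err:
                # Deliver the error to the requestor rather than dying.
                outq.put((key, err))
            finally:
                self._inq.task_done()


def multiget(pool, fetch, keys):
    """Fan keys out to the pool; collect responses from a private queue."""
    pool.start()
    outq = Queue()
    for key in keys:
        pool.enqueue((fetch, key, outq))
    return dict(outq.get() for _ in keys)
```

Because the pool is module-level state, repeated multiget calls reuse the same threads instead of paying thread-startup cost per batch.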

This also adds a little benchmark utility that mimics Ruby's benchmark.rb. Unfortunately, the benchmarks indicate that HTTP gets no benefit from parallel fetches; in fact, it suffers. I am unable to understand how this is possible unless the payload size is so small that generating the requests and parsing the responses dominates network latency. Another possibility would be to allow the pool to use multiprocessing instead of threading, which gets around the GIL but incurs the cost of crossing process boundaries.

$ python -m riak.client.multiget
Benchmarking multiget:
      CPUs: 8
   Threads: 8
      Keys: 10000

             user         system       ( real         )
populate         2.460000     0.210000 (    10.640000 )


Rehearsal -------------------------------------------------
http seq        11.750000     4.160000 (    26.500000 )
http multi      21.650000    20.810000 (    31.190000 )
pbc seq          2.010000     0.180000 (     6.530000 )
pbc multi        4.080000     1.990000 (     4.840000 )
-----------------------------------------------------------

             user         system       ( real         )
http seq        13.040000     4.290000 (    28.080000 )
http multi      21.340000    20.430000 (    30.660000 )
pbc seq          2.110000     0.200000 (     6.570000 )
pbc multi        3.890000     1.920000 (     4.580000 )
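The output above follows the Ruby `benchmark.rb` convention of a throwaway "Rehearsal" pass followed by the real timed pass. A minimal sketch of such a `bmbm` helper might look like this (the signature and formatting are assumptions for illustration, not the PR's actual `riak.benchmark` code):

```python
import os
import time


def bmbm(items, label_width=12):
    """Ruby-style bmbm sketch: run every benchmark once as a rehearsal
    (warming caches and pools), then run the timed pass for real.

    `items` is a list of (label, callable) pairs.
    """
    def report(label, fn):
        cpu0, real0 = os.times(), time.time()
        fn()
        cpu1, real1 = os.times(), time.time()
        user = cpu1[0] - cpu0[0]      # user CPU time
        system = cpu1[1] - cpu0[1]    # system CPU time
        print("%-*s %12.6f %12.6f ( %12.6f )" %
              (label_width, label, user, system, real1 - real0))

    print("Rehearsal ".ljust(label_width + 47, "-"))
    for label, fn in items:
        report(label, fn)
    print("-" * (label_width + 47))
    print()
    for label, fn in items:
        report(label, fn)
```

Reporting user and system CPU time alongside wall-clock ("real") time is what makes the GIL contention visible: in the `http multi` rows above, CPU time exceeds the sequential case even though wall time barely moves.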

Addresses #225.

@shuhaowu
Contributor

Just a comment: is it possible to avoid using Python threading?

__all__ = ['bm', 'bmbm']


def bmbm():
Contributor


cryptic!

@seancribbs
Author

@shuhaowu I'm already using threading.

@ghost ghost assigned seancribbs Jun 26, 2013
@shuhaowu
Contributor

Yeah. I was concerned about this, as using threading to parallelize operations in Python usually causes performance degradation (unless the I/O wait is really long, which it shouldn't be in this case).
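The concern is the GIL: CPU-bound Python threads serialize on the interpreter lock, while I/O-bound threads can overlap their waits. A small illustrative sketch (not from the PR; the helper names are made up for this example):

```python
import threading
import time


def run_threaded(fn, n=4):
    """Run `fn` concurrently in `n` threads; return elapsed wall time."""
    threads = [threading.Thread(target=fn) for _ in range(n)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


def io_bound():
    time.sleep(0.1)  # stands in for waiting on a socket


def cpu_bound():
    sum(i * i for i in range(200_000))  # pure-Python work holds the GIL

# Four io_bound threads overlap their sleeps: ~0.1 s total, not 0.4 s.
# Four cpu_bound threads contend for the GIL: roughly the single-thread
# time multiplied by the thread count, which matches the `http multi`
# numbers where CPU time ballooned without improving wall time.
```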

"""

def __init__(self, size=POOL_SIZE):
self._inq = Queue()

It appears that this Queue is unbounded. Although, given the efficacy of this feature combined with the typical volume of keys involved, this would probably only bite someone in the absolute worst case.
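One way to address the unbounded-queue concern (a sketch of a possible mitigation, not something the PR does; `BACKLOG` is an assumed, tunable bound) is to give the input queue a `maxsize` so producers block and apply backpressure once the workers fall behind:

```python
import queue
from queue import Queue

BACKLOG = 1024  # hypothetical bound on outstanding fetch tasks

# With a maxsize, put() blocks when the queue is full instead of letting
# the backlog grow without limit; put_nowait() raises queue.Full, letting
# the caller decide how to shed load instead.
inq = Queue(maxsize=BACKLOG)
```

The trade-off is that a full queue now stalls the enqueueing thread, so the bound should comfortably exceed the largest expected multiget batch.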

@mgodave

mgodave commented Jul 23, 2013

👍

seancribbs pushed a commit that referenced this pull request Jul 23, 2013
Implement threaded parallel fetches.
@seancribbs seancribbs merged commit f8cd2e5 into master Jul 23, 2013
@seancribbs seancribbs deleted the gh225-multi-get branch July 23, 2013 17:53
@seancribbs seancribbs removed their assignment May 8, 2015
4 participants