gh-135551: Change how sorting picks minimum run length #135553

Merged: 9 commits, Jun 27, 2025
Initial stab at implementing Stefan Pochmann's spiffy new minrun scheme.
tim-one committed Jun 16, 2025
commit 0bd2fd757913e6519e3b3a1b05bfcaba927ca21e
1 change: 1 addition & 0 deletions Misc/ACKS
@@ -1480,6 +1480,7 @@ Jean-François Piéronne
Oleg Plakhotnyuk
Anatoliy Platonov
Marcel Plch
Stefan Pochmann
Kirill Podoprigora
Remi Pointel
Jon Poler
51 changes: 26 additions & 25 deletions Objects/listobject.c
@@ -1684,10 +1684,7 @@ sortslice_advance(sortslice *slice, Py_ssize_t n)
/* Avoid malloc for small temp arrays. */
#define MERGESTATE_TEMP_SIZE 256

/* The largest value of minrun. This must be a power of 2, and >= 1, so that
* the compute_minrun() algorithm guarantees to return a result no larger than
* this,
*/
/* The largest value of minrun. This must be a power of 2, and >= 1 */
#define MAX_MINRUN 64
#if ((MAX_MINRUN) < 1) || ((MAX_MINRUN) & ((MAX_MINRUN) - 1))
#error "MAX_MINRUN must be a power of 2, and >= 1"
@@ -1748,6 +1745,12 @@ struct s_MergeState {
     * of tuples. It may be set to safe_object_compare, but the idea is that hopefully
     * we can assume more, and use one of the special-case compares. */
    int (*tuple_elem_compare)(PyObject *, PyObject *, MergeState *);

    /* Variables used for minrun computation. The "ideal" minrun length is
     * the infinite precision listlen / 2**e, which is represented as the
     * mathematical value of mr_int + mr_frac / 2**e.
     */
    Py_ssize_t mr_int, mr_frac, mr_current_frac, mr_e, mr_mask;
};

/* binarysort is the best method for sorting small arrays: it does few
@@ -2209,6 +2212,16 @@ merge_init(MergeState *ms, Py_ssize_t list_size, int has_keyfunc,
    ms->min_gallop = MIN_GALLOP;
    ms->listlen = list_size;
    ms->basekeys = lo->keys;

    ms->mr_int = list_size;
    ms->mr_e = 0;
    while (ms->mr_int >= MAX_MINRUN) {
        ms->mr_int >>= 1;
        ++ms->mr_e;
    }
    ms->mr_mask = ((Py_ssize_t)1 << ms->mr_e) - 1;  /* shift in Py_ssize_t width */
    ms->mr_frac = list_size & ms->mr_mask;
    ms->mr_current_frac = 0;
}

/* Free all the temp memory owned by the MergeState. This must be called
@@ -2686,27 +2699,15 @@ merge_force_collapse(MergeState *ms)
    return 0;
}

/* Compute a good value for the minimum run length; natural runs shorter
* than this are boosted artificially via binary insertion.
*
* If n < MAX_MINRUN return n (it's too small to bother with fancy stuff).
* Else if n is an exact power of 2, return MAX_MINRUN / 2.
* Else return an int k, MAX_MINRUN / 2 <= k <= MAX_MINRUN, such that n/k is
* close to, but strictly less than, an exact power of 2.
*
* See listsort.txt for more info.
*/
static Py_ssize_t
merge_compute_minrun(Py_ssize_t n)
/* Return the next minrun value to use. See listsort.txt. */
static inline Py_ssize_t
minrun_next(MergeState *ms)
{
    Py_ssize_t r = 0;           /* becomes 1 if any 1 bits are shifted off */

    assert(n >= 0);
    while (n >= MAX_MINRUN) {
        r |= n & 1;
        n >>= 1;
    }
    return n + r;
    ms->mr_current_frac += ms->mr_frac;
    assert(ms->mr_current_frac >> ms->mr_e <= 1);
    Py_ssize_t result = ms->mr_int + (ms->mr_current_frac >> ms->mr_e);
    ms->mr_current_frac &= ms->mr_mask;
    return result;
}

/* Here we define custom comparison functions to optimize for the cases one commonly
@@ -3074,7 +3075,6 @@ list_sort_impl(PyListObject *self, PyObject *keyfunc, int reverse)
    /* March over the array once, left to right, finding natural runs,
     * and extending short natural runs to minrun elements.
     */
    minrun = merge_compute_minrun(nremaining);
    do {
        Py_ssize_t n;

@@ -3083,6 +3083,7 @@ list_sort_impl(PyListObject *self, PyObject *keyfunc, int reverse)
        if (n < 0)
            goto fail;
        /* If short, extend to min(minrun, nremaining). */
        minrun = minrun_next(&ms);
        if (n < minrun) {
            const Py_ssize_t force = nremaining <= minrun ?
                              nremaining : minrun;
126 changes: 110 additions & 16 deletions Objects/listsort.txt
@@ -270,8 +270,8 @@ result. This has two primary good effects:

Computing minrun
----------------
If N < MAX_MINRUN, minrun is N. IOW, binary insertion sort is used for the
whole array then; it's hard to beat that given the overheads of trying
something fancier (see note BINSORT).

When N is a power of 2, testing on random data showed that minrun values of
@@ -288,7 +288,6 @@ that 32 isn't a good choice for the general case! Consider N=2112:

>>> divmod(2112, 32)
(66, 0)

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32. The first 64 of these trigger a sequence of perfectly
@@ -301,22 +300,40 @@ to get 64 elements into place).
If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced. Better!

What we want to avoid is picking minrun such that in
The original code used a cheap heuristic to pick a minrun that avoided the
very worst cases of imbalance for the final merge, but "pretty bad" cases
still existed.

q, r = divmod(N, minrun)
In 2025, Stefan Pochmann found a much better approach, based on letting minrun
vary a bit from one run to the next. Under his scheme, at _all_ levels of the
merge tree:

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).
- The number of runs is a power of 2.
- At most two different run lengths appear.
- When two do appear, the smaller is one less than the larger.
- The lengths of run pairs merged never differ by more than one.

Instead we pick a minrun in range(MAX_MINRUN / 2, MAX_MINRUN + 1) such that
N/minrun is exactly a power of 2, or if that isn't possible, is close to, but
strictly less than, a power of 2. This is easier to do than it may sound:
take the first log2(MAX_MINRUN) bits of N, and add 1 if any of the remaining
bits are set. In fact, that rule covers every case in this section, including
small N and exact powers of 2; merge_compute_minrun() is a deceptively simple
function.
So, in all respects, as perfectly balanced as possible.

For the 2112 case, that also keeps minrun at 33, but we were lucky there
that 2112 is a power of 2 times 33. The new approach doesn't rely on luck.
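
To make the balance claims concrete, here is a short illustrative sketch in
Python (not the CPython implementation): cut an array of length n at the
indices int(mr * i) for the ideal run length mr = n / 2**e. For n = 2112 it
recovers 64 equal runs of length 33, and for an ordinary n it still yields at
most two run lengths differing by one.

```python
# Illustrative sketch of the variable-minrun cut points; integer division
# n * i // 2**e computes int(mr * i) exactly, with no floats involved.
MAX_MINRUN = 64

def run_lengths(n):
    # Find the smallest e with n / 2**e < MAX_MINRUN.
    e = 0
    while n >> e >= MAX_MINRUN:
        e += 1
    # Run i spans [int(mr * i), int(mr * (i+1))) for mr = n / 2**e.
    starts = [n * i // 2 ** e for i in range(2 ** e)] + [n]
    return [b - a for a, b in zip(starts, starts[1:])]

sizes = run_lengths(2112)   # 64 runs, every one of length 33
```

For n = 2000, say, this gives 32 runs, 16 of length 62 and 16 of length 63:
still as balanced as integer cut points allow.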

The basic idea is to conceive of the ideal run length as being a real number
rather than just an integer. For an array of length `n`, let `e` be the
smallest int such that n/2**e < MAX_MINRUN. Then mr = n/2**e is the ideal
run length, and obviously mr * 2**e is n, so there are exactly 2**e runs.

Of course runs can't have a fractional length, so we start the i'th (zero-
based) run at index int(mr * i), for i in range(2**e). The differences between
adjacent starting indices are the run lengths, and it's left as an exercise
for the reader to show that they have the nice properties listed above. See
note MINRUN CODE for an executable Python implementation to help make it all
concrete.

The code doesn't actually compute the starting indices, or use floats. Instead
mr is represented as a pair of integers such that the infinite precision mr is
equal to mr_int + mr_frac / 2**e, and only the delta (run length) from one
index to the next is computed.
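
A sketch of that integer-only bookkeeping (variable names mirror the C fields,
but this is an illustration rather than the patch itself): the pair
(mr_int, mr_frac) represents mr exactly, and a running fractional accumulator
emits each run length as the delta between consecutive values of int(mr * i).

```python
MAX_MINRUN = 64

def minruns(n):
    # Represent the ideal run length n / 2**e exactly as the pair
    # (mr_int, mr_frac), i.e. mr == mr_int + mr_frac / 2**e.
    mr_int, e = n, 0
    while mr_int >= MAX_MINRUN:
        mr_int >>= 1
        e += 1
    mask = (1 << e) - 1
    mr_frac = n & mask
    acc = 0                        # (mr_frac * i) mod 2**e, kept incrementally
    for _ in range(1 << e):
        acc += mr_frac
        yield mr_int + (acc >> e)  # integer part plus the carry out of the fraction
        acc &= mask

sizes = list(minruns(2119))        # here e == 6, so 64 runs summing to 2119
```

Each yielded length equals int(mr * (i+1)) - int(mr * i), so no starting index
(and no float) is ever computed.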


The Merge Pattern
@@ -820,3 +837,80 @@ partially mitigated by pre-scanning the data to determine whether the data is
homogeneous with respect to type. If so, it is sometimes possible to
substitute faster type-specific comparisons for the slower, generic
PyObject_RichCompareBool.

MINRUN CODE
from itertools import accumulate
try:
    from itertools import batched
except ImportError:
    from itertools import islice
    def batched(xs, k):
        it = iter(xs)
        while chunk := tuple(islice(it, k)):
            yield chunk

MAX_MINRUN = 64

def gen_minruns(n):
    # mr_int = minrun's integral part
    # mr_frac = minrun's fractional part with mr_e bits and
    #           mask mr_mask
    mr_int = n
    mr_e = 0
    while mr_int >= MAX_MINRUN:
        mr_int >>= 1
        mr_e += 1
    mr_mask = (1 << mr_e) - 1
    mr_frac = n & mr_mask

    mr_current_frac = 0
    while True:
        mr_current_frac += mr_frac
        assert mr_current_frac >> mr_e <= 1
        yield mr_int + (mr_current_frac >> mr_e)
        mr_current_frac &= mr_mask

def chew(n, show=False):
    if n < 1:
        return

    sizes = []
    tot = 0
    for size in gen_minruns(n):
        sizes.append(size)
        tot += size
        if tot >= n:
            break
    assert tot == n
    if show:
        print(n, len(sizes))

    small, large = 32, 64
    while len(sizes) > 1:
        assert not len(sizes) & 1
        assert len(sizes).bit_count() == 1 # i.e., power of 2
        assert sum(sizes) == n
        assert min(sizes) >= min(n, small)
        assert max(sizes) <= large

        d = set(sizes)
        assert len(d) <= 2
        if len(d) == 2:
            lo, hi = sorted(d)
            assert lo + 1 == hi

        mr = n / len(sizes)
        for i, s in enumerate(accumulate(sizes, initial=0)):
            assert int(mr * i) == s

        newsizes = []
        for a, b in batched(sizes, 2):
            assert abs(a - b) <= 1
            newsizes.append(a + b)
        sizes = newsizes
        small = large
        large *= 2

    assert sizes[0] == n

for n in range(2_000_001):
    chew(n)