Skip to content

gh-135487: fix reprlib.Repr.repr_int when given very large integers #135506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

picnixz
Copy link
Member

@picnixz picnixz commented Jun 14, 2025

Getting the number of digits may be wrong due to some round-off errors but I don't want to overcomplicate the code.

@picnixz picnixz force-pushed the fix/reprlib/value-error-digits-135487 branch from a50f94a to 4ac21bb Compare June 14, 2025 10:57
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can try to reproduce the result of old Python versions, before introducing the limitation on maximal digit count.

If the length exceeds maxlong, repr_int() returns the first i digits followed by fillvalue followed by the last j digits. I afraid that calculating the last j digits has the same complexity as calculating all digits. But for the first i digits it may be faster.

We could also use 10**math.floor(math.log(x)) to get first digits. There are several caveats. Firs, the precision of float is limited, and the larger x, the less significant digits we can get. But it still can be more than 10 digits in most practical cases. We can estimate the error and the number of reliable digits. Second, even with best effort, it can give wrong even the first digit -- it is difficult to differentiate 9999... from 10000.... In this case we should calculate the first i digits, multiply them by 100**(k-i), and then calculate the correction. This can be repeated iteratively, so we can get arbitrary number of first digits.

cc @tim-one

Lib/reprlib.py Outdated
Comment on lines 195 to 197
# When math.log10(abs(x)) is overestimated or underestimated,
# the number of digits should be k - 1 or k + 1 respectively.
# For simplicity, we do not compute the exact number of digits.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can differ more than by 1 for large numbers with more than 2**53 digits. Yes, it will take long time (years?) to calculate the decimal digits, but let be more future proof.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I actually didn't consider the possibility of having 2**53 digits...

@MarcellPerger1
Copy link
Contributor

I afraid that calculating the last j digits has the same complexity as calculating all digits

Calculating the last few digits of a number N with n digits is $O(n)$ using some modular arithmetic tricks:
We can find the last j digits of N by doing something like (pseudo-Python) sum((2**i) % (10**j) for i, v in enumerate(N.get_bits()) if v == 1). 1
We can calculate (2**i) % (10**j) more efficiently:

  • First, notice how the last j digits of 2**i repeat after a while (e.g. the last 2 digits of $2^0,2^1,\ldots$ is 1,2,4,8,16,32,64,28,56,12,24,48,96,92,84,68,36,72,44,88,76,52,4,8,16,...).
  • So, ahead of time, we can work out where this cycle loops back round to and then use 2 lookup tables2 to find the last j digits of (2**i) and this will be used as (2**i) % (10**j).
  • This will only involve only a subtraction and a i % (some value) and as mod is O(digits in i), this takes at most $O(\log n)$ time (although this is rather insignificant in practice as any integer where number of digits > $2^{64}$ wouldn't fit into memory so log(n) = log(number of digits) must be < 64)

Therefore, as we need to add $O(n)$ things each taking $O(\log n)$ time to compute, we can find the last few digits of an integer in $O(n\log n)$.
And I'm pretty sure calculating all digits has a worse time complexity than that.

Footnotes

  1. Also, we would have to mod (10**j) after every few addition steps to keep the addition operation constant-time

  2. One for the initial non-repeating section (1,2 in the example above) and another one for the repeating section 4,8,16,32,...,76,52

@picnixz
Copy link
Member Author

picnixz commented Jun 21, 2025

First of all, the "hard" issue only appears when we're considering integers with more than $2^{53}$ digits. We can handle those under that limit by knowing that math.log will be off by 1 (I think the off-by-1 issue only holds for $\log_{10}(N) &lt; 2^{53}$).

So, if we were to enumerate over the bits of N to get the last digits, we will need to iterate over $\log_2(N) \approx 2^{53}$ bits... which is impossible. In particular,

log(n) = log(number of digits)

is incorrect. It is possible to have a number with $2^{48}$ digits and it is also possible to have a number with $2^{64}$ digits in the future.


Some comments to your solution:

sum((2**i) % (10**j) for i, v in enumerate(N.get_bits()) if v == 1)

This is not entirely accurate. What should be more accurate is:

def tail(n, j):
    M = 10 ** j
    bits = reversed(format(n, 'b'))
    res = sum((2 ** i) % M for i, v in bits if v == '1') % M
    return str(res)

There is a small correction to make at the end where you need to apply once again the modulus.

So, ahead of time, we can work out where this cycle loops back round to and then use 2 lookup tables

The lookup table requires $j$ but $j$ depends on maxstring. So we can't build the lookup table in advance (we don't know how large $j$ will be and it would be meaningless to build it for each instance, or hold a global cache as it could grow indefinitely).

More generally, the pattern is $(2^0, 2^1, ..., 2^{(j-1)}, \gamma, \gamma, \cdots)$ with $\gamma$ the cycle. For $j=2$, the cycle has length 20, for $j = 3$, it has length 100 and for $j = 4$ it has length 500 and for $j = 5$ it has length 2500 (it seems to be multiplied by 5 every time, though it's just an observation). It may grow fast very quick! (though we might have a general formula solely depending on $i$ and $j$ but that wouldn't solve the issue with the fact that we need to iterate over the bits of a very large integer)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants