user3047059 - 1 year ago
Python Question

Why does Python's float raise ValueError for some very long inputs?

On my Python 2.7.9 on x64 I see the following behavior:

>>> float("10"*(2**28))
>>> float("10"*(2**29))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010
>>> float("0"*(2**33))
>>> float("0." + "0"*(2**32))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Unless there's some deeper rationale I'm missing, this violates least surprise. When I got the ValueError on float("10"*(2**29)) I figured it was just a limitation on very long strings, but then float("0"*(2**33)), an even longer string, worked. What's going on? Can anyone justify why this behavior isn't a POLA bug (if perhaps a relatively unimportant one)?


Because the zeros are skipped when inferring the base

I like to look to my favourite reference implementation for questions like this.

The Proof

Casevh has a great intuition in the comments. Here's the relevant code:

n = base;
for (bits_per_char = -1; n; ++bits_per_char)
    n >>= 1;

/* n <- total # of bits needed, while setting p to end-of-string */
while (_PyLong_DigitValue[Py_CHARMASK(*p)] < base)
    ++p;
*str = p;

/* n <- # of Python digits needed, = ceiling(n/PyLong_SHIFT). */
n = (p - start) * bits_per_char + PyLong_SHIFT - 1;
if (n / bits_per_char < p - start) {
    PyErr_SetString(PyExc_ValueError,
                    "long string too large to convert");
    return NULL;
}
Where p is initially set to the pointer to your string. If we look at the _PyLong_DigitValue table, we see that '0' is explicitly mapped to 0.
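To make the overflow guard above concrete, here's a rough Python sketch of the same check. This is not CPython's code: PYLONG_SHIFT and SSIZE_BITS are assumptions standing in for PyLong_SHIFT and a 64-bit signed Py_ssize_t, and the wrap-around is simulated explicitly since Python ints don't overflow.

```python
PYLONG_SHIFT = 15   # bits per internal digit in Python 2's longobject.c
SSIZE_BITS = 64     # assuming a 64-bit build

def python_digits_needed(num_chars, base):
    """Mimic the bit-count check: raise if num_chars * bits_per_char
    would overflow a signed C integer, else return the digit count."""
    bits_per_char = base.bit_length() - 1        # base 2 -> 1, base 16 -> 4
    n = num_chars * bits_per_char + PYLONG_SHIFT - 1
    # simulate C two's-complement wrap-around of the multiplication
    n = (n + 2**(SSIZE_BITS - 1)) % 2**SSIZE_BITS - 2**(SSIZE_BITS - 1)
    if n // bits_per_char < num_chars:           # wrapped -> overflow detected
        raise ValueError("long string too large to convert")
    return n // PYLONG_SHIFT                     # ceiling(bits / SHIFT)

print(python_digits_needed(100, 2))   # a 100-char binary literal -> 7 digits
```

The key point is that the check is driven purely by the number of characters consumed by the digit-scanning loop, so any characters skipped before that loop never count against the limit.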

Python does a lot of extra work to optimize the conversion of particular bases (there's a fun 200-line comment about converting binary!), which is why it works so hard to infer the correct base first. In this case, zeros can be skipped while inferring the base, so they don't count in the overflow calculation.

Indeed, we are checking how many bits are needed to store this number, but Python is smart enough to drop leading zeros from that calculation. I don't see anything in the docs of the float function guaranteeing this behaviour across implementations. They ominously state only:

Convert a string or number to a floating point number, if possible.

When This Doesn't Work

When you write

   float("0." + "0"*(2**32))

it stops parsing for the base early on, at the decimal point: all of the remaining zeros are considered in the bit-length calculation, and contribute to raising the ValueError.
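A toy digit counter can illustrate the asymmetry. This is not CPython's parser: MAX_DIGITS is a made-up cap standing in for the C integer overflow, but the skip-then-count structure mirrors the behaviour described above.

```python
MAX_DIGITS = 1000   # made-up cap standing in for the C integer overflow

def count_counted_digits(s):
    """Count digits the way described above: leading zeros are skipped
    before counting, but zeros after the decimal point are significant."""
    i = 0
    while i < len(s) and s[i] == '0':   # leading zeros carry no information
        i += 1
    nd = 0
    for c in s[i:]:
        if c == '.':
            continue                    # the point itself is not a digit
        if not c.isdigit():
            raise ValueError("unexpected character: %r" % c)
        nd += 1                         # every digit from here on counts
    if nd > MAX_DIGITS:
        raise ValueError("too many digits")
    return nd

print(count_counted_digits("0" * 5000))         # all skipped -> 0
print(count_counted_digits("0." + "0" * 500))   # post-point zeros count -> 500
```

With this cap, "0" * 5000 sails through with a count of zero, while "0." + "0" * 2000 trips the limit, matching the shape of the behaviour in the question.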

Similar Parsing Tricks

Here's a similar case in the float parsing code, where we find that whitespace is ignored (and an interesting comment from the authors on their intent with this design choice):

while (Py_ISSPACE(*s))
    s++;

/* We don't care about overflow or underflow.  If the platform
 * supports them, infinities and signed zeroes (on underflow) are    
 * fine. */
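That whitespace-skipping loop is easy to observe from the interpreter: surrounding spaces, tabs, and newlines are all accepted by float().

```python
# float() ignores leading and trailing whitespace, per the loop quoted above
print(float("  1.5\n"))    # -> 1.5
print(float("\t-2.25 "))   # -> -2.25
```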