Maximus Maximus - 1 month ago 9
Javascript Question

What is the minimum number I need to add to get Infinity for 1 byte floating point

I'm trying to understand what minimum number I need to add to get Infinity because of overflow. I've read this answer already. So let me just clarify my understanding here. To simplify, I'll be working with 1 byte floating point with 4 bits for exponent and 3 bits for mantissa:

0 0000 000


The maximum positive number I can store in it is this:

0 1110 111


which, when converted to scientific notation, is:

1.111_2 × 2^{7} = 11110000_2
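To check this value, here is a minimal JavaScript sketch of a decoder for the 1-byte format. It assumes IEEE-754-style conventions that the question does not spell out: an exponent bias of 7, an implicit leading 1 for normal numbers, subnormals when the exponent field is 0000, and Infinity/NaN when it is 1111. The name decodeMini is made up for illustration.

```javascript
// Decode an 8-bit pattern: 1 sign bit, 4 exponent bits, 3 mantissa bits.
// Assumption: IEEE-754-style layout with exponent bias 7.
function decodeMini(bits) {
  const sign = (bits >> 7) & 1 ? -1 : 1;
  const exp = (bits >> 3) & 0b1111;
  const frac = bits & 0b111;
  if (exp === 0b1111) return frac === 0 ? sign * Infinity : NaN; // reserved exponent
  if (exp === 0) return sign * (frac / 8) * 2 ** -6;             // subnormal, no implicit 1
  return sign * (1 + frac / 8) * 2 ** (exp - 7);                 // normal number
}

const fmax = decodeMini(0b01110111); // the pattern 0 1110 111
console.log(fmax);                   // 240
console.log(fmax.toString(2));       // "11110000"
```

Under these assumptions the next exponent field, 1111, is reserved, so 0 1110 111 is indeed the largest finite value.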


Is my understanding correct that the minimum number I should add to get Infinity is 00010000:

   11110000
+  00010000
-----------
 1 00000000


As I understand, anything less than 00010000 will not cause overflow and the result will be rounded to 11110000. But 00010000 is 0 0000 001 in floating point format, and it's the number 1. So is adding just 1 enough to cause overflow?

Answer

The answer is given in the other answer to the question you link to. The smallest value which will round to infinity is:

c = 2^{7} × (2 − ½ × 2^{1−4}) = 1.9375 × 2^{7} = 1.1111_2 × 2^{7}

So the smallest value that you can add to get infinity is

c − fmax = 1.1111_2 × 2^{7} − 1.111_2 × 2^{7} = 0.0001_2 × 2^{7} = 2^{3}

which, if I understand correctly, would have bit pattern 0 1010 000 in your proposed format.
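The arithmetic above can be double-checked with a few lines of JavaScript. This is only a sketch: the names p, emax and powerOfTwoPattern are illustrative, and it assumes an exponent bias of 7 and a precision of 4 bits (3 stored mantissa bits plus the implicit leading 1), which the question implies but does not state.

```javascript
// Assumed parameters of the 1-byte format from the question.
const p = 4;     // precision: 3 stored mantissa bits + 1 implicit bit
const emax = 7;  // largest exponent, assuming bias 7

const fmax = (2 - 2 ** (1 - p)) * 2 ** emax;     // 1.111_2 × 2^7
const c = (2 - 0.5 * 2 ** (1 - p)) * 2 ** emax;  // smallest value that rounds to Infinity
const gap = c - fmax;                            // smallest addend that overflows

// Bit pattern of the power of two 2^e: sign 0, exponent field e + bias, mantissa 000.
function powerOfTwoPattern(e, bias = 7) {
  const field = (e + bias).toString(2).padStart(4, "0");
  return `0 ${field} 000`;
}

console.log(fmax, c, gap);          // 240 248 8
console.log(powerOfTwoPattern(3));  // "0 1010 000"
```

So the smallest addend is 8, not 1. With the same bias, the pattern 0 0000 001 from the question would be a subnormal far smaller than 1.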

UPDATE: so why is it this particular cutoff?

Suppose there were another binade above this one. Then the next floating point number would be

x = 1.000_2 × 2^{8}

Note that c is the value that is exactly halfway between x and fmax. In other words, the values which would round up to x are instead rounded to infinity, but the values which would round down to fmax still round to the same value.
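To see the cutoff in action, here is a rough JavaScript simulation of round-to-nearest, ties-to-even for the 1-byte format. The function roundToMini is hypothetical, handles only finite non-negative inputs, and assumes precision 4, emax 7 and emin −6. The last two lines apply the same formula to JavaScript's own doubles (precision 53, emax 1023), where the gap works out to 2^970.

```javascript
// Round a finite, non-negative value to the assumed 1-byte format
// using round-to-nearest, ties-to-even. Illustrative only.
function roundToMini(v) {
  const p = 4, emax = 7, emin = -6;
  if (v === 0) return 0;
  // Find the binade of v, pretending the exponent range is unbounded above.
  let e = emin;
  while (2 ** (e + 1) <= v) e++;
  const ulp = 2 ** (e - (p - 1)); // spacing of representable values in this binade
  const q = v / ulp;              // exact, since ulp is a power of two
  let n = Math.floor(q);
  const rem = q - n;
  if (rem > 0.5 || (rem === 0.5 && n % 2 === 1)) n += 1; // ties go to the even neighbour
  const r = n * ulp;
  // A result that lands in the imaginary binade above fmax overflows.
  return r >= 2 ** (emax + 1) ? Infinity : r;
}

console.log(roundToMini(240 + 7)); // 240
console.log(roundToMini(240 + 8)); // Infinity

// Same rule for real doubles: the gap is 2^1023 × 2^-53 = 2^970.
console.log(Number.MAX_VALUE + 2 ** 969 === Number.MAX_VALUE); // true
console.log(Number.MAX_VALUE + 2 ** 970 === Infinity);         // true
```

The tie at exactly c goes upward because fmax has an odd mantissa (all ones) while x has an even one.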