thomleo thomleo - 2 months ago 16
Python Question

Working with floating point numbers for comparison and related operations in NumPy

I have an array of random floats and I need to compare it to another one that has the same values in a different order. For that matter I use the sum, product (and other combinations depending on the dimension of the table hence the number of equations needed).

Nevertheless, I encountered a precision issue when I perform the sum (or product) on the array depending on the order of the values.

Here is a simple standalone example to illustrate this issue :

import numpy as np

n = 10
m = 4

tag = np.random.rand(n, m)

s1 = np.sum(tag, axis=1)
s2 = np.sum(tag[:, ::-1], axis=1)

# print the number of times s1 is not equal to s2 (should be 0)
print np.nonzero(s1 != s2)[0].shape[0]


If you execute this code it sometimes tells you that
s1
and
s2
are not equal and the differents is of magnitude of the computer precision.

The problem is I need to use those in functions like
np.in1d
where I can't really give a tolerance...

Is there a way to avoid this issue?

Answer

For the listed code, you can use np.isclose and with it tolerance values could be specified too.

Using the provided sample, let's see how it could be used -

In [201]: n = 10
     ...: m = 4
     ...: 
     ...: tag = np.random.rand(n, m)
     ...: 
     ...: s1 = np.sum(tag, axis=1)
     ...: s2 = np.sum(tag[:, ::-1], axis=1)
     ...: 

In [202]: np.nonzero(s1 != s2)[0].shape[0]
Out[202]: 4

In [203]: (~np.isclose(s1,s2)).sum() # So, all matches!
Out[203]: 0

To make use of tolerance values in other scenarios, we need to work on a case-by-case basis. So, let's say for an implementation that involve elementwise comparison like in np.in1d, we can bring in broadcasting to do those elementwise equality checks for all elems in first input against all elems in the second one. Then, we use np.abs to get the "closeness factor" and finally compare against the input tolerance to decide the matches. As needed to simulate np.in1d, we do ANY operation along one of the axis. Thus, np.in1d with tolerance using broadcasting could be implemented like so -

def in1d_with_tolerance(A,B,tol=1e-05):
    return (np.abs(A[:,None] - B) < tol).any(1)

As suggested in the comments by OP, we can also round floating-pt numbers after scaling them up and this should be memory efficient, as being needed for working with large arrays. So, a modified version would be like so -

def in1d_with_tolerance_v2(A,B,tol=1e-05):
    S = round(1/tol)
    return np.in1d(np.around(A*S).astype(int),np.around(B*S).astype(int))

Sample run -

In [372]: A = np.random.rand(5)
     ...: B = np.random.rand(7)
     ...: B[3] = A[1] + 0.0000008
     ...: B[6] = A[4] - 0.0000007
     ...: 

In [373]: np.in1d(A,B) # Not the result we want!
Out[373]: array([False, False, False, False, False], dtype=bool)

In [374]: in1d_with_tolerance(A,B)
Out[374]: array([False,  True, False, False,  True], dtype=bool)

In [375]: in1d_with_tolerance_v2(A,B)
Out[375]: array([False,  True, False, False,  True], dtype=bool)

Finally, on how to make it work for other implementations and use cases - It would depend on the implementation itself. But for most cases, np.isclose and broadcasting should help.