isosceleswheel - 1 year ago 332
Python Question

Memory efficient sort of massive numpy array in Python

I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats with dimensions `(868940742, 3)`, which takes up about 20GB of memory on my machine once loaded and just sitting there. I have an early 2015 13-inch MacBook Pro with 16GB of RAM, a 500GB solid-state drive and a 3.1 GHz Intel i7 processor. Just loading the array overflows to virtual memory, but not to the point where my machine suffers or I have to stop everything else I am doing.

I build this VERY large array step by step from 22 smaller `(N, 2)` subarrays.

Function `FUN_1` generates 2 new `(N, 1)` arrays using each of the 22 subarrays, which I call `sub_arr`.

The first output of `FUN_1` is generated by interpolating values from `sub_arr[:,0]` on array `b = array([X, F(X)])`, and the second output is generated by placing `sub_arr[:,0]` into bins using array `r = array([X, BIN(X)])`. I call these outputs `b_arr` and `rate_arr`, respectively. The function returns a 3-tuple of `(N, 1)` arrays:

```python
import numpy as np

def FUN_1(sub_arr):
    """interpolate b values and rates based on position in sub_arr"""

    b_arr = np.interp(sub_arr[:, 0], b[:, 0], b[:, 1])
    rate_arr = np.searchsorted(r[:, 0], sub_arr[:, 0])  # HUGE efficiency gain over np.digitize...

    return r[rate_arr, 1], b_arr, sub_arr[:, 1]
```

I call the function 22 times in a for-loop and fill a pre-allocated array of zeros `full_arr = numpy.zeros([868940742, 3])` with the values:

```python
full_arr[:, 0], full_arr[:, 1], full_arr[:, 2] = FUN_1(sub_arr)
```

In terms of saving memory at this step, I think this is the best I can do, but I'm open to suggestions. Either way, I don't run into problems up through this point and it only takes about 2 minutes.
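Put together, the build step might look like the following small-scale sketch. Note that the tables `b` and `r`, the list `sub_arrs` and the per-chunk offset bookkeeping are toy stand-ins for illustration, not the actual data layout:

```python
import numpy as np

def FUN_1(sub_arr, b, r):
    """Interpolate b values and bin rates for one sub-array (sketch)."""
    b_arr = np.interp(sub_arr[:, 0], b[:, 0], b[:, 1])
    rate_idx = np.searchsorted(r[:, 0], sub_arr[:, 0])
    rate_idx = np.clip(rate_idx, 0, len(r) - 1)  # guard against out-of-range bins
    return r[rate_idx, 1], b_arr, sub_arr[:, 1]

# toy interpolation/bin tables and two small stand-in "chromosome" chunks
b = np.array([[0.0, 0.0], [100.0, 1.0]])
r = np.array([[0.0, 24.0], [50.0, 30.0], [100.0, 36.0]])
sub_arrs = [np.random.rand(10, 2) * 100, np.random.rand(5, 2) * 100]

# pre-allocate, then write each chunk's three output columns into its slice
full_arr = np.zeros((sum(len(s) for s in sub_arrs), 3))
start = 0
for sub_arr in sub_arrs:
    stop = start + len(sub_arr)
    for j, col in enumerate(FUN_1(sub_arr, b, r)):
        full_arr[start:stop, j] = col
    start = stop
```

Writing into pre-allocated slices like this avoids concatenating the 22 chunk results, which would briefly hold two copies of the data in memory.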

Here is the sorting routine (there are two consecutive sorts)

```python
for idx in range(2):
    sort_idx = numpy.argsort(full_arr[:, idx])
    full_arr = full_arr[sort_idx]
    # ...
    # <additional processing, return small (1000, 3) array of stats>
```

Now this sort had been working, albeit slowly (it takes about 10 minutes). However, I recently started using a larger, finer-resolution table of `[X, F(X)]` values for the interpolation step above in `FUN_1` that returns `b_arr`, and now the SORT really slows down, although everything else remains the same.

Interestingly, I am not even sorting on the interpolated values at the step where the sort is now lagging. Here are some snippets of the different interpolation files: the smaller one is about 30% smaller in each case and far more uniform in terms of values in the second column; the slower one has a higher resolution and many more unique values, so the results of interpolation are likely more unique, but I'm not sure whether this should have any kind of effect...?

bigger, slower file:

```
17399307    99.4
17493652    98.8
17570460    98.2
17575180    97.6
17577127    97
17578255    96.4
17580576    95.8
17583028    95.2
17583699    94.6
17584172    94
```

smaller, more uniform file:

```
1       24
1001    24
2001    24
3001    24
4001    24
5001    24
6001    24
7001    24
```

I'm not sure what could be causing this issue and I would be interested in any suggestions or just general input about sorting in this type of memory limiting case!

At the moment each call to `np.argsort` is generating a `(868940742,)` array of int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to sort the rows of `full_arr` you are generating another `(868940742, 3)` array of float64 values (~20 GB), since fancy indexing always returns a copy rather than a view.
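The memory figures above follow directly from the element counts and item sizes:

```python
import numpy as np

# Back-of-the-envelope check, assuming int64 indices and float64 data
n = 868940742

idx_gb = n * np.dtype(np.int64).itemsize / 1e9          # argsort index array
copy_gb = n * 3 * np.dtype(np.float64).itemsize / 1e9   # fancy-indexed copy

print(round(idx_gb, 2), round(copy_gb, 2))  # 6.95 20.85
```

So each pass of the sort loop transiently needs roughly 27 GB on top of the original array, which is why a 16 GB machine is pushed deep into swap.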

One fairly obvious improvement would be to sort `full_arr` in place using its `.sort()` method. Unfortunately, `.sort()` does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an in-place sort over one of the three columns by getting a `view` onto your array as a structured array with three float fields, then sorting by one of these fields:

```python
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
```

In this case I'm sorting `full_arr` in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns (`'f8'`) - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. `full_arr.flags.C_CONTIGUOUS == True`.
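As a sanity check, here is a small self-contained run of the structured-view sort against an `argsort`-based reference, assuming three C-contiguous float64 columns:

```python
import numpy as np

a = np.random.randn(1000, 3)
assert a.flags.c_contiguous  # required for the structured view

expected = a[np.argsort(a[:, 0])]  # reference: index-based sort (makes a copy)

# sorts `a` in place by the first column, no index array allocated
a.view('f8, f8, f8').sort(order=['f0'], axis=0)

print(np.array_equal(a, expected))  # True (ties in column 0 are vanishingly unlikely)
```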

Credit for this method should go to Joe Kington for his answer here.

Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using `np.argsort` to generate an index array, as you mentioned in the comments below (see this previous question). If you use `np.argsort` to obtain a set of indices to sort by, you might see a modest performance gain by using `np.take` rather than direct indexing to get the sorted array:

```python
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop

%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop
```

However, I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.

Regarding your question about why sorting the second array is faster - yes, you should expect sorting to be faster when there are fewer unique values in the array.

By default, `np.ndarray.sort()` uses Quicksort. The `qsort` variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot require no further sorting. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.

For example:

```python
# np.random.random_integers (used originally) has since been removed from
# NumPy; np.random.randint with an exclusive upper bound is equivalent here
%%timeit -n 1 -r 100 x = np.random.randint(0, 11, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop

%%timeit -n 1 -r 100 x = np.random.randint(0, 1001, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop
```

In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size compared with the larger array then the cost of copying it due to the fancy indexing will also be smaller.
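For instance, the copy produced by fancy indexing is exactly half the size when the data are float32 rather than float64:

```python
import numpy as np

x64 = np.random.randn(10000, 2)   # 8 bytes per element
x32 = x64.astype(np.float32)      # 4 bytes per element

c64 = x64[np.argsort(x64[:, 0])]  # fancy indexing copies the whole array
c32 = x32[np.argsort(x32[:, 0])]

print(c64.nbytes, c32.nbytes)  # 160000 80000
```

Of course, float32 halves the precision as well, so whether this is acceptable depends on the range and resolution of the genomic values being stored.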
