Bhishan Poudel - 5 months ago
Python Question

Efficient way to extract some lines from a large file with some conditions in Python

I have seen lots of similar questions to this one, but none of them helped me.

I have a large data file (the full data can be obtained from http://pastebin.com/KCW9phzX) with contents like this:

# data can be obtained from pastebin
# large_data.csv
# lots of comments
# number 1 1 err
# wave flux err
0.807172 7.61973e-11 1.18177e-13
0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13
0.808185 7.63852e-11 1.19324e-13
0.808387 7.60547e-11 1.18952e-13
0.80859 7.52287e-11 1.18016e-13
0.808792 7.53114e-11 1.18979e-13
0.808995 7.58247e-11 1.20198e-13
# lots of other lines


I want to extract a range of rows based on the zeroth column: let's say, all rows where the wavelength is >= 0.807375 and <= 0.807982.
So far I have done this:

# extract a wavelength range from the file
import numpy as np

# lines starting with '#' are skipped automatically (genfromtxt's default comments='#')
col0, col1, col2 = np.genfromtxt('large_data.csv', usecols=(0, 1, 2),
                                 dtype=float, unpack=True)
col0ex = [x for x in col0 if 0.807375 <= x <= 0.807982]
print(col0ex)


Here, I read all the wavelength data from the text file and then extracted the values that satisfy the conditions. Certainly this is not the most efficient way. Also, according to this link:

http://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy/8964779#=

numpy.genfromtxt and numpy.loadtxt use lots of memory.
What would be a better way of doing this?

Moreover, I want the other columns in the output file as well, not just the wavelength.

The required output file is:

#fileout.csv
0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13


Here, for simplicity, I chose lines 2 to 4 of the data; please do not post an answer that simply extracts lines 2 to 4 by position. This is just an example: I need conditional extraction, e.g. all rows whose first-column value (wavelength) lies between 1.57 and 2.0, and so on.

How can we do this?

Any help will be appreciated!

Some very similar questions:

efficient way to extract few lines of data from a large csv data file in python

What is the most efficient way to match list items to lines in a large file in Python?

Extract specific lines from file and create sections of data in python

how to extract elements from a list in python?

How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?

genfromtxt and numpy

Answer

You could call np.genfromtxt(f, max_rows=chunksize) in a loop to read the file in chunks. This way you can retain the convenience and speed of NumPy arrays while controlling the amount of memory required by adjusting chunksize.

import numpy as np
import warnings
# genfromtxt warns if it encounters an empty "file" (e.g. when the input is
# exhausted). Silence these warnings since the code below handles that case.
warnings.filterwarnings("ignore", message='genfromtxt', category=UserWarning)

# This reads 2 lines at a time
chunksize = 2
with open('large_data.csv', 'rb') as fin, open('out.csv', 'w+b') as fout:
    while True:
        arr = np.genfromtxt(fin, max_rows=chunksize, usecols=(0, 1, 2),
                            dtype=float)
        if arr.size == 0: break
        # a chunk of a single row comes back 1-D; make it 2-D so the mask works
        arr = np.atleast_2d(arr)
        mask = (arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)
        arr = arr[mask]

        # uncomment this print statement to confirm the file is being read in chunks
        # print('{}\n{}'.format(arr, '-'*80))
        np.savetxt(fout, arr, fmt='%g')

writes to out.csv:

0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13

For a large data file you'll of course want to increase chunksize to some integer much larger than 2. Generally you'll get the best performance by choosing chunksize as large as possible while still operating on arrays that fit in RAM.
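For instance, here is a rough back-of-the-envelope way to pick chunksize; the 100 MB budget and the overhead factor are just illustrative assumptions, not measured values:

ram_budget = 100 * 1024**2      # assumed memory budget for one chunk, in bytes
bytes_per_row = 3 * 8           # three float64 columns per row
overhead = 10                   # rough allowance for parsing temporaries
chunksize = ram_budget // (bytes_per_row * overhead)  # ~436906 rows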


The code above is intended for large files. For a file with only 7000 lines,

import numpy as np
with open('large_data.csv', 'rb') as fin, open('out.csv', 'w+b') as fout:
    # small enough to read in one go
    arr = np.genfromtxt(fin, usecols=(0, 1, 2), dtype=float)
    mask = (arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)
    arr = arr[mask]
    np.savetxt(fout, arr, fmt='%g')

suffices.
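
As an aside, if you don't need the rows as a NumPy array at all, a plain line-by-line filter keeps memory usage constant regardless of file size. A minimal sketch, assuming whitespace-separated columns and '#' comment lines as in the sample data:

with open('large_data.csv') as fin, open('out.csv', 'w') as fout:
    for line in fin:
        parts = line.split()
        # skip comment lines and blank lines
        if not parts or parts[0].startswith('#'):
            continue
        wave = float(parts[0])
        if 0.807375 <= wave <= 0.807982:
            fout.write(line)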
