flyingrose - 23 days ago
Python Question

How do I get a specific chunk from a Python iterator?

I was reading a very big CSV file (10 GB) using pandas, and read_csv(filename, chunksize=chunksize) returned an iterator (assume it is named 'reader'). Now I want to get one specific chunk, because I only need a certain few lines (for example, the CSV file has 1,000,000,000 lines, and I want line 50,000,000 and the 1000 lines after it). What can I do other than traversing the iterator until it reaches the chunk I want?

Here is my former code:

import pandas as pd

def get_lines_by_chunk(file_name, line_beg, line_end, chunk_size=-1):
    line_no = get_file_line_no(file_name)  # helper: total line count of the file

    if chunk_size < 0:
        chunk_size = get_chunk_size(line_no, line_beg, line_end)

    reader = pd.read_csv(file_name, chunksize=chunk_size)
    data = pd.DataFrame({})

    flag = 0  # number of chunks consumed so far

    for chunk in reader:
        line_before = flag * chunk_size   # first line of this chunk
        flag = flag + 1
        line_after = flag * chunk_size    # first line of the next chunk
        if line_beg >= line_before and line_beg <= line_after:
            if line_end >= line_after:
                temp = chunk[line_beg - line_before : chunk_size]
                data = pd.concat([data, temp], ignore_index=True)
            else:
                temp = chunk[line_beg - line_before : line_end - line_before]
                data = pd.concat([data, temp], ignore_index=True)
                return data
        elif line_end <= line_after and line_end >= line_before:
            temp = chunk[0 : line_end - line_before]
            data = pd.concat([data, temp], ignore_index=True)
            return data
        elif line_beg < line_before and line_end > line_after:
            temp = chunk[0 : chunk_size]
            data = pd.concat([data, temp], ignore_index=True)

    return data
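For reference, the traversal above can be written more compactly with itertools.islice, which drops whole chunks before the target range (a sketch only; the skipped chunks are still parsed by pandas, so it is no faster asymptotically):

```python
import itertools

import pandas as pd

def get_lines_by_chunk(file_name, line_beg, line_end, chunk_size):
    """Collect data rows [line_beg, line_end) by iterating fixed-size chunks."""
    reader = pd.read_csv(file_name, chunksize=chunk_size)
    first = line_beg // chunk_size          # index of the first chunk we need
    last = (line_end - 1) // chunk_size     # index of the last chunk we need
    parts = []
    for i, chunk in enumerate(itertools.islice(reader, first, last + 1)):
        offset = (first + i) * chunk_size   # absolute row number of this chunk's start
        parts.append(chunk.iloc[max(line_beg - offset, 0):line_end - offset])
    return pd.concat(parts, ignore_index=True)
```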

Answer

If you need to read your CSV file in differently-sized chunks, you can use iterator=True:

Assuming we have a 1000-row DataFrame (see the Setup section below for how it was generated):

In [103]: reader = pd.read_csv(fn, iterator=True)

In [104]: reader.get_chunk(5)
Out[104]:
   a   b
0  1   8
1  2  28
2  3  85
3  4  56
4  5  29

In [105]: reader.get_chunk(3)
Out[105]:
   a   b
5  6  55
6  7  16
7  8  96

NOTE: get_chunk can't skip data; it keeps reading sequentially in the specified chunk sizes
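In other words, to reach a later region with get_chunk you still read (and discard) everything before it. A minimal self-contained sketch (using an in-memory CSV in place of a file):

```python
import io

import pandas as pd

# 1000 data rows: a == 0..999 (in-memory stand-in for a CSV file)
csv = io.StringIO("\n".join(["a"] + [str(i) for i in range(1000)]))

reader = pd.read_csv(csv, iterator=True)
_ = reader.get_chunk(100)    # read and discard the first 100 rows
rows = reader.get_chunk(10)  # the next 10 rows: a == 100..109
```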

If you want to read only rows 100-109:

In [106]: cols = pd.read_csv(fn, nrows=1).columns.tolist()

In [107]: cols
Out[107]: ['a', 'b']

In [109]: pd.read_csv(fn, header=None, skiprows=100, nrows=10, names=cols)
Out[109]:
     a   b
0  100  52
1  101  15
2  102  74
3  103  10
4  104  35
5  105  73
6  106  48
7  107  49
8  108   1
9  109  56
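Applied to the numbers from the question (row 50,000,000 and the 1000 rows after it), the same skiprows/nrows pattern could be wrapped up as follows (read_row_range is a hypothetical helper, not a pandas API):

```python
import pandas as pd

def read_row_range(fn, start, n):
    """Read n data rows starting at 0-based data-row position `start`."""
    cols = pd.read_csv(fn, nrows=1).columns.tolist()         # header only
    return pd.read_csv(fn, header=None, skiprows=start + 1,  # +1 skips the header line
                       nrows=n, names=cols)

# e.g. read_row_range(fn, 50_000_000, 1000)
```

An integer skiprows still scans the skipped lines, but it does not build DataFrames for them, so it is far cheaper than concatenating chunks.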

But if you can use the HDF5 format, it will be much easier and faster.

Let's save it as HDF5 first:

In [110]: df.to_hdf('c:/temp/test.h5', 'mydf', format='t', data_columns=True, complib='blosc', complevel=9)

Now we can read it by index position as follows:

In [113]: pd.read_hdf('c:/temp/test.h5', 'mydf', start=99, stop=109)
Out[113]:
       a   b
99   100  52
100  101  15
101  102  74
102  103  10
103  104  35
104  105  73
105  106  48
106  107  49
107  108   1
108  109  56

Or query it (SQL-like):

In [115]: pd.read_hdf('c:/temp/test.h5', 'mydf', where="a >= 100 and a <= 110")
Out[115]:
       a   b
99   100  52
100  101  15
101  102  74
102  103  10
103  104  35
104  105  73
105  106  48
106  107  49
107  108   1
108  109  56
109  110  23

Setup:

In [99]: df = pd.DataFrame({'a':np.arange(1, 1001), 'b':np.random.randint(0, 100, 1000)})

In [100]: fn = r'C:\Temp\test.csv'

In [101]: df.to_csv(fn, index=False)

In [102]: df.shape
Out[102]: (1000, 2)