duckman duckman - 1 month ago 9
Python Question

large data transformation in python

I have a large data set (ten 12gb csv files) that have 25 columns and would want to transform it to a dataset with 6 columns. the first 3 columns remains the same whereas the 4th one would be the variable names and the rest contains data. Below is my input:

#RIC Date[L] Time[L] Type L1-BidPrice L1-BidSize L1-AskPrice L1-AskSize L2-BidPrice L2-BidSize L2-AskPrice L2-AskSize L3-BidPrice L3-BidSize L3-AskPrice L3-AskSize L4-BidPrice L4-BidSize L4-AskPrice L4-AskSize L5-BidPrice L5-BidSize L5-AskPrice L5-AskSize
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 44000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 38000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000


and I would transform it to:

#RIC Date[L] Time[L] level Bid_price bid_volume Ask_price Ask_volume
HOU.ALP 20150901 30:10.8 L1 5.29 50000 5.3 50000
HOU.ALP 20150901 30:10.8 L2 5.28 50000 5.31 50000
HOU.ALP 20150901 30:12.1 L3 5.27 50000 5.32 50000
HOU.ALP 20150901 30:12.1 L4 5.26 50000 5.33 50000
HOU.ALP 20150901 30:12.1 L5
HOU.ALP 20150901 30:12.1 L1 5.29 50000 5.3 50000
HOU.ALP 20150901 30:12.1 L2 5.28 44000 5.31 50000
HOU.ALP 20150901 30:12.1 L3 5.27 48000 5.32 50000
HOU.ALP 20150901 30:12.1 L4 5.26 50000 5.33 50000


Here is my attempt with the coding. I think I would have to use dictionary to write to a csv file

def depth_data_transformation(input_file_list, output_file):

for file in input_file_list:
file_to_open = '%s.csv' %file
with open(file_to_open) as f, open(output_file, "w") as out:
next(f) # skip header
cols = ["#RIC", "Date[L]", "Time[L]", "level", "Bid_price", "bid_volume", "Ask_price", "Ask_volume"]
wr = csv.writer(out)
wr.writerow(cols)
for row in csv.reader(f):
# get all but first three cols
it = row[4:]
# zip_longest(*[iter(it)] * 4, fillvalue="") -> group into 4's, add empty string for missing values
for ind, t in enumerate(izip_longest(*[iter(it)] * 4, fillvalue=""), 1):
# first 3 cols, level and group all in one row/list.
wr.writerow(row[:3]+ ["l{}".format(ind)] + list(t))

Answer

You need to group the levels, i.e L1-BidPrice L1-BidSize L1-AskPrice L1-AskSize and write each to a new row :

import csv  
from itertools import zip_longest # izip_longest python2


with open("infile.csv") as f, open("out.csv", "w") as out:
    next(f) # skip header
    cols = ["#RIC", "Date[L]", "Time[L]", "level", "Bid_price", "bid_volume", "Ask_price", "Ask_volume"]
    wr = csv.writer(out)
    wr.writerow(cols)
    for row in csv.reader(f):
        # get all but first three cols
        it = row[4:]
        # izip_longest(*[iter(it)] * 4, fillvalue="") -> group into 4's, add empty string fotr missing values
        for ind, t in enumerate(izip_longest(*[iter(it)] * 4, fillvalue=""), 1):
           # first 3 cols, level and group all in one row/list.
            wr.writerow(row[:3]+ ["l{}".format(ind)] + list(t))

Which would give you:

#RIC,Date[L],Time[L],level,Bid_price,bid_volume,Ask_price,Ask_volume
HOU.ALP,20150901,30:10.8,l1,5.29,50000,5.3,32000
HOU.ALP,20150901,30:10.8,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:10.8,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:10.8,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:10.8,l5,5.34,50000,,
HOU.ALP,20150901,30:10.8,l1,5.29,50000,5.3,44000
HOU.ALP,20150901,30:10.8,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:10.8,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:10.8,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:10.8,l5,5.34,50000,,
HOU.ALP,20150901,30:12.1,l1,5.29,50000,5.3,32000
HOU.ALP,20150901,30:12.1,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:12.1,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:12.1,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:12.1,l5,5.34,50000,,
HOU.ALP,20150901,30:12.1,l1,5.29,50000,5.3,38000
HOU.ALP,20150901,30:12.1,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:12.1,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:12.1,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:12.1,l5,5.34,50000,,

You have HOU.ALP 20150901 30:10.8 L1 5.29 50000 5.3 50000 in your expected output but 32000 is the value in your input for L1-AskSize, each row has 5 levels and you also have 8 columns so I presume your expected output is wrong.

Comments