Darryl Dan Darryl Dan - 1 month ago 9
Python Question

grouping related data in csv excel file

this is a csv excel file

Receipt Name Address Date Time Items
25007 A ABC pte ltd 3/7/2016 10:40 Cheese, Cookie, Pie
25008 A ABC ptd ltd 3/7/2016 11.30 Cheese, Cookie
25009 B CCC ptd ltd 4/7/2016 07.35 Chocolate
25010 A ABC pte ltd 4/7/2016 12:40 Butter, Cookie


how do i retrieve the dates and time and group them to respectively company A and B such that the output would be something like: (A, 3/7/2016, 10:40, 11.30, 4/7/2016 12:40), (B, 4/7/2016, 07:35)

My existing code is:

datePattern = re.compile(r"(\d+/\d+/\d+)\s+(\d+:\d+)")
dateDict =dict()

for i, line in enumerate(open('sample_data.csv')):
for match in re.finditer(datePattern,line):
if match.group(1) in dateDict:
dateDict[match.group(1)].append(match.group(2))
else:
dateDict[match.group(1)] = [match.group(2),]


However it only works for grouping date and time but now i want to include name as part of the grouping as well. *Using csv module would be preferred

Answer

Presuming your data actually looks like:

Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese, Cookie, Pie"
25008,A,CCC pte ltd,4/7/2016,11:30,"Cheese, Cookie"
25009,B,CCC pte ltd,4/7/2016,07:35,"Chocolate"
25010,A,CCC pte ltd,4/7/2016,12:40," Butter, Cookie"

then it is pretty trivial to group:

from collections import defaultdict
from csv import reader
with open("test.csv") as f:
    next(f) # skip header
    group_dict = defaultdict(list)
    for _, name, _, dte, time, _ in reader(f):
        group_dict[name].append((dte, time))

from  pprint import pprint as pp

pp(dict(group_dict))

which would give you:

'A': [('4/7/2016', '10:40'), ('4/7/2016', '11:30'), ('4/7/2016', '12:40')],
 'B': [('4/7/2016', '07:35')]}

If you don't want the date repeating, then also group on that:

with open("test.csv") as f:
    next(f) # skip header
    group_dict = defaultdict(list)
    for _, name, _, dte, time, _ in reader(f):
        group_dict[name, dte].append(time)

from  pprint import pprint as pp

pp(dict(group_dict))

Which would give you:

{('A', '4/7/2016'): ['10:40', '11:30', '12:40'], ('B', '4/7/2016'): ['07:35']}
Comments