norpa norpa - 1 year ago 86
Python Question

Counting repeated (duplicated) in a list using dictionary

I'm writing a program to identify repeated values and their count in a particular column (named 'StrId') in an Excel spreadsheet. Besides finding repetitions, I need to know how many times each value is repeated.

The Excel data was processed as a list of dictionaries (one dictionary per row) with headers as keys and data as values, like [{'StrId' : 1, 'ProjId' : 358}][{'StrId' : 2, 'ProjId' : 984...}] etc.

My plan was to first identify the 'StrId' keys in each dictionary, put them in a list, and then create another dictionary within that list to pass values and separate when there is more than 1 value, counting those that show up more than once.

Here is my code. Right now, it shows a 'KeyError' message with the fist value, and stops.

I'd appreciate any help. Thanks in advance

from openpyxl import load_workbook
workbook = load_workbook('./fullallreadyconversionxmlclean4.xlsx')
sheet = workbook['Full-All']
headers = ["StrId", "ProjectId", "TweetText", "Label"]

excel_data = []
for row_num, row in enumerate(sheet):
if row_num is 0:
row_data = {}
for col_num, cell in enumerate(row):
if col_num > len(headers) - 1:
key = headers[col_num]
value = cell.value
row_data[key] = value

for row in excel_data:
for key in row:
if key is 'StrId':
value = row[key]
list_ids = []

dup_dic = {}
for value in list_ids:
if value in list_ids:
dup_dic[value] +=1
dup_dic[value] =1

print dup_dic

Answer Source

If the sublists can contain more than one dict you can flatten the sublists with itertools.chain :

from collections import Counter
excel_data = [
    [{'StrId': 1, 'ProjId': 358},{'StrId': 5, 'ProjId': 358}],
    [{'StrId': 2, 'ProjId': 984},{'StrId': 3, 'ProjId': 358}],
    [{'StrId': 2, 'ProjId': 984}],
    [{'StrId': 2, 'ProjId': 984}],

from collections import Counter
from itertools import chain
print(Counter(map(itemgetter("StrId"), chain(*excel_data))))

But you seem to have a list of dicts so you can remove the chain:

from collections import Counter

print(Counter(map(itemgetter("StrId"), excel_data)))

Never use if is when comparing strings, is checks the identity of and object, use == i.e if key == 'StrId' but it would make a lot more sense to just do a lookup i.e value = row["StrId"]. Also give you variables better names, row is not a very good name for a dict.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download