Andrej Andrej - 3 months ago 15
Python Question

Count distinct IDs for each year in Python

Suppose we have a file like this (the first column is the ID number and the second column is the year):


The task is to count how many distinct IDs there are for each year. For example, in our case the output file should be:


What is the most appropriate programming structure to count these frequencies? My initial attempt is pasted below. (Current code prints only the number of distinct IDs.)

tmp_id = ''
counter = 0
with open('test.txt') as my_file:
    for line in my_file:
        id, year = line.strip().split('|')
        if (id != tmp_id):
            counter = counter + 1
            tmp_id = id


This is how I would do it:

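my_file = r'..\dummy.csv'

my_dict = dict()
with open(my_file, 'r') as f:
    for rows in f:
        v, k = rows.strip().split('|')  # v is the ID, k is the year
        # create the list for this year if needed, then append the ID
        my_dict.setdefault(k, []).append(v)

for keys, values in my_dict.items():
    # set() drops duplicate IDs; len() counts what is left
    my_dict[keys] = len(set(values))

print(my_dict)  # prints -> {'2013': 1, '2014': 2, '2015': 2}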

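outfile = r'..\counts.txt'  # assumed output path, adjust as needed

with open(outfile, 'w') as f_out:
    for k, v in my_dict.items():
        f_out.write(k + '|' + str(v) + '\n')  # v is an int, so convert it first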

What's happening is that a dict is created to store the data. A dict is the natural choice because the associations (years to whatever the first column is) have to be preserved. This is what dicts are for.

Then we loop through the file and add each year to the dictionary as a key if it does not exist yet, using the convenient .setdefault() method. Each value is initialised as an empty list, to which the numbers from the first column get appended.

When everything is collected, we go through the dict once more and replace every value (a list) with the length of the corresponding set. Note that passing a list to set() removes the duplicates present in the list. Finally, len() returns the number of items in the container.
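As a variation on the collect-then-deduplicate steps above, here is a minimal sketch that stores the IDs in sets from the start, so the duplicates never pile up. It uses collections.defaultdict from the standard library. The sample data and the function name count_ids_per_year are made up for illustration, since the actual file contents are not shown here.

```python
import io
from collections import defaultdict

# Made-up sample input in 'id|year' form; it stands in for the real file.
sample = io.StringIO("1|2013\n1|2013\n2|2014\n3|2014\n3|2015\n4|2015\n4|2015\n")


def count_ids_per_year(lines):
    """Return {year: number of distinct IDs} for an iterable of 'id|year' lines."""
    ids_by_year = defaultdict(set)  # a missing year starts out as an empty set
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        id_, year = line.split('|')
        ids_by_year[year].add(id_)  # adding an ID that is already present is a no-op
    return {year: len(ids) for year, ids in ids_by_year.items()}


counts = count_ids_per_year(sample)
print(counts)  # prints -> {'2013': 1, '2014': 2, '2015': 2}
```

Passing an open file object instead of the StringIO works the same way, since both yield one line at a time.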

The last thing we want is to write the results to a file, which is trivial work. Just open the output file with the write flag 'w' and iterate over the dict. Note that write() only accepts strings and does not add a newline character by default, so you have to convert the count with str() and append '\n' yourself. Cheers.

Having said all that, the code above can probably be substituted by something like two lines if one uses pandas instead. I will let you look into that yourself though.
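For the curious, a rough sketch of what the pandas route might look like. It assumes pandas is installed, that the file has no header row, and that the column names 'id' and 'year' are our own choice. The sample data is made up to stand in for the real file.

```python
import io

import pandas as pd

# Made-up sample input in 'id|year' form; replace with the real file path.
sample = io.StringIO("1|2013\n1|2013\n2|2014\n3|2014\n3|2015\n4|2015\n4|2015\n")

# No header row in the data, so the column names are supplied here.
# dtype=str keeps IDs and years as strings, like the dict-based version.
df = pd.read_csv(sample, sep='|', header=None, names=['id', 'year'], dtype=str)

# nunique() counts the distinct IDs within each year group.
counts = df.groupby('year')['id'].nunique()
print(counts.to_dict())

# Writing 'year|count' lines to a file (the file name is an assumption):
# counts.to_csv('counts.txt', sep='|', header=False)
```

The read and the groupby are the "two lines"; everything else is setup for the example.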