Nina Nina - 26 days ago 4
Python Question

Counting with pandas (two different numbers at the same row)

I have a csv file concerning fines for wrong parking and it contains month , year , and the reason for the fine. I want to find the top 10 reasons (Error section / main cause) for getting the fines.

Notice that some rows in the

Error section / main cause
column some rows have two different reasons for getting the fine like (
0401 Parking Prohibited Area failure to comply with a traffic sign ; 2200 Parking next to the marked parking space

)

And then draw a graph of the number of fines per month (from January 2014 to latest data)
The code takes long time to respond and then gives error.(long list)

import pandas as pd
from StringIO import StringIO

df = pd.read_csv('Parkingfines.csv', parse_dates=True,
index_col="Month of the error", usecols=["Month of the error",
"Year of the error", "Error section / main cause"],
names=["Month of the error", "Year of the error", "Error section / main cause"], header=0)

df = df['Error section / main cause'].agg(['count'])

df = df['Month of the error'].groupby(df['Year of the error']).agg(['count'])

df.head(10).plot()


Link for the csv file

Answer

First of all, your file is a bit corrupted: the following two lines should be merged:

255121 October;;
255122 ;2014;0701 Parking without p-recognized / p-unit / p-ticket

Then your file seem to be Latin-1 encoded. Python 3 assumes by default that all files are UTF-8 and Python 2 assumes they are ASCII, so you have to tell explicitly that your file is Latin-1.

df = pd.read_csv('~/dl/parkingfines-2.csv', sep=';', 
                 encoding='latin-1', header=0)

Note also that as David Garwin mentioned, your separator is ;, not , (default), so you have to provide it explicitly. No need other parameters you passed to pd.read_csv: column names will be obtained from the first line of the file.

Then we have to address an issue that some fines have more than one cause. This can be treated in different ways. For example, we can replace such records with several records (one new record for every cause). This can be done in the following way:

# there are rows without causes, let's drop them
df.dropna(inplace=True)

# save index to use it later
df['idx'] = df.index

# find all rows for which cause contains ';' (this means several 
# causes presented)
multiples_mask = df['Error section / main cause'].str.contains(';')
multiples = df[multiples_mask]

# split each cause with ';' as separator
splitted = multiples['Error section / main cause'].str.split(';')

# create new dataframe
duplicated = []
for (i, row), spl in zip(multiples.iterrows(), splitted):
    for cause in spl:
        duplicated.append([row['Month of the error'], 
                           row['Year of the error'],
                           cause.strip(), i])

# combine the part of dataframe that contains only single causes
# with created new dataframe for several causes
df_with_dupes = pd.concat(
    [df[~ multiples_mask],
     pd.DataFrame(duplicated, columns=df.columns)], ignore_index=True)

# sort with idx
df_with_dupes.sort_values(by='idx', inplace=True)
df = df_with_dupes

# drop idx: we do not need it more
df.drop('idx', axis=1, inplace=True)

Now we can proceed to your problems. As for your first problem, find the most common causes of fines, the following code works:

causes_counts = df['Error section / main cause'].value_counts()
causes_counts.sort_values(ascending=False, inplace=True)
print(causes_counts.head(10))

As JohnE mentioned in the comment, you have to use value_counts() instead of agg(). Note also that in your code you tried to replace all the dataframe with the result of this command (df = df['Error section / main cause'].agg(['count']) means replace df with the result of calculation of the right hand side). It is clear that after you did that you lost your initial dataframe and therefore were not able to access it in the following line. So I used different variable name to store the result of counts.

As for your second problem, the following code works:

counts_per_month = df.groupby(by=['Year of the error', 
                                  'Month of the error', ]).agg('count')
counts_per_month.index = pd.to_datetime(
    [' '.join(map(str, col)).strip() for col in counts_per_month.index.values]
)
# flatten multiindex and convert to datetime

counts_per_month.plot()

counts