I have a csv file concerning fines for wrong parking and it contains month , year , and the reason for the fine. I want to find the top 10 reasons (Error section / main cause) for getting the fines.
Notice that some rows in the Error section / main cause column contain several causes separated by a semicolon, for example:
0401 Parking Prohibited Area failure to comply with a traffic sign ; 2200 Parking next to the marked parking space
import pandas as pd
from StringIO import StringIO
df = pd.read_csv('Parkingfines.csv', parse_dates=True,
index_col="Month of the error", usecols=["Month of the error",
"Year of the error", "Error section / main cause"],
names=["Month of the error", "Year of the error", "Error section / main cause"], header=0)
df = df['Error section / main cause'].agg(['count'])
df = df['Month of the error'].groupby(df['Year of the error']).agg(['count'])
First of all, your file is a bit corrupted: the following two lines should be merged:
255121 October;; 255122 ;2014;0701 Parking without p-recognized / p-unit / p-ticket
Then, your file seems to be Latin-1 encoded. Python 3 assumes by default that all files are UTF-8 (and Python 2 assumes they are ASCII), so you have to tell pandas explicitly that your file is Latin-1:
df = pd.read_csv('~/dl/parkingfines-2.csv', sep=';', encoding='latin-1', header=0)
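If you are unsure about the encoding, a quick sanity check is to try decoding a few raw bytes yourself: UTF-8 rejects most non-ASCII Latin-1 bytes, while Latin-1 never fails to decode. A minimal sketch with a made-up byte sample (not from your file):

```python
# Hypothetical sample: a Latin-1 encoded word containing non-ASCII characters.
sample = 'Pysäköinti'.encode('latin-1')

try:
    sample.decode('utf-8')
    encoding = 'utf-8'
except UnicodeDecodeError:
    # UTF-8 rejected the bytes, so the data is likely Latin-1
    encoding = 'latin-1'

print(encoding, '->', sample.decode(encoding))
```

In practice you would read the first few kilobytes of the real file in binary mode (`open(path, 'rb')`) and run the same check on them.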
Note also that, as David Garwin mentioned, your separator is
;, not
, (the default), so you have to provide it explicitly. The other parameters you passed to
pd.read_csv are not needed: column names will be obtained from the first line of the file.
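To see the effect of sep, here is a minimal sketch on an inline two-line sample (the data row is made up; the column names are from your file):

```python
import pandas as pd
from io import StringIO

sample = StringIO(
    "Month of the error;Year of the error;Error section / main cause\n"
    "October;2014;0701 Parking without p-recognized / p-unit / p-ticket\n"
)

# with sep=';' the header line splits into three columns
df = pd.read_csv(sample, sep=';')
print(df.shape)  # (1, 3)
```

Without `sep=';'`, each whole line would be read as a single column.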
Then we have to address an issue that some fines have more than one cause. This can be treated in different ways. For example, we can replace such records with several records (one new record for every cause). This can be done in the following way:
# there are rows without causes, let's drop them
df.dropna(inplace=True)

# save index to use it later
df['idx'] = df.index

# find all rows for which cause contains ';' (this means several
# causes presented)
multiples_mask = df['Error section / main cause'].str.contains(';')
multiples = df[multiples_mask]

# split each cause with ';' as separator
splitted = multiples['Error section / main cause'].str.split(';')

# create new dataframe
duplicated = []
for (i, row), spl in zip(multiples.iterrows(), splitted):
    for cause in spl:
        duplicated.append([row['Month of the error'],
                           row['Year of the error'],
                           cause.strip(), i])

# combine the part of dataframe that contains only single causes
# with created new dataframe for several causes
df_with_dupes = pd.concat(
    [df[~multiples_mask],
     pd.DataFrame(duplicated, columns=df.columns)],
    ignore_index=True)

# sort with idx
df_with_dupes.sort_values(by='idx', inplace=True)
df = df_with_dupes

# drop idx: we do not need it anymore
df.drop('idx', axis=1, inplace=True)
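On recent pandas versions (0.25+ for DataFrame.explode, 1.1+ for its ignore_index argument) the same record duplication can be done in a few lines. A sketch on a made-up frame with the same column names:

```python
import pandas as pd

# toy frame mimicking the structure of the file (rows are made up)
df = pd.DataFrame({
    'Month of the error': ['October', 'May'],
    'Year of the error': [2014, 2015],
    'Error section / main cause': [
        '0401 Parking prohibited area ; 2200 Parking next to the marked parking space',
        '0701 Parking without p-recognized / p-unit / p-ticket',
    ],
})

col = 'Error section / main cause'
# split on ';' into lists of stripped causes, then emit one row per cause
df[col] = df[col].str.split(';').apply(lambda causes: [c.strip() for c in causes])
df = df.explode(col, ignore_index=True)
print(len(df))  # 3 rows: the double-cause record became two rows
```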
Now we can proceed to your problems. As for your first problem, find the most common causes of fines, the following code works:
causes_counts = df['Error section / main cause'].value_counts()
causes_counts.sort_values(ascending=False, inplace=True)
print(causes_counts.head(10))
As JohnE mentioned in the comments, you have to use
value_counts() instead of
agg(). Note also that in your code you tried to replace the whole dataframe with the result of this command (
df = df['Error section / main cause'].agg(['count']) means: replace
df with the result of evaluating the right-hand side). After doing that you lost your initial dataframe and therefore could not access it in the following line, so I used a different variable name to store the counts.
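The difference is easy to see on a toy series of cause codes (made up here): agg(['count']) only counts the non-null entries overall, while value_counts() counts each distinct value, sorted descending:

```python
import pandas as pd

causes = pd.Series(['0401', '0701', '0401', '2200', '0401'])

print(causes.agg(['count']))   # a single number: 5 non-null entries
print(causes.value_counts())   # per-value counts, most frequent first
```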
As for your second problem, the following code works:
counts_per_month = df.groupby(
    by=['Year of the error', 'Month of the error']).agg('count')
# flatten multiindex and convert to datetime
counts_per_month.index = pd.to_datetime(
    [' '.join(map(str, col)).strip() for col in counts_per_month.index.values]
)
counts_per_month.plot()
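The index-flattening step relies on pd.to_datetime parsing strings like '2014 October', with the day defaulting to the first of the month. A minimal sketch with made-up counts:

```python
import pandas as pd

# made-up (year, month) index with the column names from the file
idx = pd.MultiIndex.from_tuples(
    [(2014, 'October'), (2014, 'November')],
    names=['Year of the error', 'Month of the error'])
counts = pd.Series([120, 95], index=idx, name='count')  # made-up numbers

# join each (year, month) tuple into one string and parse it as a date
counts.index = pd.to_datetime(
    [' '.join(map(str, col)).strip() for col in counts.index.values])
print(counts.index[0])  # 2014-10-01
```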