I am still working on my New York Subway data. I cleaned and wrangled the data in such a fashion that I now have 'Average Entries' and 'Average Exits' per Station per hour (ranging from 0 to 23) separated for weekend and weekday (category variable with two possible values: weekend/weekday).
What I am trying to do is create a plot with one row per station and two columns per row (the first for weekday, the second for weekend). I would like to plot 'Average Entries' and 'Average Exits' per hour to gain some information about the stations. There are two things of interest here: firstly, the sheer numbers, which indicate how busy a station is; secondly, the ratio between entries and exits for a given hour, which indicates whether the station is in a residential area (loads of entries in the morning, loads of exits in the evening) or more of a working area (loads of exits in the morning, entries peaking around 4, 6 and 8 pm or so). The only problem is that there are roughly 550 stations.
I tried plotting it with a seaborn FacetGrid, which can't handle more than a few stations (10 or so) without running into memory issues.
So I was wondering if anybody had a good idea to accomplish what I am trying to do.
Please find attached a notebook (the second-to-last cell shows my attempt at visualizing the data, i.e. the plotting for 4 stations). That clearly wouldn't work for 500+ stations, so maybe 5 stations in a row after all?
The very last cell contains the data for Station R001, as requested in a comment.
Any input much appreciated!
One possible approach is to plot the ratio of entries to exits per station as an image: each hour of each day forms a column, and each station forms a row. As an example, using random dummy data:
    from matplotlib import pyplot as plt
    import random
    import numpy as np

    all_stations = []
    for _ in range(550):
        # Dummy data point for each hour over a week; counts start at 1 to avoid division by zero
        entries = np.array([random.randint(1, 50) for _ in range(7*24)], dtype=float)
        exits = np.array([random.randint(1, 50) for _ in range(7*24)], dtype=float)

        # The first two days are treated as the weekend, the remaining five as weekdays
        weekend_ratio = entries[:2*24] / exits[:2*24]
        day_ratio = entries[2*24:] / exits[2*24:]

        whole_week = np.concatenate([weekend_ratio, day_ratio])
        all_stations.append(whole_week)

    plt.figure()
    plt.imshow(all_stations, aspect='auto', interpolation="nearest")
    plt.xlabel("Hours")
    plt.ylabel("Station number")
    plt.title("Entry/exit ratio per station")
    plt.colorbar(label="Entry/exit ratio")

    # Add some vertical lines to indicate days
    for j in range(1, 7):
        plt.plot([j*24]*2, [0, 550], color="black")

    plt.xlim(0, 7*24)
    plt.ylim(0, 550)
    plt.show()
If you would like to show the actual numbers involved and not the ratio, I would consider splitting the data into two images, one for entries and one for exits. The intensity of each pixel would then represent the count rather than the ratio.
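A rough sketch of that two-image variant, adapted to the layout described in the question (24 hourly averages each for weekday and weekend, so 48 columns per station). The nested dictionary layout, the helper names `make_dummy_averages`, `to_matrix` and `plot_entries_and_exits`, and the random counts are all assumptions made for illustration; your real data would need to be reshaped into something similar first.

```python
import random

HOURS = 24
DAY_TYPES = ("weekday", "weekend")


def make_dummy_averages(n_stations, seed=0):
    """Hypothetical stand-in for the real data:
    {station: {day_type: {"entries": [24 floats], "exits": [24 floats]}}}."""
    rng = random.Random(seed)
    data = {}
    for s in range(n_stations):
        station = f"R{s + 1:03d}"  # dummy station IDs, zero-padded so they sort correctly
        data[station] = {
            day_type: {
                "entries": [float(rng.randint(0, 500)) for _ in range(HOURS)],
                "exits": [float(rng.randint(0, 500)) for _ in range(HOURS)],
            }
            for day_type in DAY_TYPES
        }
    return data


def to_matrix(data, measure):
    """One row per station, 48 columns: 24 weekday hours followed by 24 weekend hours."""
    return [
        data[station]["weekday"][measure] + data[station]["weekend"][measure]
        for station in sorted(data)
    ]


def plot_entries_and_exits(data):
    # Imported here so the helpers above can be used without matplotlib installed
    from matplotlib import pyplot as plt

    entries = to_matrix(data, "entries")
    exits = to_matrix(data, "exits")
    # Shared colour scale so pixel intensities are comparable between the two images
    vmax = max(max(map(max, entries)), max(map(max, exits)))

    fig, axes = plt.subplots(1, 2, sharey=True, figsize=(12, 8))
    panels = zip(axes, (entries, exits), ("Average entries", "Average exits"))
    for ax, matrix, title in panels:
        image = ax.imshow(matrix, aspect="auto", interpolation="nearest",
                          vmin=0, vmax=vmax)
        ax.axvline(HOURS - 0.5, color="white")  # boundary between weekday and weekend
        ax.set_title(title)
        ax.set_xlabel("Hour (0-23 weekday, 24-47 weekend)")
    axes[0].set_ylabel("Station index")
    fig.colorbar(image, ax=axes, label="Average count per hour")
    plt.show()
```

Calling `plot_entries_and_exits(make_dummy_averages(550))` should draw both images side by side. Because a few very busy stations may dominate a linear colour scale, a logarithmic scale might be worth trying on the real data.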