I have an airline dataset from stat-computing.org (http://stat-computing.org/dataexpo/2009/the-data.html) which I am trying to analyse.
It has the variables DepTime and ArrDelay (departure time and arrival delay). I am trying to analyse how arrival delay varies across chunks of departure time. My objective is to find which departure times a person should avoid when booking tickets in order to avoid arrival delays.
My understanding is that if a one-tailed t-test between arrival delays for DepTime > 1800 and arrival delays for DepTime > 1900 shows high significance, then one should avoid flights between 1800 and 1900. (Please correct me if I am wrong.) I want to run such tests for all departure hours.
I am totally new to programming and data science. Any help would be much appreciated.
The data looks like this. The highlighted columns are the ones I am analysing.
(screenshot of the data)
Sharing an image of the data is not the same as providing the data for us to work with...
That said, I went and grabbed one year of data and worked this up.
    flights <- read.csv("~/Downloads/1995.csv", header = TRUE)
    flights <- flights[, c("DepTime", "ArrDelay")]

    # DepTime is in hhmm form; bin it by hour, e.g. 1845 -> 1800.
    flights$Dep <- round(flights$DepTime - 30, digits = -2)
    head(flights, n = 25)

    # This tests each hour of departures against the entire day.
    # Alternative is set to "less" because we want to know if a given hour
    # has less delay than the day as a whole.
    # Each element of the result is a full t.test object (use $p.value on it).
    pVsDay <- tapply(flights$ArrDelay, flights$Dep, function(x)
      t.test(x, flights$ArrDelay, alternative = "less"))

    # This tests each hour of departures against every other hour of the day.
    # Alternative is set to "less" because we want to know if a given hour
    # has less delay than the other hours.
    pAllvsAll <- tapply(flights$ArrDelay, flights$Dep, function(x)
      tapply(flights$ArrDelay, flights$Dep, function(z)
        t.test(x, z, alternative = "less")))
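Since the question is really about which hours to avoid, it may help to flip the alternative to "greater" and keep only the p-values. The following is a minimal sketch, not the code I ran on the 1995 file: `toy` is simulated stand-in data (with evening flights deliberately made later), and `p_worse` is a name introduced here for illustration.

```r
# Simulated stand-in for the airline data (NOT real flights).
set.seed(1)
toy <- data.frame(
  DepTime  = sample(c(600:659, 1200:1259, 1800:1859), 600, replace = TRUE),
  ArrDelay = rnorm(600, mean = 5, sd = 20)
)

# Assumption for the demo: make evening departures later on average,
# so that there is something for the test to detect.
evening <- toy$DepTime >= 1800
toy$ArrDelay[evening] <- toy$ArrDelay[evening] + 15

# Truncate hhmm to the start of the hour, e.g. 1845 -> 1800.
toy$Dep <- floor(toy$DepTime / 100) * 100

# One Welch t-test per hour against the whole day, keeping only the p-value.
# "greater" asks whether the hour has MORE delay than the day as a whole.
p_worse <- sapply(split(toy$ArrDelay, toy$Dep), function(x) {
  t.test(x, toy$ArrDelay, alternative = "greater")$p.value
})

# Smallest p-values first: the hours most clearly worse than the day overall.
sort(p_worse)
```

On the real data, `t.test` will stop with an error for any hour that has fewer than two non-missing delays, so very sparse hours may need to be filtered out first.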
I'll let you figure out multiple hypothesis testing and the like.
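As a starting point for that, base R's `p.adjust` can correct a vector of p-values for the number of tests run. This is only a sketch under stated assumptions: `toy` is simulated data with no real differences between hours, and `p_raw`, `p_holm` and `p_bh` are names made up for the example.

```r
# Simulated stand-in data (NOT real flights): 24 hourly bins, 50 flights each,
# all drawn from the same distribution, so no hour is truly worse.
set.seed(42)
toy <- data.frame(
  Dep      = rep(seq(0, 2300, by = 100), each = 50),
  ArrDelay = rnorm(24 * 50, mean = 5, sd = 20)
)

# Raw p-value per hour: is this hour's delay greater than the day's?
p_raw <- sapply(split(toy$ArrDelay, toy$Dep), function(x) {
  t.test(x, toy$ArrDelay, alternative = "greater")$p.value
})

# Adjust for having run 24 tests at once.
p_holm <- p.adjust(p_raw, method = "holm")  # controls the family-wise error rate
p_bh   <- p.adjust(p_raw, method = "BH")    # controls the false discovery rate

# Compare raw and adjusted values side by side.
round(cbind(raw = p_raw, holm = p_holm, BH = p_bh), 3)
```

For the all-pairs comparison, `pairwise.t.test` in base R takes a `p.adjust.method` argument and does the testing and the adjustment in one call.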