Dominik - 1 month ago
R Question

Out-of-order dates in ggplot2

I typically know how to order my dates in ggplot but something is different about this data and I'm hoping someone can clarify for me.

Consider:

ggplot(tmp3)+
geom_boxplot(aes(x=simdte,y=r2))+
facet_wrap(~simyr, scales='free_x')+
theme(axis.text.x=element_text(angle=45,hjust=1))


The dates are in alphanumeric order, but now I want to format the x-axis labels, so I tried:

ggplot(tmp3)+
geom_boxplot(aes(x=reorder(strftime(strptime(simdte,'%Y%m%d'),'%b-%d'),as.numeric(simdte)),y=r2))+
facet_wrap(~simyr, scales='free_x')+
theme(axis.text.x=element_text(angle=45,hjust=1))


but notice that all the dates are in order EXCEPT Jun-08 in 2015.

I also tried

tmp3=
tmp3 %>%
mutate(plotsimdte=factor(strftime(strptime(simdte,'%Y%m%d'),'%b-%d'), levels=strftime(strptime(unique(simdte),'%Y%m%d'),'%b-%d')[order(unique(simdte))]))


and plotting with x=plotsimdte, but it made no difference. I also get a warning about duplicated levels when I create this factor, which is confusing since I'm only using unique values.

Lastly, I tried

ggplot(tmp3)+
geom_boxplot(aes(x=as.Date(simdte,'%Y%m%d'),y=r2, group=simdte))+
scale_x_date(date_labels ='%b-%d')+
facet_wrap(~simyr, scales='free_x')+
theme(axis.text.x=element_text(angle=45,hjust=1))


but I'd like to keep the dates discrete because their importance is as an identifier rather than distribution through time.

Any advice would be appreciated. Thanks

A small subset of my data

EDIT: updated dput output with as.data.frame

> dput(as.data.frame(tmp3))
structure(list(mdldte = c("20130525", "20140407", "20140413",
"20150608", "20130525", "20150608", "20140420", "20130429", "20130608",
"20130608", "20140323", "20140413", "20150325", "20150608", "20140511",
"20130601", "20150608", "20130608", "20140420", "20150305", "20150415",
"20130608", "20140531", "20150608", "20140531", "20150608", "20130403",
"20130503", "20150415", "20140407", "20150608", "20140323", "20130525",
"20140420", "20130403", "20130403", "20130608", "20150501", "20150608",
"20130429", "20160607", "20140527", "20140420", "20140531", "20140502",
"20150325", "20140428", "20160620", "20160620", "20130403", "20160527",
"20150415", "20140413", "20160607", "20140413", "20150608", "20160613",
"20150608", "20140407", "20150501", "20140323", "20160607", "20140531",
"20150305", "20150409", "20140428", "20130503", "20130525", "20140428",
"20140407", "20130503", "20130525", "20130403", "20150305", "20150217",
"20150501", "20130608", "20150305", "20150217", "20130608", "20140511",
"20160527", "20140502", "20150415"), simdte = c("20130403", "20130403",
"20130403", "20130429", "20130429", "20130429", "20130503", "20130503",
"20130503", "20130525", "20130525", "20130525", "20130601", "20130601",
"20130601", "20130608", "20130608", "20130608", "20140323", "20140323",
"20140323", "20140407", "20140407", "20140407", "20140413", "20140413",
"20140413", "20140420", "20140420", "20140420", "20140428", "20140428",
"20140428", "20140502", "20140502", "20140502", "20140511", "20140511",
"20140511", "20140517", "20140517", "20140517", "20140527", "20140527",
"20140527", "20140531", "20140531", "20140531", "20150217", "20150217",
"20150217", "20150305", "20150305", "20150305", "20150325", "20150325",
"20150325", "20150409", "20150409", "20150409", "20150415", "20150415",
"20150415", "20150427", "20150427", "20150427", "20150501", "20150501",
"20150501", "20150608", "20150608", "20150608", "20160527", "20160527",
"20160527", "20160607", "20160607", "20160607", "20160613", "20160613",
"20160613", "20160620", "20160620", "20160620"), r2 = c(0.862283742909527,
0.813142444594872, 0.700946018367384, 0.474388980021752, 0.826648311592866,
0.794283339648572, 0.79687922855493, 0.808984929407683, 0.781751354268809,
0.535951689307516, 0.68524477567256, 0.716321630808227, 0.373141090466726,
0.723850452026657, 0.408972539926536, 0.29346057127035, 0.319261073048776,
0.319535158994707, 0.872351278607699, 0.871652058666136, 0.509872096326808,
0.398605136979609, 0.420745998256184, 0.596082529689281, 0.793035779455997,
0.661212720614186, 0.736581215438551, 0.89337362408349, 0.900773593767951,
0.916946297262156, 0.700865150846107, 0.839501961957186, 0.863684601286204,
0.819367869015135, 0.765192251153536, 0.590744027549224, 0.720092636591613,
0.732237645665246, 0.701898569000057, 0.505310296599101, 0.756344530560126,
0.522404606955389, 0.631453896947287, 0.732767696833121, 0.669168785479052,
0.340080390313005, 0.397681954572616, 0.708286400101956, 0.551718623201008,
0.62217661847446, 0.160935876745664, 0.79407487647674, 0.729924604817696,
0.716024523586796, 0.526169199415047, 0.702098331814224, 0.748626603557805,
0.432690018453805, 0.710646849035047, 0.526049259906931, 0.811336120223548,
0.679819505156441, 0.591396577448379, 0.656686513355743, 0.698313842140892,
0.718604690738853, 0.768070041705958, 0.453336001102217, 0.544446423520199,
0.583336140040845, 0.172961846412558, 0.298155303932666, 0.731010397306203,
0.582517045429492, 0.521708072638302, 0.610885761462162, 0.543494236386099,
0.630580819311437, 0.642714888852003, 0.736302041771047, 0.736086951074143,
0.444437396681972, 0.445336147280364, 0.43829690520584), simyr = c("2013",
"2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013",
"2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013",
"2013", "2014", "2014", "2014", "2014", "2014", "2014", "2014",
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2014",
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2014",
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2015",
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015",
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015",
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2016",
"2016", "2016", "2016", "2016", "2016", "2016", "2016", "2016",
"2016", "2016", "2016"), mdlpreds = structure(c(4L, 2L, 3L, 1L,
3L, 2L, 4L, 2L, 3L, 3L, 4L, 2L, 1L, 2L, 3L, 1L, 3L, 3L, 4L, 4L,
1L, 1L, 1L, 3L, 2L, 3L, 3L, 4L, 4L, 4L, 2L, 3L, 4L, 2L, 4L, 1L,
3L, 3L, 3L, 3L, 2L, 1L, 4L, 2L, 4L, 3L, 1L, 4L, 4L, 4L, 3L, 4L,
2L, 2L, 1L, 3L, 3L, 1L, 3L, 2L, 2L, 3L, 3L, 4L, 4L, 3L, 2L, 1L,
3L, 2L, 3L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 2L, 2L, 2L, 1L, 1L, 1L
), .Label = c("phv", "phvfsca", "phvaso", "phvasofsca"), class = "factor")), class = "data.frame", .Names = c("mdldte",
"simdte", "r2", "simyr", "mdlpreds"), row.names = c(NA, -84L))

Answer

The issue is that your dates are stored as character data, so R treats each one as a category rather than as a point in time. What you really want is for them to be genuine Date objects, and then to let ggplot's scale functions handle the ordering and labeling.

Convert the date data to Date type:

tmp3$newdate <- as.Date(strptime(tmp3$simdte, '%Y%m%d'))

Specify the new dates as the x-values (no need to select only the unique values), and use scale_x_date to create pretty labels. Note that this also correctly spaces the data points across time, instead of using even spacing for each "level" of the date data.

plot.new <- ggplot(tmp3)+
    geom_point(aes(x= newdate, y=r2))+
    scale_x_date(date_labels = '%b-%d') +
    facet_wrap(~simyr, scales='free_x')+
    theme(axis.text.x=element_text(angle=45,hjust=1))
print(plot.new)


In the future, it's useful to be aware of the str function, which can quickly tell you the format of your data columns (also accessible from the Environment panel in RStudio):

str(tmp3)

'data.frame':   28 obs. of  7 variables:
 $ mdldte  : chr  "20150305" "20140531" "20160620" "20150305" ...
 $ simdte  : chr  "20130403" "20130429" "20130503" "20130525" ...
 $ r2      : num  0.542 0.485 0.54 0.4 0.594 ...
 $ simyr   : chr  "2013" "2013" "2013" "2013" ...
 $ mdlyr   : chr  "2015" "2014" "2016" "2015" ...
 $ mdlpreds: Factor w/ 4 levels "phv","phvfsca",..: 1 1 1 1 4 1 4 2 3 4 ...
 $ newdate : Date, format: "2013-04-03" "2013-04-29" "2013-05-03" "2013-05-25" ...

As you can see, your original "simdte" column is stored as character data, so R (and ggplot) treats every distinct value as its own level or category. Date data, by contrast, are fundamentally numerical: R treats them as continuous, which makes it easier to plot them accurately on a time axis. It also makes it easier to keep the underlying data separate from the format of any plotting labels.
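To make the character-versus-Date distinction concrete, here is a minimal sketch using three values taken from your simdte column. The variable names (chr_sorted, d, gaps) are just for illustration and are not part of your data.

```r
# Three values taken from the question's simdte column, deliberately out of order
simdte <- c("20150608", "20130403", "20140531")

# Character data: each distinct string is just a category.
# Sorting is alphabetical, which for zero-padded yyyymmdd strings
# happens to match chronological order.
chr_sorted <- sort(simdte)

# Date data: a count of days under the hood, so gaps are meaningful
d <- as.Date(simdte, format = "%Y%m%d")
class(d)                 # "Date"
gaps <- diff(sort(d))    # time differences between consecutive dates

# Labels are produced from the dates without changing the stored values
# (month abbreviations depend on your locale)
format(sort(d), "%b-%d")
```

The point is that formatting happens at the very end, on a copy; the Date values themselves keep their real order and spacing.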

Update: Using dates as categories and plotting boxplots, in date order

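If instead we want each date to act as a category (rather than as a position on a continuous time axis), the solution is actually simpler. Your misordering comes from building the factor out of the formatted labels: '%b-%d' drops the year, so 20130608 and 20150608 both become "Jun-08" and share a single level. reorder() then places that level by the mean of as.numeric(simdte) over both years, which puts it ahead of every 2015 date, and the same collision is why factor() warned about duplicated levels.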
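To see how the misordering can arise, here is a small sketch using three simdte values from your data. It only reproduces the factor-building step from your second attempt, outside of ggplot; lab and f are illustrative names.

```r
# Three simdte values taken from the question's data
simdte <- c("20130608", "20150217", "20150608")

# Dropping the year makes two different dates share one label
lab <- strftime(strptime(simdte, "%Y%m%d"), "%b-%d")
anyDuplicated(lab) > 0        # TRUE: the two June 8 dates collapse

# reorder() ranks each level by the mean of the sort key (FUN = mean by default).
# The shared June 8 level averages 20130608 and 20150608 to 20140608,
# which is smaller than 20150217, so it is ranked before February 17.
f <- reorder(lab, as.numeric(simdte))
levels(f)
```

In the 2013 panel June 8 still lands last, which is why only the 2015 panel looks wrong.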

The key is to rely on ggplot's built-in labeling functions. Once again, the main call to ggplot is fed the raw data, and scale_x_discrete handles the creation of pretty labels:

plot.new <- ggplot(tmp3)+
    geom_boxplot(aes(x=simdte,y=r2))+
    facet_wrap(~simyr, scales='free_x')+
    scale_x_discrete(labels = function(x) strftime(strptime(x, '%Y%m%d'), '%b-%d'))+
    theme(axis.text.x=element_text(angle=45,hjust=1))
print(plot.new)
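As a rough check of why this works, the sketch below runs the same formatter outside of ggplot. It assumes, as I understand ggplot's discrete scales, that a character x is laid out in sorted order of its distinct raw values; label_dates and brk are hypothetical names used only here.

```r
# Same formatter as passed to scale_x_discrete(labels = ...) above
label_dates <- function(x) strftime(strptime(x, "%Y%m%d"), "%b-%d")

# Values taken from the question's simdte column
simdte <- c("20150608", "20130608", "20150217", "20150608")

# Assumption: the axis positions come from the sorted distinct raw values
brk <- sort(unique(simdte))

# The formatter only renames those positions, so two positions
# may carry the same label without affecting the order
data.frame(position = seq_along(brk), raw = brk, label = label_dates(brk))
```

Because the year stays in the underlying values, the two June 8 dates remain separate categories even though they print identically.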

