aminards aminards - 25 days ago 11
R Question

How do I restrict the standard deviation bars in my barplot to a maximum value?

I am creating barplots with standard deviation bars using ggplot2. My data frame is quite large, but here is a truncated version as an example:

SampleName Target.ID Maj.Allele.Freq SD AVG.MAF
W15-P2-1 rs1005533 99.74811083 24.98883743 93.70753223
W15-P2-2 rs1005533 100 24.98883743 93.70753223
W15-P2-3 rs1005533 100 24.98883743 93.70753223
W15-P2-4 rs1005533 100 24.98883743 93.70753223
W15-P2-1 rs1005533 99.94819995 24.98883743 93.70753223
W15-P2-2 rs1005533 100 24.98883743 93.70753223
W15-P2-3 rs1005533 100 24.98883743 93.70753223
W15-P2-4 rs1005533 100 24.98883743 93.70753223
W21-P2-1 rs1005533 100 24.98883743 93.70753223
W21-P2-2 rs1005533 100 24.98883743 93.70753223
W21-P2-3 rs1005533 99.90044798 24.98883743 93.70753223
W21-P2-4 rs1005533 99.72375691 24.98883743 93.70753223
W21-P2-1 rs1005533 100 24.98883743 93.70753223
W21-P2-2 rs1005533 100 24.98883743 93.70753223
W21-P2-3 rs1005533 100 24.98883743 93.70753223
W21-P2-4 rs1005533 0 24.98883743 93.70753223
W15-P2-1 rs10092491 52.40641711 1.340954343 51.8604281
W15-P2-2 rs10092491 53.69923603 1.340954343 51.8604281
W15-P2-3 rs10092491 52.56689284 1.340954343 51.8604281
W15-P2-4 rs10092491 50.11764706 1.340954343 51.8604281
W15-P2-1 rs10092491 50.30094583 1.340954343 51.8604281
W15-P2-2 rs10092491 50.96277279 1.340954343 51.8604281
W15-P2-3 rs10092491 50.94102886 1.340954343 51.8604281
W15-P2-4 rs10092491 51.2849162 1.340954343 51.8604281
W21-P2-1 rs10092491 53.56976202 1.340954343 51.8604281
W21-P2-2 rs10092491 50.27861123 1.340954343 51.8604281
W21-P2-3 rs10092491 52.8358209 1.340954343 51.8604281
W21-P2-4 rs10092491 51.42585551 1.340954343 51.8604281
W21-P2-1 rs10092491 52.77890467 1.340954343 51.8604281
W21-P2-2 rs10092491 52.89017341 1.340954343 51.8604281
W21-P2-3 rs10092491 53.70786517 1.340954343 51.8604281
W21-P2-4 rs10092491 50 1.340954343 51.8604281


Because the average in the last column (AVG.MAF) plus the standard deviation (SD) can exceed 100, the error bars extend past my intended y-axis maximum of 100.

Example Standard Deviation bars extend beyond 100.

Here is the code to create the above plot:

pe1 = ggplot(half1, aes(x=Target.ID, y=AVG.MAF)) +
  geom_bar(stat = "identity", position = "dodge", colour = "black",
           width = 0.5, fill = "yellowgreen") +
  xlab("") +
  ylab("Average Major Allele Frequency") +
  labs(title="Allele Balance AmpliSeq Identity Sample P2") +
  geom_errorbar(aes(ymin = AVG.MAF-SD, ymax = AVG.MAF+SD),
                width = 0.4, position = position_dodge(0.9),
                size = 0.6) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))


I tried truncating the plot using coord_cartesian, but this makes the plot look like I am hiding some data:

Here the top of the standard deviation bars is cut off

Here is the code to create the plot with the standard deviation bars cut off:

pe1 = ggplot(half1, aes(x=Target.ID, y=AVG.MAF)) +
  geom_bar(stat = "identity", position = "dodge", colour = "black",
           width = 0.5, fill = "yellowgreen") +
  xlab("") +
  ylab("Average Major Allele Frequency") +
  labs(title="Allele Balance AmpliSeq Identity Sample P2") +
  geom_errorbar(aes(ymin = AVG.MAF-SD, ymax = AVG.MAF+SD),
                width = 0.4, position = position_dodge(0.9),
                size = 0.6) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) +
  coord_cartesian(ylim=c(0,100))


It seems like there has to be a way to restrict the standard deviation bars to my intended ymax of 100 and still keep the top horizontal bar visible in the plot. Does anyone know how to do this?

Answer

In addition to the issues people have raised in the comments, here are a couple of other considerations:

  1. You don't need to add a column that repeats the mean for every row of your data. Instead, you can calculate and plot the mean within ggplot, using the actual data values in Maj.Allele.Freq. (In fact, by using a column for the y-value that repeats the mean value over and over for each Target.ID, you're actually plotting multiple copies of the mean bar, one on top of the other.)

    You can also summarize the data (i.e., calculate the mean and standard deviations) outside of ggplot and then use the summarized data frame for plotting. That's sometimes necessary in more complex situations, but you can do it all within ggplot here.

  2. It seems to me points would work better than bars here.
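The alternative mentioned in item 1, summarizing outside ggplot, might look like the sketch below. It uses base R's tapply() on a small made-up stand-in for half1 (the rsA/rsB values are invented, not the asker's data) and builds one row per Target.ID, which also avoids stacking duplicate bars. The plotting step is skipped if ggplot2 is not installed.

```r
# Toy stand-in for `half1`; the values are invented for illustration only.
half1 <- data.frame(
  Target.ID       = rep(c("rsA", "rsB"), each = 4),
  Maj.Allele.Freq = c(100, 100, 100, 0, 52, 54, 50, 52)
)

# Mean and SD of the raw values, computed once per Target.ID.
means <- tapply(half1$Maj.Allele.Freq, half1$Target.ID, mean)
sds   <- tapply(half1$Maj.Allele.Freq, half1$Target.ID, sd)

summ <- data.frame(
  Target.ID = names(means),
  AVG.MAF   = as.vector(means),
  SD        = as.vector(sds)
)

# Plot the summarized data: one bar and one error bar per Target.ID.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(summ, aes(x = Target.ID, y = AVG.MAF)) +
    geom_col(fill = "yellowgreen", colour = "black", width = 0.5) +
    geom_errorbar(aes(ymin = AVG.MAF - SD, ymax = AVG.MAF + SD), width = 0.2)
  # print(p) draws the plot
}
```

Because summ has a single row per Target.ID, geom_col draws exactly one bar per group.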

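The code below provides both the point and bar versions and also shows how to add either the standard deviation of the data or the 95% confidence interval of the mean. The blue lines represent the standard deviations, while the red lines represent the 95% confidence interval. Note that mean_sdl draws the mean plus or minus two standard deviations by default; add fun.args=list(mult=1) to the stat_summary call for plus or minus one.

I've provided bootstrapped confidence intervals. To provide classical normal confidence intervals, switch from mean_cl_boot to mean_cl_normal. Both, like mean_sdl, are wrappers around functions in the Hmisc package, which must be installed.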

If you want the y-axis to go down to zero, add coord_cartesian(ylim=c(0,150)) or whatever maximum y-value you wish (as the comments discuss, to avoid a misleading graph, it should be above the top of the error bar, regardless of whether the bar represents the SD or CI).

ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
  stat_summary(fun.data=mean_sdl, geom="errorbar", width=0.1, colour="blue") +
  stat_summary(fun.data=mean_sdl, geom="point", colour="blue", size=3) +
  stat_summary(fun.data = mean_cl_boot, colour="red", geom="errorbar", width=0.1) +
  stat_summary(fun.data = mean_cl_boot, colour="red", geom="point") +
  labs(x="", y="Average Major Allele Frequency", 
       title="Allele Balance AmpliSeq\nIdentity Sample P2") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) 
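
To make explicit what the two kinds of interval measure, here is a hand-rolled base R sketch of the quantities that mean_sdl and mean_cl_normal report. The helper names sdl_by_hand and cl_normal_by_hand are hypothetical, and the code approximates the documented behaviour (mean plus or minus a multiple of the SD, and a t-based interval for the mean) rather than reproducing the Hmisc source. mean_cl_boot is not covered, since it resamples the data instead of using a formula.

```r
# Hypothetical helper: mean +/- mult * SD (mean_sdl uses mult = 2 by default).
sdl_by_hand <- function(x, mult = 2) {
  m <- mean(x)
  s <- sd(x)
  c(y = m, ymin = m - mult * s, ymax = m + mult * s)
}

# Hypothetical helper: t-based confidence interval for the mean.
cl_normal_by_hand <- function(x, conf.int = 0.95) {
  n  <- length(x)
  m  <- mean(x)
  hw <- qt((1 + conf.int) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(y = m, ymin = m - hw, ymax = m + hw)
}

x <- c(52, 54, 50, 52)    # made-up values for one Target.ID
sdl_by_hand(x)            # spread of the data (mean +/- 2 SD)
sdl_by_hand(x, mult = 1)  # mean +/- 1 SD
cl_normal_by_hand(x)      # uncertainty of the mean
```

The SD interval describes how spread out the individual values are, while the confidence interval describes how precisely the mean is estimated and shrinks as the number of observations grows.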


ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
  stat_summary(fun=mean, geom="bar", fill="yellowgreen", colour="black") +  # fun.y was renamed to fun in ggplot2 3.3.0
  stat_summary(fun.data=mean_sdl, geom="errorbar", width=0.1, size=1, colour="blue") +
  stat_summary(fun.data = mean_cl_boot, colour="red", geom="errorbar", width=0.1, size=0.7) +
  labs(x="", y="Average Major Allele Frequency", 
       title="Allele Balance AmpliSeq\nIdentity Sample P2") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))   
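
For the coord_cartesian() suggestion, the upper limit can be derived from the data rather than guessed. The sketch below, on made-up values standing in for half1, finds the tallest mean + 2 SD whisker (the mean_sdl default) and rounds up to the next multiple of 10. The rounding step and the name y_top are illustrative choices, not part of the original answer.

```r
# Toy stand-in for `half1`; the values are invented for illustration only.
half1 <- data.frame(
  Target.ID       = rep(c("rsA", "rsB"), each = 4),
  Maj.Allele.Freq = c(100, 100, 100, 0, 52, 54, 50, 52)
)

# Top of the mean + 2 SD whisker for each Target.ID.
upper <- tapply(half1$Maj.Allele.Freq, half1$Target.ID,
                function(x) mean(x) + 2 * sd(x))

# Round the tallest whisker up to the next multiple of 10.
y_top <- ceiling(max(upper) / 10) * 10
y_top

# Then add to the plot:  + coord_cartesian(ylim = c(0, y_top))
```

If the tallest whisker happens to land exactly on a multiple of 10, adding a small margin to y_top keeps its cap clear of the panel edge.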


You could also put both the SD and 95% CI on the same plot:

pnp = position_nudge(x=0.1)
pnm = position_nudge(x=-0.1)

ggplot(half1, aes(x=Target.ID, y=Maj.Allele.Freq)) +
  stat_summary(fun.data=mean_sdl, geom="errorbar", width=0.1, position=pnp, aes(colour="SD")) +
  stat_summary(fun.data=mean_sdl, geom="point", position=pnp, aes(colour="SD")) +
  stat_summary(fun.data = mean_cl_boot, geom="errorbar", width=0.1, 
               position=pnm, aes(colour="95% CI")) +
  stat_summary(fun.data = mean_cl_boot, geom="point", position=pnm, aes(colour="95% CI")) +
  labs(x="", y="Average Major Allele Frequency", colour="",
       title="Allele Balance AmpliSeq\nIdentity Sample P2") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

