Mike - 7 months ago 60

R Question

Hoping someone can help me with labelling the columns of a grouped bar chart with percentages. I couldn't find an existing post that I could make work successfully. Below is the code for a basic example data frame.

```
Service <- c("AS","AS","PS","PS","RS","RS","ES","ES")
Year <- c("2015","2016","2015","2016","2015","2016","2015","2016")
Q1 <- c("Dissatisfied","Satisfied","Satisfied","Satisfied","Dissatisfied","Dissatisfied","Satisfied","Satisfied")
Q2 <- c("Dissatisfied","Dissatisfied","Satisfied","Dissatisfied","Dissatisfied","Satisfied","Satisfied","Satisfied")
Example <- data.frame(Service, Year, Q1, Q2)
```

Next, I melted it with reshape2 so that I could plot the Q1 and Q2 column variables along the x-axis. I then created a basic grouped bar chart with ggplot2, with counts on the y-axis and a facet by year.

```
library(reshape2)
library(ggplot2)

ExampleM <- melt(Example, id.vars=c("Service","Year"))

ggplot(ExampleM, aes(x=variable, stat="identity", fill=value)) +
  geom_bar(position="dodge") + facet_grid(~Year)
```

What I'm struggling with is how to add column labels. Specifically I would like to know how to add basic frequency counts, as well as percentages. Not both together, but one or the other. I can't make either work. I've tried using "+geom_text(aes(labels=" but I'm not sure what to put as the label since I used stat="identity" in the ggplot code.

Also, for percentages, do I need to calculate it with dplyr first, or can I calculate the percentages within the ggplot code? I also don't know enough about labels in R, so not sure about how to add the actual % sign.

Hoping someone can show me a basic way to achieve all this!

Answer

You can add counts as text using `stat_count` with `geom="text"`. `..count..` is the internal variable that `ggplot` creates to hold the count values. The example below shows how to add both counts and percentages using `stat_count`, though you can, of course, choose to include only one of them.

`stat="identity"` doesn't do anything inside `aes`. You would normally put it inside the geom. But in this case you don't want `stat="identity"` because you actually want `ggplot` to count the number of values in each category. You would use `stat="identity"` with `geom_bar` if you were using a data frame with a column that already contained the counts for each category.

To create the label text, use `paste0` to combine the calculated values (e.g., `..count../sum(..count..)*100` is the percentage) with text like the `%` sign. Also, in this case I've used the newline character `\n` to put the percentage and count on separate lines. `sprintf` is a formatting function that in this case produces values rounded to one decimal place.^{1}

```
ggplot(ExampleM, aes(x=variable, fill=value)) +
  geom_bar(position="dodge") +
  # Label: percentage of all values, then the count on a second line
  stat_count(aes(label=paste0(sprintf("%1.1f", ..count../sum(..count..)*100),
                              "%\n", ..count..), y=0.5*..count..),
             geom="text", colour="white", size=4, position=position_dodge(width=1)) +
  facet_grid(~Year)
```
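To see what the label expression produces on its own, here is a minimal base R sketch. The counts `c(3, 5)` are made-up values for illustration and are not taken from the plot.

```r
# Hypothetical counts for two categories (illustrative values only)
counts <- c(3, 5)

# Percentage of the total, formatted to one decimal place
pct <- sprintf("%1.1f", counts / sum(counts) * 100)

# Combine the percentage, a "%" sign, a newline, and the raw count
labels <- paste0(pct, "%\n", counts)

# cat() renders the newline, so each label prints on two lines
cat(labels, sep = "\n---\n")
```

Building the string outside of `ggplot` first can make it easier to debug the formatting before moving it into `aes(label=...)`.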

Here's an example where you pre-summarize the data and use `stat="identity"` when plotting it: Say that instead of the percentages being the percent of all values, you want percentages within each quarter. Let's also stack the bars and add the percentages to the bars as text:

First, create the data summary. We'll use `dplyr` so that we can use the chaining (`%>%`) operator. We'll count the number of values, calculate percentages within each combination of `Year` and `variable`, and we'll also add `n.pos` to provide y-values for the text location in a stacked bar plot.

```
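library(dplyr)

# Count rows per Year/variable/value, then regroup to get
# percentages and label positions within each Year/variable
summary = ExampleM %>% group_by(Year, variable, value) %>%
  tally() %>%
  group_by(Year, variable) %>%
  mutate(pct = n/sum(n),
         n.pos = cumsum(n) - 0.5*n)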
```
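The effect of regrouping before computing `pct` and `n.pos` can be checked on a small table. The sketch below uses base R's `ave` instead of `dplyr`, with made-up counts, so it illustrates the idea and is not the code used in this answer.

```r
# Hypothetical pre-counted data: one row per Year/value combination
counts <- data.frame(
  Year  = c("2015", "2015", "2016", "2016"),
  value = c("Dissatisfied", "Satisfied", "Dissatisfied", "Satisfied"),
  n     = c(2, 2, 1, 3)
)

# Share of each row within its Year (the role of the second group_by)
counts$pct <- counts$n / ave(counts$n, counts$Year, FUN = sum)

# Midpoint of each stacked section: running total minus half the section
counts$n.pos <- ave(counts$n, counts$Year, FUN = cumsum) - 0.5 * counts$n

print(counts)
```

Within each year the `pct` values sum to 1, which is what distinguishes this from a percentage of all rows.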

Now for the plot. Note that we supply `y=n`. Since we've pre-summarized the data (rather than having counts and percentages calculated inside `geom_bar`) we need `stat="identity"`.

```
ggplot(summary, aes(x=variable, y=n, fill=value)) +
  geom_bar(stat="identity") +
  facet_grid(.~Year) +
  geom_text(aes(label=paste0(sprintf("%1.1f", pct*100),"%"), y=n.pos),
            colour="white")
```
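As an alternative to computing label positions by hand, more recent ggplot2 releases provide `position_stack(vjust = 0.5)`, which centres text inside each stacked section. As far as I know, ggplot2 2.2.0 and later also stack bars in reverse group order, so hand-computed `cumsum` positions may not line up with the sections there. The sketch below assumes such a version and uses a made-up data frame `counts`; it is an illustration, not a replacement for the answer's code.

```r
library(ggplot2)

# Made-up pre-summarized counts (illustrative values only)
counts <- data.frame(
  variable = c("Q1", "Q1", "Q2", "Q2"),
  value    = c("Dissatisfied", "Satisfied", "Dissatisfied", "Satisfied"),
  n        = c(2, 2, 1, 3)
)
# Percentage within each variable
counts$pct <- counts$n / ave(counts$n, counts$variable, FUN = sum)

p <- ggplot(counts, aes(x = variable, y = n, fill = value)) +
  geom_col() +  # geom_col() is shorthand for geom_bar(stat = "identity")
  geom_text(aes(label = paste0(sprintf("%1.1f", pct * 100), "%")),
            position = position_stack(vjust = 0.5),  # centre within each section
            colour = "white")
p
```

Because the text layer is stacked by the same position adjustment as the bars, the labels follow the bars even if the stacking order changes.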

^{1} You can use `round` instead, but I prefer `sprintf` because it keeps a zero in the decimal place even when the decimal part is zero, while `round` returns just the integer part when the decimal part is zero. For example, compare `round(3.04, 1)` and `sprintf("%1.1f", 3.04)`.
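The difference is easiest to see when the results are pasted into a label. A short base R comparison (the variable names are only for illustration):

```r
rounded   <- round(3.04, 1)           # numeric 3
formatted <- sprintf("%1.1f", 3.04)   # character "3.0"

label_round   <- paste0(rounded, "%")    # "3%"
label_sprintf <- paste0(formatted, "%")  # "3.0%"

print(c(label_round, label_sprintf))
```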

**UPDATE:** To answer the questions in your comments:

What's the reason for the second "group_by line"? We've calculated counts for each combination of Year, variable, and value. Now, we want to know, within each combination of Year and variable, what percent had value="Satisfied" and what percent had value="Dissatisfied". For that, we only want to group by Year and variable.

Explain the `y=n.pos` line. This is where we calculate the y-position for each percent label. We want the label in the middle of each bar, but the bars are stacked. If we used just `cumsum(n)` the labels would be at the top of each bar section. We subtract `0.5*n` so that the y-position of each label will be reduced by half the height of the bar section containing that label.

Here's an example: Say we have three bar sections with heights 1, 2, and 3 (stacked from bottom to top in that order) and we want to calculate the y-positions for our labels.

```
h = 1:3
cumsum(h)            # 1 3 6
0.5 * h              # 0.5 1.0 1.5
cumsum(h) - 0.5 * h  # 0.5 2.0 4.5
```

This gives y-positions that vertically center the label within each bar section.

How can I order the x-axis columns in descending order of percentages? By default, ggplot orders a discrete x-axis by the ordering of the categories of the `x` variable. For a character variable, the ordering will be alphabetic. For a factor variable, the ordering will be the ordering of the levels of the factor.

In my example, the levels of `summary$variable` are as follows:

```
levels(summary$variable)
# [1] "Q1" "Q2"
```

To reorder by `pct`, one way would be with the `reorder` function. Compare these (using the summary data frame from above):

```
summary$pct2 = summary$pct + c(0.3, -0.15, -0.45, -0.4, -0.1, -0.2, -0.15, -0.1)

ggplot(summary, aes(x=variable, y=pct2, fill=value)) +
  geom_bar(position="stack", stat="identity") +
  facet_grid(~Year)

ggplot(summary, aes(x=reorder(variable, pct2), y=pct2, fill=value)) +
  geom_bar(position="stack", stat="identity") +
  facet_grid(~Year)
```

Notice that in the second plot, the order of "Q1" and "Q2" has now reversed. However, notice that in the left panel the Q1 stack is taller, while in the right panel the Q2 stack is taller. With faceting you get the same x-axis ordering in each panel. `reorder` determines that order by applying a summary function (the mean, by default) to all Q1 values and to all Q2 values across both panels. Because Q1 and Q2 have the same number of rows here, this gives the same order as comparing their sums. Q2 has the smaller value, so it goes first. The same happens when you use `position="dodge"`, but I used "stack" to make it easier to see what's happening. The examples below will hopefully help clarify things.

```
# Fake data
values = c(4.5,1.5,2,1,2,4)
# stringsAsFactors=TRUE so that levels() works in R 4.0 and later,
# where data.frame() no longer converts strings to factors by default
dat = data.frame(group1=rep(letters[1:3], 2),
                 group2=LETTERS[1:6],
                 group3=rep(c("W","Z"),3),
                 pct=values/sum(values),
                 stringsAsFactors=TRUE)

levels(dat$group2)
# [1] "A" "B" "C" "D" "E" "F"

# plot group2 in its factor order
ggplot(dat, aes(group2, pct)) +
  geom_bar(stat="identity", position="stack", colour="red", lwd=1)

# plot group2, ordered by -pct
ggplot(dat, aes(reorder(group2, -pct), pct)) +
  geom_bar(stat="identity", colour="red", lwd=1)

# plot group1 ordered by pct, with stacking
ggplot(dat, aes(reorder(group1, pct), pct)) +
  geom_bar(stat="identity", position="stack", colour="red", lwd=1)

# Note that in the next two examples, the x-axis order is b, a, c,
# regardless of whether you use faceting
ggplot(dat, aes(reorder(group1, pct), pct)) +
  geom_bar(stat="identity", position="stack", colour="red", lwd=1) +
  facet_grid(.~group3)

ggplot(dat, aes(reorder(group1, pct), pct, fill=group3)) +
  geom_bar(stat="identity", position="stack", colour="red", lwd=1)
```

For more on ordering axis values by setting factor orders, this blog post might be helpful.
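To inspect an ordering without drawing a plot, you can look at the factor levels directly. The sketch below is base R only and uses small made-up vectors `group` and `pct`. It relies on `reorder` applying `mean` within each level by default, and also shows setting the level order by hand.

```r
# Made-up data: three groups, two values each (illustrative only)
group <- factor(c("a", "b", "c", "a", "b", "c"))
pct   <- c(4.5, 1.5, 2, 1, 2, 4) / 15

# Default order of a factor built from strings is alphabetical
default_levels <- levels(group)               # "a" "b" "c"

# reorder() sorts levels by the mean of pct within each level (ascending)
asc_levels <- levels(reorder(group, pct))     # "b" "a" "c"

# Negating the values gives descending order
desc_levels <- levels(reorder(group, -pct))   # "c" "a" "b"

# Levels can also be set explicitly
manual <- factor(group, levels = c("c", "a", "b"))

print(list(default_levels, asc_levels, desc_levels, levels(manual)))
```

Passing any of these reordered factors as the `x` aesthetic should give the corresponding axis order, since ggplot follows the factor's level order.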