Sam -4 years ago 91

R Question

Biologist and ggplot2 beginner here. I have a relatively large dataset of DNA sequence data (millions of short DNA fragments) which I first need to filter for quality for each sequence. I would like to illustrate how many of my reads are getting filtered out with a stacked bar plot using ggplot2.

I have figured out that ggplot likes the data in long format and have successfully reformatted it with the `melt` function from reshape2.

This is what a subset of the data looks like at the moment:

    library  sample  filter  value
    LIB0     0011a   F1      1272707
    LIB0     0018a   F1      1505554
    LIB0     0048a   F1      1394718
    LIB0     0095a   F1      2239035
    LIB0     0011a   F2      250000
    LIB0     0018a   F2      10000
    LIB0     0048a   F2      10000
    LIB0     0095a   F2      10000
    LIB0     0011a   P       2118559
    LIB0     0018a   P       2490068
    LIB0     0048a   P       2371131
    LIB0     0095a   P       3446715
    LIB1     0007b   F1      19377
    LIB1     0010b   F1      79115
    LIB1     0011b   F1      2680
    LIB1     0007b   F2      10000
    LIB1     0010b   F2      10000
    LIB1     0011b   F2      10000
    LIB1     0007b   P       290891
    LIB1     0010b   P       1255638
    LIB1     0011b   P       4538

library and sample are my ID variables (the same sample can be in multiple libraries). 'F1' and 'F2' mean that this many reads were filtered out during this step, 'P' means the remaining number of sequence reads after filtering.
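For anyone who wants to reproduce the problem without the original file, the subset above can be rebuilt by hand. This is only a sketch: the values are copied from the table, the column types are an assumption (`sample` is kept as character so the leading zeros survive), and `totals` is just an illustrative name.

```r
# Rebuild the subset shown above; all values are copied from the table.
testdata <- data.frame(
  library = rep(c("LIB0", "LIB1"), times = c(12, 9)),
  sample  = c(rep(c("0011a", "0018a", "0048a", "0095a"), times = 3),
              rep(c("0007b", "0010b", "0011b"), times = 3)),
  filter  = c(rep(c("F1", "F2", "P"), each = 4),
              rep(c("F1", "F2", "P"), each = 3)),
  value   = c(1272707, 1505554, 1394718, 2239035,   # LIB0, F1
              250000, 10000, 10000, 10000,          # LIB0, F2
              2118559, 2490068, 2371131, 3446715,   # LIB0, P
              19377, 79115, 2680,                   # LIB1, F1
              10000, 10000, 10000,                  # LIB1, F2
              290891, 1255638, 4538),               # LIB1, P
  stringsAsFactors = FALSE  # assumption: keep IDs as plain character
)

# Total reads per sample within each library (F1 + F2 + P)
totals <- aggregate(value ~ library + sample, testdata, sum)
print(totals)
```
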

I have figured out how to make a basic stacked bar plot, but I cannot work out how to reorder the factor on the x-axis so that the bars are sorted in descending order of the sum of F1, F2 and P. As it stands, I think they are sorted alphabetically by sample name within each library.

    testdata <- read.csv('testdata.csv', header = TRUE, sep = '\t')

    ggplot(testdata, aes(x = sample, y = value, fill = filter)) +
      geom_bar(stat = 'identity') +
      facet_wrap(~library, scales = 'free')

After some googling I found out about the aggregate function that gives me the total for each sample per library:

`aggregate(value ~ library+sample, testdata, sum)`

      library sample   value
    1    LIB1  0007b  320268
    2    LIB1  0010b 1344753
    3    LIB0  0011a 3641266
    4    LIB1  0011b   17218
    5    LIB0  0018a 4005622
    6    LIB0  0048a 3775849
    7    LIB0  0095a 5695750

While this does give me the totals, I now have no idea how I can use this to reorder the factors, especially since there are two I need to consider (library and sample).

So I guess my question boils down to:

How can I order my samples in my graph based on the total of F1, F2 and P for each library?

Thank you very much for any pointers you can give me!


Answer Source

You are almost there. You need to set the factor levels of `testdata$sample` based on the aggregated data (I assume no sample name appears in both LIB0 and LIB1):

```
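library(ggplot2)
# Total reads (F1 + F2 + P) per library/sample combination
df <- aggregate(value ~ library + sample, testdata, sum)
# Order the sample levels by descending total
testdata$sample <- factor(testdata$sample, levels = df$sample[order(-df$value)])
ggplot(testdata, aes(x = sample, y = value, fill = filter)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~library, scales = 'free')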
```
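If a sample name does appear in more than one library (the question says this can happen), `df$sample` would contain repeated names, and as far as I know recent versions of R refuse duplicated factor levels. One possible workaround, sketched here on made-up numbers, is to order a combined library/sample key instead and show only the sample name on the axis. The toy data, the `key` column and `key_labels` are hypothetical names for illustration; `reorder()` is base R, and the plotting part only runs if ggplot2 is installed.

```r
# Toy data (made up): sample "0011a" occurs in both libraries.
testdata <- data.frame(
  library = c("LIB0", "LIB0", "LIB0", "LIB0", "LIB1", "LIB1", "LIB1", "LIB1"),
  sample  = c("0011a", "0011a", "0018a", "0018a", "0011a", "0011a", "0007b", "0007b"),
  filter  = rep(c("F1", "P"), times = 4),
  value   = c(100, 900, 300, 1700, 50, 150, 20, 480),
  stringsAsFactors = FALSE
)

# One key per library/sample combination, so the levels stay unique.
testdata$key <- factor(paste(testdata$library, testdata$sample, sep = "_"))

# reorder() sorts the levels by a per-level summary of a second variable;
# summing the negated values yields descending totals.
testdata$key <- reorder(testdata$key, -testdata$value, FUN = sum)
levels(testdata$key)
# "LIB0_0018a" "LIB0_0011a" "LIB1_0007b" "LIB1_0011a"

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Map each key back to its plain sample name for the axis labels.
  key_labels <- setNames(testdata$sample, as.character(testdata$key))
  key_labels <- key_labels[!duplicated(names(key_labels))]
  p <- ggplot(testdata, aes(x = key, y = value, fill = filter)) +
    geom_bar(stat = "identity") +
    scale_x_discrete(labels = key_labels) +
    facet_wrap(~library, scales = "free")
  print(p)
}
```

With `scales = "free"`, each facet only shows the keys that belong to its own library, so the bars end up in descending order within each panel.
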
