eastafri eastafri - 3 months ago 7
R Question

Insert missing values (rows) into a data frame

I have a dataframe of this nature generated with a dplyr summary function.

pos nuc sample total
23 A 10028_1#2 3
23 C 10028_1#2 1
23 G 10028_1#2 5129
23 T 10028_1#2 128
231 C 10028_1#2 4
231 T 10028_1#2 3123
.
.


A bar plot of this data with ggplot2 gives 'uneven' bars because pos 231 has no A and G totals for the corresponding sample name. The data are generated by a program outside of R, and these values are simply missing from its output.

What would be an idiomatic way of inserting 0 totals for each missing nucleotide (A, T, G, C) at each position for each corresponding sample? In other words, how do I get this data frame?

pos nuc sample total
23 A 10028_1#2 3
23 C 10028_1#2 1
23 G 10028_1#2 5129
23 T 10028_1#2 128
231 C 10028_1#2 4
231 T 10028_1#2 3123
231 G 10028_1#2 0
231 A 10028_1#2 0

Answer

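We can use complete from tidyr. It expands the data to every combination of pos, nuc and sample, and fill sets total to 0 in the rows it adds: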

library(dplyr)
library(tidyr)
df1 %>% 
    complete(pos, nuc, nesting(sample), fill = list(total = 0))
#  pos   nuc    sample total    
#  <int> <chr>     <chr> <dbl>
#1    23     A 10028_1#2     3
#2    23     C 10028_1#2     1
#3    23     G 10028_1#2  5129
#4    23     T 10028_1#2   128
#5   231     A 10028_1#2     0
#6   231     C 10028_1#2     4
#7   231     G 10028_1#2     0
#8   231     T 10028_1#2  3123
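
One caveat: complete only knows about the nuc values that occur somewhere in the data, so a nucleotide that is absent at every position would still be missing. A minimal sketch of one way around that, assuming the alphabet is always A, C, G, T, is to pass the values explicitly. The data frame df2 below is hypothetical and only serves to illustrate the case:

```r
library(tidyr)

# Hypothetical input (not from the question): "A" and "G" never occur
# at any position, so complete(pos, nuc, ...) alone could not add them.
df2 <- data.frame(
  pos    = c(23L, 23L, 231L, 231L),
  nuc    = c("C", "T", "C", "T"),
  sample = "10028_1#2",
  total  = c(1L, 128L, 4L, 3123L),
  stringsAsFactors = FALSE
)

# Naming the argument and supplying the full set of nucleotides makes
# complete() build the grid from those values instead of the observed ones.
# 0L keeps the fill value an integer, like the existing totals.
filled <- complete(
  df2,
  pos,
  nuc = c("A", "C", "G", "T"),
  sample,
  fill = list(total = 0L)
)
```

Here filled has one row per pos/nuc/sample combination, with the A and G rows carrying a total of 0.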

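Or we can use expand.grid/merge from base R. expand.grid builds all combinations of the first three columns, merge with all.x=TRUE leaves total as NA for the combinations not in df1, and replace turns those NAs into 0: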

transform(merge(expand.grid(lapply(df1[1:3], unique)), 
         df1, all.x=TRUE), total = replace(total, is.na(total), 0))
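
The one-liner above can be hard to read, so here is the same idea unrolled into separate steps. This is only a sketch: the intermediate names grid and merged are made up for illustration, and df1 is rebuilt with data.frame so the block runs on its own.

```r
# Same values as df1 in the data section, repeated so this runs standalone
df1 <- data.frame(
  pos    = c(23L, 23L, 23L, 23L, 231L, 231L),
  nuc    = c("A", "C", "G", "T", "C", "T"),
  sample = "10028_1#2",
  total  = c(3L, 1L, 5129L, 128L, 4L, 3123L),
  stringsAsFactors = FALSE
)

# 1. Every combination of the unique pos, nuc and sample values
grid <- expand.grid(lapply(df1[1:3], unique), stringsAsFactors = FALSE)

# 2. Left join on the shared columns: combinations that are absent
#    from df1 come back with total = NA
merged <- merge(grid, df1, all.x = TRUE)

# 3. Replace the NA totals with 0 (0L keeps the column integer)
merged$total[is.na(merged$total)] <- 0L
```

Like the tidyr version, this only fills in combinations of values that appear somewhere in df1.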

data

df1 <- structure(list(pos = c(23L, 23L, 23L, 23L, 231L, 231L), 
 nuc = c("A", 
"C", "G", "T", "C", "T"), sample = c("10028_1#2", "10028_1#2", 
"10028_1#2", "10028_1#2", "10028_1#2", "10028_1#2"), total = c(3L, 
1L, 5129L, 128L, 4L, 3123L)), .Names = c("pos", "nuc", "sample", 
"total"), class = "data.frame", row.names = c(NA, -6L))