aminards - 2 months ago
R Question

Select rows of a data frame based on string match then sum and transform those rows

I have a list of 18 data frames, each one with a different number of rows and random blank rows throughout, structured like this:

l = c("D1", "D1", "D1", "", "D1", "D1", "D1", "D2", "D2", "D2", "D2", "", "D3", "D3", "D3", "D3")
a = c("Al", "Al", "St", "", "St", "Un", "St", "Al", "Al", "St", "St", "", "Al", "Al", "St", "St")
b = c(6000, 4980, 123, "", 98, 87, 51, 10989, 8756, 457, 233, "", 989, 743, 67, 55)
mydf = data.frame("Location" = l, "Name" = a, "count" = b)
mydf
   Location Name count
1        D1   Al  6000
2        D1   Al  4980
3        D1   St   123
4
5        D1   St    98
6        D1   Un    87
7        D1   St    51
8        D2   Al 10989
9        D2   Al  8756
10       D2   St   457
11       D2   St   233
12
13       D3   Al   989
14       D3   Al   743
15       D3   St    67
16       D3   St    55


My data with 18 data frames:

sapply(mydata, dim)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
[1,]  171  146  132  147  149  148  138  143  114   111   115   101   112   218   122    96   156   128
[2,]   10   10   10   10   10   10   10   10   10    10    10    10    10    10    10    10    10    10


I want to use transform to add a column to my data frame with the sum of the Al counts for each location. I plan on filling the entire new column with the same value. I thought I could find those rows that contain the string "Al" and then apply the sum function but this isn't working. Here is the code I am trying to use:

add.al.sum = function(df){
als = df[grep("Al", df$Name),]
alsum = sum(als, na.rm = TRUE)
transform(df, Al.sum = c(alsum))
}
mydf = lapply(mydf, function(x) add.al.sum(x))


This doesn't do what I want. I want my transformed data table to look like this; preserving the blank lines is not necessary.

   Location Name count Al.sum
1        D1   Al  6000  10980
2        D1   Al  4980  10980
3        D1   St   123  10980
4                          NA
5        D1   St    98  10980
6        D1   Un    87  10980
7        D1   St    51  10980
8        D2   Al 10989  19745
9        D2   Al  8756  19745
10       D2   St   457  19745
11       D2   St   233  19745
12                         NA
13       D3   Al   989   1732
14       D3   Al   743   1732
15       D3   St    67   1732
16       D3   St    55   1732


It actually tells me "Error in df$Name : $ operator is invalid for atomic vectors". The only difference between my real data and my example data is that the count column in my real data is listed as an int instead of a num. I need this to work with the int values in the count column.

After I get the sums of the Al counts I will repeat the process to get the sums of the St and Un counts in their own respective columns.

EDIT: I have expanded the example data and added a bit more information about the list of data frames I am working with.

Answer

This uses no packages and is only one line of code. (Name == 'Al') * count equals count for those rows in which Name equals 'Al' and equals 0 for all other rows. (Alternatively, we could have replaced that expression with ifelse(Name == 'Al', count, 0) ). We then use ave to sum these values by Location. Note that we are using mydf0 as defined in Note 2 at the bottom.

transform(mydf0, Al.sum = ave((Name == 'Al') * count, Location, FUN = sum))

giving:

   Location Name count Al.sum
1        D1   Al  6000  10980
2        D1   Al  4980  10980
3        D1   St   123  10980
4        D1   St    98  10980
5        D1   Un    87  10980
6        D1   St    51  10980
7        D2   Al 10989  19745
8        D2   Al  8756  19745
9        D2   St   457  19745
10       D2   St   233  19745
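To make the masking step concrete, here is a small sketch on a made-up data frame (toy is hypothetical, not the question's data). The count column is deliberately integer, as in the real data, to show that the same expression should work with int values.

```r
# Hypothetical toy data; count is integer, as in the asker's real data
toy <- data.frame(Location = c("D1", "D1", "D1", "D2", "D2"),
                  Name     = c("Al", "St", "Al", "Al", "Un"),
                  count    = c(10L, 5L, 7L, 3L, 4L),
                  stringsAsFactors = FALSE)

# TRUE/FALSE is coerced to 1/0, so only the Al counts survive
masked <- with(toy, (Name == "Al") * count)
masked
# [1] 10  0  7  3  0

# The ifelse() form yields the same values
same <- all(masked == with(toy, ifelse(Name == "Al", count, 0)))
same
# [1] TRUE

# ave() sums the masked counts within each Location and repeats the total per row
toy$Al.sum <- ave(masked, toy$Location, FUN = sum)
toy$Al.sum
# [1] 17 17 17  3  3
```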

The example data had no NA values so we did not bother with the sum(..., na.rm = TRUE) but if this is to be extended to other data where there are NA values that are to be removed then replace sum with function(x) sum(x, na.rm = TRUE) .
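A minimal sketch of that substitution, using a hypothetical data frame toy_na with one missing count (the names toy_na, plain and clean are made up for illustration):

```r
# Hypothetical data with one NA in count
toy_na <- data.frame(Location = c("D1", "D1", "D2"),
                     Name     = c("Al", "Al", "Al"),
                     count    = c(10, NA, 3),
                     stringsAsFactors = FALSE)

# Plain sum: the NA propagates to every row of D1
plain <- transform(toy_na,
                   Al.sum = ave((Name == "Al") * count, Location, FUN = sum))
plain$Al.sum
# [1] NA NA  3

# Dropping NAs inside the summing function
clean <- transform(toy_na,
                   Al.sum = ave((Name == "Al") * count, Location,
                                FUN = function(x) sum(x, na.rm = TRUE)))
clean$Al.sum
# [1] 10 10  3
```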

list

The question mentions multiple data frames but does not give an example; so let us assume that we have the list L of data frames shown below. Then:

L <- list(mydf0, mydf0)

lapply(L, transform, Al.sum = ave((Name == 'Al') * count, Location, FUN = sum))
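Since the question also wants St and Un sums in their own columns, one possible extension is sketched below. The helper add_name_sums and the stand-in data frame toy are hypothetical names, and the sketch assumes each data frame has already been cleaned as in Note 2.

```r
# Hypothetical helper: adds one <name>.sum column per entry in names_wanted
add_name_sums <- function(df, names_wanted = c("Al", "St", "Un")) {
  for (nm in names_wanted) {
    df[[paste0(nm, ".sum")]] <- ave((df$Name == nm) * df$count,
                                    df$Location, FUN = sum)
  }
  df
}

# Small stand-in for one cleaned data frame (same columns as mydf0)
toy <- data.frame(Location = c("D1", "D1", "D1", "D2", "D2"),
                  Name     = c("Al", "St", "Un", "Al", "St"),
                  count    = c(10, 5, 2, 3, 4),
                  stringsAsFactors = FALSE)

# Apply the helper to every data frame in a list
L2 <- lapply(list(toy, toy), add_name_sums)

L2[[1]]$Al.sum
# [1] 10 10 10  3  3
L2[[1]]$Un.sum   # D2 has no Un rows, so its total is 0
# [1] 2 2 2 0 0
```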

Note: We can fix the solution in the question like this. grepl results in a logical vector which is regarded as a 0/1 vector when multiplying so the non-Al counts are zeroed. Then by applies this by Location and rbind puts the pieces (one per Location) together.

add.al.sum = function(df) {
    transform(df, Al.sum = sum(grepl("Al", Name) * count, na.rm = TRUE))
}
do.call("rbind", by(mydf0, mydf0$Location, add.al.sum))
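As for the error message in the question, it most likely comes from calling lapply on a single data frame rather than on a list of data frames. A data frame is itself a list of columns, so lapply hands each column vector to the function, and $ fails on an atomic vector. A small sketch with a hypothetical toy data frame:

```r
# Hypothetical two-row data frame
toy <- data.frame(Location = c("D1", "D2"), Name = c("Al", "St"),
                  count = c(1, 2), stringsAsFactors = FALSE)

# lapply() over a data frame visits its columns, not the data frame as a whole,
# so df$Name is evaluated on a plain vector and fails
res <- tryCatch(lapply(toy, function(df) df$Name),
                error = function(e) conditionMessage(e))
res
# [1] "$ operator is invalid for atomic vectors"

# Wrapping the data frames in a list makes lapply() pass whole data frames
ok <- lapply(list(toy, toy), function(df) df$Name)
ok[[1]]
# [1] "Al" "St"
```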

Note 2: Fix up your data frames first:

mydf0 <- mydf # preserve mydf just in case
mydf0[] <- lapply(mydf0, as.character) # make all cols character
mydf0 <- transform(mydf0, count = as.numeric(count)) # make count numeric
mydf0 <- subset(mydf0, Location != "") # remove blank lines

Even easier might be to skip blank lines and keep the columns as character when the data is first read in, but that depends on how the files are read.
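As a sketch of that idea, assuming the data comes from CSV files in which the blank rows are truly empty lines (the CSV text below is made up; a row such as ",," would not count as blank and would still need the subset step):

```r
# Hypothetical CSV content; in practice this would be a file path
csv_text <- "Location,Name,count
D1,Al,6000
D1,Al,4980

D1,St,123
D2,Al,10989
"

# blank.lines.skip = TRUE (the default) drops the empty line;
# stringsAsFactors = FALSE keeps Location and Name as character
df <- read.csv(text = csv_text, stringsAsFactors = FALSE,
               blank.lines.skip = TRUE)

nrow(df)          # 4 rows, the empty line is gone
class(df$count)   # "integer"

# The one-liner then works directly on the integer count column
out <- transform(df, Al.sum = ave((Name == "Al") * count, Location, FUN = sum))
out$Al.sum
# [1] 10980 10980 10980 10989
```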
