trotta trotta - 3 months ago 24
R Question

R: change one value every row in big dataframe

I just started working with R for my master thesis and up to now all my calculations worked out as I read a lot of questions and answers here (and it's a lot of trial and error, but thats ok).
Now i need to process a more sophisticated code and i can't find a way to do this.

Thats the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value every row. The new value should be the amount of the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID'-variable is the same.

As this is the third day I'm trying to solve this, I already found and tested

for
and
apply
but both are running for hours (canceled after three hours).
Here is an example of one of my attempts (with
for
):

for (i in 1:50000) {
Entry_ID <- Sub02[i,4]
SUM_Entries <- sum(Sub03$Source==Entry_ID)
Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character

Value1 <- as.numeric(Entries_w_ID$VAL1)
SUM_Value1 <- sum(Value1)
Value2 <- as.numeric(Entries_w_ID$VAL2)
SUM_Value2 <- sum(Value2)

OLD_Val1 <- Sub02[i,13]
OLD_Val <- as.numeric(OLD_Val1)
NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val

Sub02[i,13] <- NEW_Val

}


I know this might be a silly code, but thats the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get along with my thesis.
Thank you!

EDIT:
Heres an example of my data-structure:

Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9


The output I expect is an updated value 'VAL9' in every row.

Answer

From what I understood so far, you need 2 things:

  1. sum up some values in one dataset
  2. add them to another dataset, using an ID variable

Besides what @yoland already contributed, I would suggest to break it down in two separate tasks. Consider these two datasets:

a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
#   x id
# 1 1  a
# 2 2  b

b = data.frame(values = as.character(1:4), otherid = letters[1:2], 
               stringsAsFactors = FALSE)
sapply(b, class)
#      values     otherid 
# "character" "character"

Values is character now, we need to convert it to numeric:

b$values = as.numeric(b$values)
sapply(b, class)
#    values     otherid 
# "numeric" "character"

Then sum up the values in b (grouped by otherid):

library(dplyr)

b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
#   otherid sum_values
#     <chr>      <dbl>
# 1       a          4
# 2       b          6

Then join it with a - note that identifiers are specified in c():

ab = left_join(a, b, by = c("id" = "otherid"))
ab
#   x id sum_values
# 1 1  a          4
# 2 2  b          6

We can then add the result of the sum from b to the variable x in a:

ab$total = ab$x + ab$sum_values
ab
#   x id sum_values total
# 1 1  a          4     5
# 2 2  b          6     8

(Updated.)