Dr.Jay - 20 days ago
R Question

How to run a for loop faster in R when dealing with a large data frame

My data frame looks something like this:

Code-1 Type Year Code-2
AB1034510 Type-A 2014 501324
AB1034927 Type-C 2013 501324
AB1039701 Type-B 2012 501325
AB1036802 Type-D 1998 501325
AB1031649 Type-F 2016 501328


The real data frame, however, has about 4.5 million rows and 12 columns.

I was trying to run a for loop that finds the rows sharing the same Code-2 value, takes the oldest (smallest) year among them, and assigns that year to all of those rows. The result should look like this:

Code-1 Type Year Code-2
AB1034510 Type-A 2013 501324
AB1034927 Type-C 2013 501324
AB1039701 Type-B 1998 501325
AB1036802 Type-D 1998 501325
AB1031649 Type-F 2016 501328


To do this, I tried the following code:

for (n in 1:nrow(df)) {
  same.code2 <- which(df[n, 4] == df[, 4])
  min.year <- min(df[same.code2, 3])
  df[same.code2, 3] <- min.year
}


But either I have done something wrong, or the code simply takes too long to run.

Any help pretty please?

Answer

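Use data.table. It's fast and simple: a grouped assignment computes the minimum once per Code-2 value and updates Year in place, whereas the loop rescans the entire Code-2 column for every one of the 4.5 million rows.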

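library(data.table)

# Small sample with the same columns as in the question
dt <- data.table("Code-1" = c('AB1034510', 'AB1034927', 'AB1039701', 'AB1036802'),
                 Type = c('Type-A', 'Type-C', 'Type-B', 'Type-D'),
                 Year = c(2014, 2013, 2012, 1998),
                 "Code-2" = c(501324,501324,501325,501325))

# For each Code-2 group, overwrite Year with the group's minimum (by reference)
dt[, Year := min(Year), by = 'Code-2']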

Data before:

      Code-1   Type Year Code-2
1: AB1034510 Type-A 2014 501324
2: AB1034927 Type-C 2013 501324
3: AB1039701 Type-B 2012 501325
4: AB1036802 Type-D 1998 501325

And afterwards:

      Code-1   Type Year Code-2
1: AB1034510 Type-A 2013 501324
2: AB1034927 Type-C 2013 501324
3: AB1039701 Type-B 1998 501325
4: AB1036802 Type-D 1998 501325
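
To apply the same idea to a data frame that is already loaded, such as the one in the question, something like the following should work. This is only a minimal sketch: the name df and the five sample rows are stand-ins for the real 4.5-million-row data, and it assumes the data.table package is installed. It also computes the same result with base R's ave(), in case adding a package is not an option.

```r
library(data.table)

# Stand-in for the large data frame from the question (sample rows only).
df <- data.frame(
  "Code-1" = c("AB1034510", "AB1034927", "AB1039701", "AB1036802", "AB1031649"),
  Type     = c("Type-A", "Type-C", "Type-B", "Type-D", "Type-F"),
  Year     = c(2014, 2013, 2012, 1998, 2016),
  "Code-2" = c(501324, 501324, 501325, 501325, 501328),
  check.names = FALSE  # keep the hyphenated column names unchanged
)

# Base R alternative: ave() applies FUN within each Code-2 group and
# returns a vector aligned with the original rows.
base_year <- ave(df$Year, df[["Code-2"]], FUN = min)

# data.table route: convert the data frame in place (no copy), then
# assign each group's minimum Year by reference.
setDT(df)
df[, Year := min(Year), by = "Code-2"]

print(df)
```

Both routes do one pass per group rather than one scan of the whole column per row. Note that setDT() changes df itself into a data.table; use as.data.table(df) instead if the original data frame has to stay untouched.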