Mahesh Yadav - 1 year ago 170
R Question

# transforming range data to mean in R

Data such as age is often given in ranges. I want to calculate the mean of each range. I am able to calculate it, but I feel there is a more elegant and perhaps faster way.

Here is the working example:

``````
age <- c("0-10", "11-20", "21-30", "31-40") # define the age vector in ranges
age_split <- strsplit(age, "-") # gives the list with splits

for (ii in 1:length(age)) {
  age[ii] <- mean(as.numeric(unlist(age_split[ii])))
}
print(age)
## [1] "5"    "15.5" "25.5" "35.5"
``````
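The printed result above is a character vector, because the loop writes the means back into `age`, which holds strings. A minimal variation, assuming a separate numeric vector (the name `age_mean` is hypothetical and not part of the original code), keeps the result numeric:

```r
age <- c("0-10", "11-20", "21-30", "31-40")
age_split <- strsplit(age, "-", fixed = TRUE)

# Preallocate a numeric vector so the means are not coerced to strings
age_mean <- numeric(length(age))
for (ii in seq_along(age)) {
  age_mean[ii] <- mean(as.numeric(age_split[[ii]]))
}
print(age_mean)
## [1]  5.0 15.5 25.5 35.5
```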

Based on suggestions from lmo and akron, here is code that can be used to compare the performance of the various methods:

``````
irows <- 100000
data1 <- paste0(sample(1:10, irows, replace = TRUE), "-", sample(11:20, irows, replace = TRUE))
data2 <- data1; data3 <- data1; data4 <- data1 # replicated for testing different methods

# method 1 (originally proposed)
a1 <- Sys.time()
age_split <- strsplit(data1, "-")
for (ii in 1:length(data1)) {
  data1[ii] <- mean(as.numeric(unlist(age_split[ii])))
}
Sys.time() - a1

# method 2 (lmo suggestion)
a2<-Sys.time()
data2 <- sapply(strsplit(data2, split="-"), function(i) mean(as.numeric(i)))
Sys.time()-a2

# method 3 (cue from akron)
a3<-Sys.time()
age_split_matrix <-do.call(rbind, strsplit(data3,"-"))
class(age_split_matrix) <- "numeric"
data3<-rowMeans(age_split_matrix)
Sys.time()-a3

# method 4 (akron proposed)
a4<-Sys.time()
Sys.time()-a4

# validating if outputs match
all.equal(as.numeric(data1),data2)
all.equal(as.numeric(data1),data3)
all.equal(as.numeric(data1),data4)
``````
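The `Sys.time()` bookkeeping above can also be written with base R's `system.time()`, which wraps an expression and reports its elapsed time. The following is only a sketch on a smaller input; the names `x`, `t2`, `t3`, `res2` and `res3` are hypothetical, and the seed is an assumption added for reproducibility:

```r
set.seed(1)    # assumption: fixed seed so repeated runs use the same data
irows <- 1000  # small size for a quick check; increase for real timings
x <- paste0(sample(1:10, irows, replace = TRUE), "-",
            sample(11:20, irows, replace = TRUE))

# method 2: sapply over the split list
t2 <- system.time(
  res2 <- sapply(strsplit(x, split = "-"), function(i) mean(as.numeric(i)))
)

# method 3: bind the splits into a matrix, convert to double, take row means
t3 <- system.time({
  m <- do.call(rbind, strsplit(x, "-"))
  storage.mode(m) <- "double"
  res3 <- rowMeans(m)
})

print(t2["elapsed"])
print(t3["elapsed"])
print(all.equal(res2, res3))
```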

With `irows` = 100K, the times taken by methods 1 to 4 were: (1) 2.5 s, (2) 1.4 s, (3) 0.34 s, (4) 6.3 s. With `irows` = 1 million, the times were: (1) 23 s, (2) 14 s, (3) 6 s, (4) very long. With `irows` = 10 million, the times were: (1) 3.9 min, (2) 2.9 min, (3) very long. This makes me conclude that `read.table` is really slow. Method 3 takes a lot of memory.
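The body of method 4 does not appear in the benchmark listing. Since the slow timing is attributed to `read.table`, one plausible form is to parse the ranges as a two-column table and take row means. This is an assumption for illustration, not necessarily the code that was actually timed, and the names `ranges`, `parsed` and `range_means` are hypothetical:

```r
# Assumed read.table-based approach; not the original method 4 code
ranges <- c("0-10", "11-20", "21-30", "31-40")

# Treat "-" as the field separator, giving two numeric columns (V1, V2)
parsed <- read.table(text = ranges, sep = "-", header = FALSE)

# Row means of the two columns; unname() drops the row-name labels
range_means <- unname(rowMeans(parsed))
print(range_means)
```

Parsing each line into a data frame involves far more overhead than splitting strings, which would be consistent with the timings reported for method 4.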

Here is a one liner with `sapply`:
``````
sapply(strsplit(age, split="-"), function(i) mean(as.numeric(i)))
``````

`strsplit` splits the strings on "-" and returns a list. That list is fed to `sapply`, which takes each list item, converts it to numeric, and calculates its mean.
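Because each range has exactly two endpoints, its mean is simply the midpoint, so the intermediate list can be avoided altogether. The following is a sketch of that alternative, not part of the original answer; the helper name `range_midpoint` is hypothetical, and it assumes every string has the form "low-high" with non-negative numbers and a single hyphen:

```r
# Hypothetical helper: midpoint of "low-high" strings using vectorised calls only
range_midpoint <- function(x) {
  lower <- as.numeric(sub("-.*$", "", x))  # text before the hyphen
  upper <- as.numeric(sub("^.*-", "", x))  # text after the hyphen
  (lower + upper) / 2
}

age <- c("0-10", "11-20", "21-30", "31-40")
midpoints <- range_midpoint(age)
print(midpoints)
## [1]  5.0 15.5 25.5 35.5
```

Since `sub` and `as.numeric` operate on the whole vector at once, no per-element function call is needed. Whether this is actually faster on large inputs would have to be measured.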