Mahesh Yadav - 1 year ago 96

R Question

Data such as age is often given in ranges, and I want to calculate the mean of each range. I am able to calculate it, but I feel there is a more elegant and perhaps faster way.

Here is the working example:

```
age <- c("0-10", "11-20", "21-30", "31-40") # define the age vector in ranges
age_split <- strsplit(age, "-") # gives the list with splits
for (ii in 1:length(age)) {
  age[ii] <- mean(as.numeric(unlist(age_split[ii])))
}
print(age)
## [1] "5" "15.5" "25.5" "35.5"
```

Based on suggestions from lmo and akron, here is code for comparing the performance of the various methods:

```
irows = 100000
data1 <- paste0(sample(1:10, irows, replace = TRUE), "-", sample(11:20, irows, replace = TRUE))
data2 <- data1; data3 <- data1; data4 <- data1 # replicated for testing different methods

# method 1 (originally proposed)
a1 <- Sys.time()
age_split <- strsplit(data1, "-")
for (ii in 1:length(data1)) {
  data1[ii] <- mean(as.numeric(unlist(age_split[ii])))
}
Sys.time() - a1

# method 2 (lmo suggestion)
a2 <- Sys.time()
data2 <- sapply(strsplit(data2, split = "-"), function(i) mean(as.numeric(i)))
Sys.time() - a2

# method 3 (cue from akron)
a3 <- Sys.time()
age_split_matrix <- do.call(rbind, strsplit(data3, "-"))
class(age_split_matrix) <- "numeric"
data3 <- rowMeans(age_split_matrix)
Sys.time() - a3

# method 4 (akron proposed)
a4 <- Sys.time()
data4 <- rowMeans(read.table(text = data4, sep = "-"))
Sys.time() - a4

# validating if outputs match
all.equal(as.numeric(data1), data2)
all.equal(as.numeric(data1), data3)
all.equal(as.numeric(data1), data4)
```

With `irows` = 100K, the times for methods 1 to 4 were (1) 2.5 s, (2) 1.4 s, (3) 0.34 s, and (4) 6.3 s. With `irows` = 1 million, they were (1) 23 s, (2) 14 s, (3) 6 s, and (4) very long. With `irows` = 10 million, they were (1) 3.9 min, (2) 2.9 min, and (3) very long. From this I conclude that `read.table` is really slow, and that method 3 takes a lot of memory.
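For comparison with the four methods above, here is a minimal sketch of a fully vectorized alternative that skips the per-element list from `strsplit`. It is not one of the methods timed above, and no timing is claimed for it. The helper name `range_mean` is hypothetical, and the sketch assumes every string holds exactly two non-negative numbers separated by a single "-".

```r
# Hypothetical helper, not one of the four methods timed above.
# Assumes each element looks like "lower-upper" with non-negative numbers.
range_mean <- function(x) {
  lower <- as.numeric(sub("-.*$", "", x))     # text before the first "-"
  upper <- as.numeric(sub("^[^-]*-", "", x))  # text after the first "-"
  (lower + upper) / 2                         # midpoint of each range
}

age <- c("0-10", "11-20", "21-30", "31-40")
range_mean(age)
## [1]  5.0 15.5 25.5 35.5
```

Whether this is actually faster or lighter on memory would need to be checked by timing it with the same `Sys.time()` setup on the same data.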

Answer Source

Here is a one-liner with `sapply`:

```
sapply(strsplit(age, split="-"), function(i) mean(as.numeric(i)))
[1] 5.0 15.5 25.5 35.5
```

`strsplit` splits the strings on "-" and returns a list, which is fed to `sapply`. `sapply` then takes each list item, converts the character vector to numeric, and calculates its mean.
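To make those steps easier to follow, here is a small sketch that keeps the intermediate list in a variable and swaps `sapply` for `vapply`, which does the same job but checks that each result is a single numeric value. The variable names `parts` and `age_mean` are only illustrative, and `fixed = TRUE` is an optional addition that treats "-" as a literal string rather than a regular expression.

```r
age <- c("0-10", "11-20", "21-30", "31-40")

# strsplit returns a list with one character vector per input string,
# e.g. the first element is c("0", "10")
parts <- strsplit(age, split = "-", fixed = TRUE)

# vapply applies the function to each list item and enforces that
# every result is exactly one numeric value
age_mean <- vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
age_mean
## [1]  5.0 15.5 25.5 35.5
```

Unlike the loop in the question, the result here is a numeric vector rather than a character vector, so it can be used in further calculations without calling `as.numeric`.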