gedehu gedehu - 9 months ago 54
R Question

Repetitive Action Over Ten Matrices in R

I have ten datasets, and each dataset contains "rating" and "occupation" columns. From each of those ten datasets I want to compute the average rating for each of three occupation groups (artist, technician, marketing).

The code I have written is as follows:

##interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)
Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
                                          occupation=as.character(),
                                          average.rating=as.numeric())

Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)

e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"artist",mean(e.1.artist$rating)))

e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"technician",mean(e.1.technician$rating)))

e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"marketing",mean(e.1.marketing$rating)))

This is clearly inefficient: for ten datasets I would have to repeat the same code nine more times to get the average rating for each of those occupation groups. Is there a better way to do this? I read that apply/lapply might help, but I could not figure out how to use them in my case.

Two of my datasets (e1 and e2) can be found here. (I have only included 10% of the entire observations in each)

Answer Source

I recommend the "plyr" package for this kind of manipulation; it is well worth the investment of an hour or so to learn. In your case, I loaded your first example dataset in "d1", and I can summarise it like so:

library(plyr)
ddply(d1, .(occupation), summarise, mean_rating=mean(rating))

This shows the results for all occupations, and you only wanted a specific three, so we can filter it to those:

ddply(subset(d1, occupation %in% c('artist','technician','marketing')), .(occupation), summarise, mean_rating=mean(rating))

Now we just need to generalize it to running over 10 datasets without cut and paste. Let's store our data frames inside a list:

dataset_list <- list(d1=d1) # you would put all of them here; I just have one
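
If the ten data frames are already loaded under names like e.1, e.2, and so on, the list can be built without typing each name. This is a minimal sketch, assuming that naming pattern (taken from the question); the two tiny data frames are invented stand-ins for the real data:

```r
# Toy stand-ins for the real datasets; the values are invented for illustration.
e.1 <- data.frame(occupation = c("artist", "technician", "artist"),
                  rating = c(4, 3, 2))
e.2 <- data.frame(occupation = c("marketing", "artist"),
                  rating = c(5, 1))

# Build the object names "e.1", "e.2", ... and fetch them all at once.
# Use 1:10 here when all ten datasets are loaded.
dataset_names <- paste0("e.", 1:2)
dataset_list <- mget(dataset_names, envir = environment())

names(dataset_list)
```

mget returns a list named after the objects it fetched, so each summary produced later stays labelled with its dataset.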

Now we can run the same code on all of them, with lapply, and get a list back out:

filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
    ddply(subset(dataset,occupation %in% filtered_occupations), 
    .(occupation), summarise, mean_rating=mean(rating))} )


$d1
  occupation mean_rating
1     artist    3.540984
2  marketing    3.147208
3 technician    3.519512
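
The result is a list with one summary per dataset. To get the single table the question was building (dataset number, occupation, average rating), the summaries can be stacked with an extra column. This is a sketch under assumptions: base R's aggregate is swapped in for ddply so it runs without plyr, the column name interval follows the question, the helper summarise_one is hypothetical, and the toy data is invented:

```r
# Toy stand-ins for the real datasets; the values are invented for illustration.
dataset_list <- list(
  data.frame(occupation = c("artist", "artist", "technician", "lawyer"),
             rating = c(4, 2, 5, 1)),
  data.frame(occupation = c("marketing", "marketing", "artist"),
             rating = c(3, 5, 4))
)
filtered_occupations <- c("artist", "technician", "marketing")

# Hypothetical helper: summarise one dataset and tag the rows with its
# position in the list.
summarise_one <- function(dataset, interval) {
  kept <- dataset[dataset$occupation %in% filtered_occupations, ]
  means <- aggregate(rating ~ occupation, data = kept, FUN = mean)
  names(means)[names(means) == "rating"] <- "mean_rating"
  cbind(interval = interval, means)
}

# Map walks the datasets and their indices in parallel; rbind stacks the results.
average_rating_per_interval <- do.call(
  rbind,
  Map(summarise_one, dataset_list, seq_along(dataset_list))
)
average_rating_per_interval
```

Keeping the dataset number as a column, rather than as a list name, makes the combined table easy to filter or plot by interval later.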