gedehu - 5 months ago 28

R Question

I have ten datasets, each containing "rating" and "occupation" columns. For each of those ten datasets I want to find the average rating for three occupation groups (artist, technician, marketing).

The code I have written is as follows:

```
Average.Rating.per.Interval <- data.frame(interval=as.numeric(),
                                          occupation=as.character(),
                                          average.rating=as.numeric(),
                                          stringsAsFactors=FALSE)

## interval number refers to the dataset number (e.g. for 'e.1' it is 1, for 'e.2' it's 2)
Average.Rating.per.Interval <- as.matrix(Average.Rating.per.Interval)

e.1.artist <- e.1[which(e.1[,"occupation"]=='artist', arr.ind = TRUE),]
mean(e.1.artist$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(interval=1,occupation="artist",average.rating=mean(e.1.artist$rating)))

e.1.technician <- e.1[which(e.1[,"occupation"]=='technician', arr.ind = TRUE),]
mean(e.1.technician$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"technician",mean(e.1.technician$rating)))

e.1.marketing <- e.1[which(e.1[,"occupation"]=='marketing', arr.ind = TRUE),]
mean(e.1.marketing$rating)
Average.Rating.per.Interval <- rbind(Average.Rating.per.Interval,
                                     c(1,"marketing",mean(e.1.marketing$rating)))
```

This is clearly inefficient: to cover all ten datasets, I would have to repeat the same code nine more times to get the average rating for each occupation group. Is there a better way to do this? I read that apply/lapply might help, but I could not figure out how to use them in my case.

Two of my datasets (e1 and e2) can be found here. (I have only included 10% of the entire observations in each)

Answer

I recommend the "plyr" package for this kind of manipulation; it is well worth the investment of an hour or so to learn. In your case, I loaded your first example dataset in "d1", and I can summarise it like so:

```
library(plyr)  # provides ddply() and summarise()
ddply(d1, .(occupation), summarise, mean_rating=mean(rating))
```

This shows the results for *all* occupations, and you only wanted a specific three, so we can filter it to those:

```
ddply(subset(d1, occupation %in% c('artist','technician','marketing')),
      .(occupation), summarise, mean_rating=mean(rating))
```

Now we just need to generalize this to run over all 10 datasets without copying and pasting. Let's store our data frames inside a list:

```
dataset_list <- list(d1=d1) # you would put all of them here; I just have one
```
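If the ten data frames already exist as separate objects, the list does not have to be typed out by hand. The following is a minimal sketch in base R, assuming the objects are named `e.1`, `e.2`, and so on as in the question; the two small data frames here are made-up stand-ins, not the real data.

```r
# Hypothetical stand-ins for the real datasets (values are invented)
e.1 <- data.frame(occupation = c("artist", "technician"), rating = c(4, 3))
e.2 <- data.frame(occupation = c("artist", "marketing"),  rating = c(5, 2))

# Build the object names "e.1", "e.2", ... (use 1:10 for the full set)
dataset_names <- paste0("e.", 1:2)

# mget() looks up each name and returns a named list of the objects
dataset_list <- mget(dataset_names, envir = environment())

names(dataset_list)
```

The names of the list elements carry through to the output of `lapply`, so each result stays labelled with the dataset it came from.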

Now we can run the same code on all of them, with lapply, and get a list back out:

```
filtered_occupations <- c('artist','technician','marketing')
lapply(dataset_list, function(dataset) {
  ddply(subset(dataset, occupation %in% filtered_occupations),
        .(occupation), summarise, mean_rating=mean(rating))
})
```

Result:

```
$d1
  occupation mean_rating
1     artist    3.540984
2  marketing    3.147208
3 technician    3.519512
```
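The question asked for one table with "interval", "occupation" and "average.rating" columns rather than a list. One possible way to get that shape is sketched below in base R only, with `aggregate` swapped in for `ddply` so that no package is needed. The two data frames are invented toy data, and the interval number is assumed to be the position of the dataset in the list.

```r
# Toy stand-ins for the real datasets (hypothetical values)
e.1 <- data.frame(occupation = c("artist", "artist", "technician", "marketing", "student"),
                  rating     = c(4, 2, 5, 3, 1))
e.2 <- data.frame(occupation = c("artist", "technician", "technician", "marketing"),
                  rating     = c(5, 3, 1, 4))

dataset_list <- list(e.1, e.2)
filtered_occupations <- c("artist", "technician", "marketing")

per_interval <- lapply(seq_along(dataset_list), function(i) {
  dataset <- dataset_list[[i]]
  # Keep only the occupations of interest
  kept <- dataset[dataset$occupation %in% filtered_occupations, ]
  # Mean rating per occupation within this dataset
  means <- aggregate(rating ~ occupation, data = kept, FUN = mean)
  data.frame(interval = i,
             occupation = as.character(means$occupation),
             average.rating = means$rating,
             stringsAsFactors = FALSE)
})

# Stack the per-dataset results into one data frame
Average.Rating.per.Interval <- do.call(rbind, per_interval)
print(Average.Rating.per.Interval)
```

Unlike the matrix built with `rbind` and `c()` in the question, this keeps "average.rating" numeric instead of coercing every column to character.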