cumin cumin - 1 month ago 13
R Question

showing values having counts of another value

I am looking for outliers in data; which places are the worst?

place = rep(c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI'), times=4)
measure = rep(c('meas1','meas2','meas3','meas4'), each=11)
set.seed(200)
rating = sample(c('good','bad'), size = 44, prob=c(2,1), replace=T)
df = data.frame(place, measure, rating)


This takes multiple steps now; first finding the places with any bad ratings and putting that result into a table, to tell me CA is the worst, with 3 "bads":

bads = df %>%
filter(rating == 'bad')

sort(table(bads$place), decreasing = T)
> bads
place measure rating
1 AR meas1 bad
2 CA meas1 bad
3 CO meas1 bad
4 CT meas1 bad
5 DE meas2 bad
6 AK meas3 bad
7 CA meas3 bad
8 AK meas4 bad
9 CA meas4 bad
10 FL meas4 bad
11 GA meas4 bad

> sort(table(bads$place), decreasing = T)

CA AK AR CO CT DE FL GA AL AZ HI
3 2 1 1 1 1 1 1 0 0 0


[Please disregard the folowing...]
Also, if the rating were numerical instead of categorical, is there an analogous way to get the places which fall into the highest/lowest quartile (not sure if that would be a separate question)?

If another post answers this, please point to it; I have looked at quite a few posts about counting and aggregation so far.

Answer Source

The data.table package is pretty handy for this kind of thing:

> data.table(df)[rating=='bad', .N,by="place"][order(-N)]
   place N
1:    CA 3
2:    AK 2
3:    AR 1
4:    CO 1
5:    CT 1
6:    DE 1
7:    FL 1
8:    GA 1