SPZ SPZ - 3 months ago 18
R Question

R subset rows for all occurrences of first variable in column

I would like to create a subset of a large data frame based on two columns, ColA and ColB (below). For each variable in ColA (e.g. A, B, C...), I would like to extract the rows corresponding to each occurrence of the first variable in ColB. Therefore DF1:

ColA ColB ColC
A Red 7thing
A Red OneBot
A Blue BotOne
B Green Thing7
B Green Twosies
B Green Square
B Yellow Circle
B Yellow Polygon
B Purple Triangle
B White Octagon
C Orange Cube
C Black Line


Would become DF2:

ColA ColB ColC
A Red 7thing
A Red OneBot
B Green Thing7
B Green Twosies
B Green Square
C Orange Cube


I am not concerned about duplicates in ColC, and there is no pattern to the number of occurrences of a variable in ColB. The related subsetting strategies I have found focus on extracting unique cases, but I have neither come across nor been able to devise an approach to permit selecting all occurrences of the first variable type and would be grateful for some assistance.

I have tried

DF2 <- DF1[match(unique(DF1$ColB), DF1$ColB),]


...and subsetting that targets the variable when the variable content is known (e.g. Subset first n occurrences of certain value in dataframe in R), these are not appropriate seemingly.

Answer

The power of ave in base R, to compare all ColB cases to the first ColB in each group:

dat[with(dat, ColB == ave(ColB, ColA, FUN=function(x) head(x,1) )),]

#   ColA   ColB    ColC
#1     A    Red  7thing
#2     A    Red  OneBot
#4     B  Green  Thing7
#5     B  Green Twosies
#6     B  Green  Square
#11    C Orange    Cube

Using your original logic, you could also merge back on the non-duplicated records of ColA/ColB only:

merge(dat, dat[c("ColA","ColB")][!duplicated(dat$ColA),])