Alexa Alexa - 2 months ago 6
R Question

Subsetting dataframe by multiple row and column matches using dplyr

I've searched for an answer to this question for awhile and haven't found this exact test case. Please accept my apologies if it's been solved elsewhere.

I have a large dataframe

data
of biological observations (~2,500,000 rows, ~50 columns) from a government survey program. The important columns are
reg
(region),
species
, and
year
(of the survey).

I would like to subset the dataframe to contain only the species that were found at least once per year, in each region. (I also need to do this for species found at least 10 times per year, but I assume the solution is the same.)

In other words, I need to filter out only the rows for which a
species
is present in every year that a
region
is also present - essentially, all the rows where
unique(year)
for each
species
matches
unique(year)
for each
reg
. (There are only nine regions, so I could certainly subset the data into nine dataframes and repeat this filtering nine times over, but I'm hoping there is a more elegant solution using only
dplyr
that I can apply directly to
data
.)

Thank you!

EDIT: I created a mock dataset as suggested by @aichao.

> year = c(1999,2000,2000,2004,2004,2008)
> reg = c('ai','ci','ci','ai','ai','ci')
> species = c('blueei','greenei','yellowei','blueei','greenei','yellowei')
> df <- data.frame(year, reg, species)
> df
year reg species
1 1999 ai blueei
2 2000 ci greenei
3 2000 ci yellowei
4 2004 ai blueei
5 2004 ai greenei
6 2008 ci yellowei


So in this df, I want to keep only the species that were found every time the region was surveyed (let's pretend this is the complete data). From
ai
, I want to keep
blueei
- which was found in every survey year - but not
greenei
. Similarly, in
ci
, I want to keep
yellowei
but not
greenei
, since
greenei
did not turn up in 2008.

Answer

to keep only the species that were found every time the region was surveyed

We can do the following using dplyr:

library(dplyr)
result <- df %>% group_by(reg) %>% mutate(num.years.in.reg = length(unique(year))) %>% 
                 group_by(reg,species) %>% filter(length(unique(year)) == first(num.years.in.reg)) %>%
                 select(-num.years.in.reg) %>% arrange(reg)

Notes:

  1. First, group_by reg and create a column num.years.in.reg with the number of unique years that region is surveyed.
  2. Then, group_by both reg and species and keep only those species for which the number of unique years that the species is surveyed in the region is equal to the number of unique years that region is surveyed. Here, first just retrieves the first value from num.years.in.reg since they will be the same for all rows in the group.
  3. Remove the num.years.in.reg column and sort the result by reg.
  4. Note that in the filtering logic, we do not need to compare the unique year values, just the number of unique years, because the second grouping is a subset of the first grouping. That is, the unique years will be the same but there may be less of them in the second grouping than the first.

With the data you posted, the result is:

print(result)
##Source: local data frame [4 x 3]
##Groups: reg, species [2]
##
##   year   reg  species
##  <dbl> <chr>    <chr>
##1  1999    ai   blueei
##2  2004    ai   blueei
##3  2000    ci yellowei
##4  2008    ci yellowei