Rich Pauloo Rich Pauloo - 1 month ago 13
R Question

Filter, Subset, or Select Repeated IDs for Different Time Entries in a Dataframe

I have some timeseries data with values for some time indices but not others. I need a way to filter for all observations that occur in both time indices.

Here's a reproducible example that illustrates my problem. In the final graph I only want observations of

Type == a
, which occur in both time indices.

set.seed(1005)
mydat <- data.frame(
ID = c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a5', 'a12', 'a13'),
Year = c(2000, 2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2001),
Result = rnorm(10, mean = 20, sd = 10),
Type = c('a','a','b','b','a', 'a', 'a', 'a', 'b', 'b'))

mydat %>%
ggplot(aes(x = Year, y = Result)) +
geom_point(aes(color = Type)) +
geom_line(aes(group = ID))


enter image description here

Note: I should also mention that the column
Type
does not exist in the original dataset. I created this toy dataset with the
Type
column to show the points that I want to get rid of in blue.

Solutions should be independent of the
Type
column, or alternatively, show how to generate the
Type
column without hard-coding it.

Answer Source

You can find the, okay let's called them repeated, repeated IDs for two time entries and mark them as type == a.

Using reshape:

You can reshape the data to wide format and remove the ones with NA which means they don't have data for both time entries. Look below:

mydat_a <- reshape(mydat, idvar = "ID", timevar = "Year", direction = "wide")

mydat_a #Those with NA are the ones that you set them as Type == b


#     ID Result.2000 Result.2001 
# 1   a1    14.39524   37.150650 
# 2   a2    17.69823   24.609162 
# 3   a3    35.58708          NA 
# 4   a4    20.70508          NA 
# 5   a5    21.29288    7.349388 
# 9  a12          NA   13.131471 
# 10 a13          NA   15.543380

#Add the types again
mydat_a$Type <- "a"
mydat_a[which(is.na(mydat_a), arr.ind=TRUE)[,1],]$Type <- "b"

#go back to long format
mydat_a <- reshape(mydat_a, direction="long", 
                   varying=list(names(mydat_a)[2:3]), v.names="Result", 
                   idvar="ID", timevar="Year", times=2000:2001)

 #remove NA
 mydat_a <- na.omit(mydat_a)

You can look for the final plotting solution below (use mydat_a instead of mydat in ggplot syntax).

Or...

mydat$Type <- "b" #make all of them "b" later change the repeated ones to "a"
mydat[  mydat$ID %in% mydat[mydat$Year==2000,]$ID
      & mydat$ID %in% mydat[mydat$Year==2001,]$ID,]$Type <- "a"
mydat$Type <- as.factor(mydat$Type)


mydat


#     ID Year   Result type 
# 1   a1 2000 17.67485    a 
# 2   a2 2000 15.16812    a 
# 3   a3 2000 27.18261    b 
# 4   a4 2000 14.18510    b 
# 5   a5 2000 32.91164    a 
# 6   a1 2001 13.30867    a 
# 7   a2 2001 20.15258    a 
# 8   a5 2001 31.21311    a 
# 9  a12 2001 32.62673    b 
# 10 a13 2001  6.85111    b

It gives you the types that you entered manually here.

Then you can use @d.b's solution:

ggplot(data = split(mydat, mydat$Type)$a, aes(x = Year, y = Result)) + 
       geom_point(aes(color = Type)) + geom_line(aes(group = ID))

enter image description here

Data:

set.seed(123)
mydat <- data.frame(ID = c('a1', 'a2', 'a3', 'a4', 'a5', 'a1', 'a2', 'a5', 'a12', 'a13'),
                    Year = c(2000, 2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2001),
                    Result = rnorm(10, mean = 20, sd = 10))