fabiusw fabiusw - 3 months ago 7
R Question

Find pairs of rows with identical values in different columns

I'm trying to subset some data but got stock at this part. My data looks like this:

structure(list(sym_id = structure(c(1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 4L, 5L, 5L), .Label = c("AOL.HH", "ARCH.GA", "ARCH.GK",
"T.GJ", "T.GK"), class = "factor"), comp = structure(c(1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("AOL", "ARCH",
"T"), class = "factor"), seq_nb = c(18327L, 9952L, 39808L,
56601L, 44974L, 55302L, 20023L, 24403L, 15529L, 46202L, 57269L
), orig_seq_nb = c(81261L, 72161L, 9952L,
1276L, 98216L, 16423L, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), .Names = c("bond_sym_id",
"company_symbol", "seq_nb", "orig_seq_nb"), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))


I'm looking for a code that would give me back rows which have identical values in different columns but also identical values in another.
The output should give me back

Row1 ARCH.GA ARCH 9952 72161

Row2 ARCH.GA ARCH 39808 9952


As you can see, the columns "sym_ID" and "comp" are equal for my desired output and the values in "seq_nb" and "orig_seq_nb" match.

Appreciate your help!

Answer

We subset the dataset with 3rd and 4th columns, loop through the rows, order, get the 1st element, cbind with the first two columns, use duplicated to find the logical index of duplicate elements and this can be used for subsetting the rows of 'df1'.

d2 <- cbind(df1[1:2], apply(df1[3:4],1, function(x) x[order(x)][1]))
df1[duplicated(d2)|duplicated(d2, fromLast=TRUE),]
#  bond_sym_id company_symbol seq_nb orig_seq_nb
#      <fctr>         <fctr>  <int>       <int>
#1     ARCH.GA           ARCH   9952       72161
#2     ARCH.GA           ARCH  39808        9952
Comments