E B E B - 2 months ago 20
R Question

in R find duplicates by column 1 and filter by not NA column 3

I have a dataframe:

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(1,NA,2,4,NA,1,2,2)
df <-data.frame(a,b,c)


I have a dataframe with some duplicate variables in column 1 but when I use the duplicated function, it randomly chooses the row after de-duping using duplicate(function)

dedup_df = df[!duplicated(df$a), ]


How can I ensure that the output returns me the row that does not contain an NA on column c ?

I tried to use the dplyr package but the output prints only a result

library(dplyr)
options(dplyr.print_max = Inf )
df %>% ## source dataframe
group_by(a) %>% ## grouped by variable
filter(!is.na(c) ) %>% ## filter by Gross value
as.data.frame(dedup_df)

Answer

Your use of duplicated function to remove duplicate observations (lines) using a column as key from a data frame is correct.

But it seems that you are worried that it may keep a line that contains NA in another column and drop another line that contains a non NA value.

I'll use you example, but with a slight modification

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c(NA,1,2,4,NA,1,2,2)
df <-data.frame(a,b,c)

> df
  a b  c
1 A 1 NA
2 A 1  1
3 A 2  2
4 B 4  4
5 B 1 NA
6 B 1  1
7 C 2  2
8 C 2  2

In this case, your dedup_df contains an NA for the first value.

> dedup_df = df[!duplicated(df$a), ]
> dedup_df
  a b  c
1 A 1 NA
4 B 4  4
7 C 2  2

Solution:

Reorder df by column c first and then use the same command. This reordering by column c will send all NAs to the end of the data frame. When the duplicated passes it will see these lines having NA last and will tag them as TRUE if there was a previous one without NA.

df = df[order(df$c),]
dedup_df = df[!duplicated(df$a), ]

> dedup_df
  a b c
2 A 1 1
6 B 1 1
7 C 2 2

You can also reorder in descending order

df = df[order(df$c,decreasing = T),]
dedup_df = df[!duplicated(df$a), ]

> dedup_df
  a b c
4 B 4 4
3 A 2 2
7 C 2 2