user5359531 user5359531 - 2 months ago 13
R Question

dplyr update original data frame with new values after subset

I am trying to change the value of entries in a dataframe column based on a series of conditions. I need to change the 'group' values of the top (or bottom) 10 entries of a certain type.

My data is in a dataframe that looks like this:

> head(diff_df_min)
external_gene_name gene_biotype Fold p.value group
1 RP11-431K24.1 lincRNA -4.13 4.86e-06 signif_fc
2 UBE4B protein_coding 2.42 3.91e-06 signif_fc
3 UBIAD1 protein_coding 2.74 5.58e-05 signif_fc
4 PTCHD2 protein_coding 3.37 2.68e-06 signif_fc
5 DRAXIN protein_coding 3.04 1.42e-06 signif_fc
6 VPS13D protein_coding 4.26 1.60e-07 signif_fc

> dim(diff_df_min)
[1] 1824 5


I have figured out this solution with
dplyr
:

diff_df_min %>%
filter(gene_biotype == "protein_coding") %>% # subset for protein coding genes
arrange(-Fold, p.value) %>% # Sort by Fold change, then by p value
slice(1:10) %>% # take the top 10 entries...
mutate(group = "top_signif_fc") # ... and change the "group" column value to "top_signif_fc"


This gives the exact results I want:

external_gene_name gene_biotype Fold p.value group
1 CROCC protein_coding 5.46 3.44e-14 top_signif_fc
2 KCNA2 protein_coding 5.43 2.08e-11 top_signif_fc
3 PITPNC1 protein_coding 5.32 8.16e-11 top_signif_fc
4 RRP8 protein_coding 5.31 1.01e-10 top_signif_fc
5 HEPACAM protein_coding 5.27 1.26e-10 top_signif_fc
6 SGK223 protein_coding 5.14 3.45e-15 top_signif_fc
7 DDX3Y protein_coding 5.03 1.82e-09 top_signif_fc
8 ARHGAP10 protein_coding 4.99 2.83e-09 top_signif_fc
9 RNF180 protein_coding 4.98 3.19e-09 top_signif_fc
10 CSPG5 protein_coding 4.97 9.92e-12 top_signif_fc


Except this is not updating these values in the original dataframe, it is only showing the results after applying the functions. Similarly, I have tried to do the same in
data.table
and figured out this method:

setDT(diff_df_min,key = "external_gene_name")
diff_df_min[gene_biotype == "protein_coding"][order(-Fold, p.value), head(.SD, 10)][,group := "top_signif_fc"]


But again this only RETURNS the results, it does not update the original dataframe.

external_gene_name gene_biotype Fold p.value group
1: CROCC protein_coding 5.46 3.44e-14 top_signif_fc
2: KCNA2 protein_coding 5.43 2.08e-11 top_signif_fc
3: PITPNC1 protein_coding 5.32 8.16e-11 top_signif_fc
4: RRP8 protein_coding 5.31 1.01e-10 top_signif_fc
5: HEPACAM protein_coding 5.27 1.26e-10 top_signif_fc
6: SGK223 protein_coding 5.14 3.45e-15 top_signif_fc
7: DDX3Y protein_coding 5.03 1.82e-09 top_signif_fc
8: ARHGAP10 protein_coding 4.99 2.83e-09 top_signif_fc
9: RNF180 protein_coding 4.98 3.19e-09 top_signif_fc
10: CSPG5 protein_coding 4.97 9.92e-12 top_signif_fc


You can see this when you check the values in the data frame after running any of these commands (or run a subset of the commands again):

> diff_df_min[which(diff_df_min['external_gene_name'] == "CROCC"),]
external_gene_name gene_biotype Fold p.value group
372 CROCC protein_coding 5.46 3.44e-14 signif_fc


And of course, if you try to use either of the methods like this:

diff_df_min <- ...


You end up overwriting the original dataframe with only the 10 lines that were selected with
dplyr
or
data.table
.

I had previously been doing similar things in base R, but could not get this case to work. I tried it, and ended up with this, which is ridiculous and does not work correctly:

diff_df_min[with(diff_df_min[which(diff_df_min['gene_biotype'] == "protein_coding"),], order(-Fold, p.value) ),"group"][1:top_gene_number] <- "top_signif_fc"


^^ Somewhere along the way, the indexes get messed up and so the entries that are eventually changed are not the intended entries.

I have read dozens and dozens of pages about this so far, including many tutorials and even this but so far I have been unable to find anything that actually gives a solution for this. I don't want to simply print out a modified dataframe, I want to update the original dataframe entries with the new entries.

Answer

We can use an ifelse statement to make the changes instead of slice to subset it and also replace the filter (that removes the rows) with arrangeing based on the "protein_coding" as well, and also assign the the output back to the original dataset or to a new

diff_df_minNew <- diff_df_min %>%
                     arrange(desc(gene_biotype == "protein_coding"), 
                                desc(Fold), p.value) %>% 
                     mutate(group = ifelse(row_number() < 11, "top_signif_fc", group))

A corresponding option using data.table would be

library(data.table)
diff_df_minNew2 <- setDT(diff_df_min)[order(-(gene_biotype=="protein_coding"),
      -Fold, p.value)][seq_len(10), group := "top_signif_fc"][]