user2300940 - 1 month ago
R Question

replace and remove part of string in rownames

I want to remove part of each row name in my data frame. Everything that does not match one of the strings in the grepl pattern below should be removed, so that only the matching string (lncRNA, snRNA, snoRNA or precursor_RNA) is left. Does anyone know how to do this?

df[grepl(".*lncRNA.*|.*snRNA.*|.*snoRNA.*|.*precursor_RNA.*", rownames(df))] <- c("lncRNA","snRNA","snoRNA","precursor_RNA")



head(rownames(df))

[3208] "URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT"
[3209] "URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA"
[3210] "URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT"
[3211] "URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG"
[3212] "URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC"
[3213] "URS000075B2ED-lncRNA_CACTCAGGACCCACC"


Desired output:

[3208] "snoRNA"
[3209] "snRNA"
[3210] "snRNA"
[3211] "lncRNA"
[3212] "precursor_RNA"
[3213] "lncRNA"

Answer

We can use gsub to match one or more characters that are not a - ([^-]+) from the start (^) of the string followed by a -, or (|) an underscore followed by one or more characters that are not an underscore (_[^_]+) until the end of the string ($), and replace each match with an empty string ("").

gsub("^[^-]+-|_[^_]+$", "", v1)
#[1] "snoRNA"        "snRNA"         "snRNA"         "lncRNA"       
#[5] "precursor_RNA" "lncRNA"  
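To see what each half of the pattern contributes, the two alternatives can be applied one after the other. This is only an illustrative sketch: the vector x holds shortened copies of two row names from the question, and the names x, step1 and step2 are made up for the example.

```r
# Shortened copies of two row names from the question (illustration only)
x <- c("URS000075AF9C-snoRNA_GTATGTGTGG",
       "URS000075B261-precursor_RNA_CTTTCTATGC")

# Step 1: drop everything up to and including the first "-"
step1 <- sub("^[^-]+-", "", x)
# step1 is now c("snoRNA_GTATGTGTGG", "precursor_RNA_CTTTCTATGC")

# Step 2: drop the last "_" and everything after it
step2 <- sub("_[^_]+$", "", step1)
# step2 is now c("snoRNA", "precursor_RNA")

# The single gsub call with both alternatives gives the same result
identical(step2, gsub("^[^-]+-|_[^_]+$", "", x))
```

Because [^_]+$ only matches the final underscore-free run, the underscore inside precursor_RNA is left alone and only the trailing sequence is removed.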

If we are doing this on the row names of df (this returns a new character vector and leaves df unchanged):

gsub("^[^-]+-|_[^_]+$", "", rownames(df))
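One caveat when writing the result back: row names of a data frame must be unique, and the cleaned values repeat (snRNA and lncRNA each occur more than once in the example), so a plain rownames(df) <- ... assignment would fail. Below is a rough sketch of two possible workarounds. The data frame, its count column and the names rna_type and df2 are hypothetical and only there to make the example runnable.

```r
# Hypothetical data frame; the 'count' values are made up for illustration
df <- data.frame(
  count = c(5, 3, 8),
  row.names = c("URS000075B029-snRNA_AACTCTGAGT",
                "URS000075B029-snRNA_ATTTCCGTGG",
                "URS000075B0E3-lncRNA_GTAAGGGGCA")
)

rna_type <- gsub("^[^-]+-|_[^_]+$", "", rownames(df))

# Option 1: keep the original row names and store the type in a new column
df$rna_type <- rna_type

# Option 2: overwrite the row names, de-duplicating them with make.unique()
df2 <- df
rownames(df2) <- make.unique(rna_type)
rownames(df2)
# "snRNA" "snRNA.1" "lncRNA"
```

Option 1 is usually the safer choice, since it keeps the original identifiers and lets you group or filter on the new column.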

data

v1 <- c("URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT",
        "URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA",
        "URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT",
        "URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG",
        "URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC",
        "URS000075B2ED-lncRNA_CACTCAGGACCCACC")
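
An alternative that stays closer to the question's idea of listing the RNA types explicitly is to extract the first matching type with regexpr and regmatches instead of deleting what surrounds it. This is a sketch under the assumption that every name contains exactly one of the listed types. The names x, types, pattern and extracted are illustrative, and x holds shortened copies of the row names.

```r
# Shortened copies of row names from the question (illustration only)
x <- c("URS000075AF9C-snoRNA_GTATGTGTGG",
       "URS000075B029-snRNA_AACTCTGAGT",
       "URS000075B0E3-lncRNA_GTAAGGGGCA",
       "URS000075B261-precursor_RNA_CTTTCTATGC")

# The RNA types listed in the question, joined into one alternation
types <- c("lncRNA", "snRNA", "snoRNA", "precursor_RNA")
pattern <- paste(types, collapse = "|")

# regexpr finds the first match in each string; regmatches extracts it.
# Note: elements without any match are dropped from the result.
extracted <- regmatches(x, regexpr(pattern, x))
extracted
# "snoRNA" "snRNA" "lncRNA" "precursor_RNA"
```

Unlike the gsub approach, this does not depend on the position of the - and _ separators, but it silently drops names that contain none of the listed types, so checking length(extracted) == length(x) afterwards is a sensible guard.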