Mr.Awan - 1 month ago
R Question

Extracting the package name from fully qualified class names using R

I have the following kind of data set (ds1) in a CSV file; it contains class names and their corresponding fault counts. Using an R script, I want to keep only the rows where the number of faults equals 2 and extract the package name from each class name.

Class Faults

org.apache.tools.ant.taskdefs.Definer 2
org.apache.tools.ant.taskdefs.Definer 2
org.apache.tools.ant.taskdefs.Delete 1
org.apache.tools.ant.taskdefs.Deltree 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.Ear 2
org.apache.tools.ant.taskdefs.Ear 2
org.apache.tools.ant.taskdefs.Echo 1
org.apache.tools.ant.Exec 2
org.apache.tools.ant.Exec 2


I have written the following code, but it does not produce the desired output:

dschanged<- subset(ds1, grep( "/^([^\\.]+)/", class) & Faults==2 )


Specifically, I need a regular expression that keeps the string before the last dot (.) so that I get the following output:

org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant 2
org.apache.tools.ant 2

Answer

grep (and grepl) are inappropriate for this: you aren't filtering based on textual content. You are (a) filtering based on Faults, and (b) changing the text in Class.
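To illustrate the distinction, here is a minimal sketch on a two-element vector (the vector x is made up for the example, not taken from your CSV). grepl only reports whether each element matches, while sub returns modified text:

```r
# Hypothetical two-element sample, used only for illustration
x <- c("org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.Exec")

# grepl answers "does this element match?" and returns a logical vector
matched <- grepl("\\.", x)
print(matched)

# sub rewrites the text: here it drops the last dot and everything after it
pkg <- sub("\\.[^.]*$", "", x)
print(pkg)
```

So grepl is the right kind of tool for choosing rows by their text, and sub/gsub for rewriting the text itself.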

Your data:

ds1 <- structure(list(Class = c("org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Delete", "org.apache.tools.ant.taskdefs.Deltree", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Echo", "org.apache.tools.ant.Exec", "org.apache.tools.ant.Exec"),
                      Faults = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L)),
                 .Names = c("Class", "Faults"), class = "data.frame", row.names = c(NA, -12L))

Filter on Faults (you already had this). You only need one of these two commands; they both give the same result here. The main differences are readability (personal preference) and performance (the second one, in this case, takes about 35% less time, though since both are measured in microseconds, it seems silly to compete).

ds2 <- subset(ds1, Faults == 2)
ds2 <- ds1[ds1$Faults == 2,]
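One caveat that does not affect your data but may matter on other files: the two forms differ when Faults contains NA. The sketch below uses a made-up three-row data frame (d is hypothetical) to show it; which() is one common way to make the indexing form drop NA rows as well.

```r
# Hypothetical data with a missing fault count, for illustration only
d <- data.frame(Class  = c("a.B", "a.C", "a.D"),
                Faults = c(2L, NA, 1L),
                stringsAsFactors = FALSE)

by_subset <- subset(d, Faults == 2)      # NA in the condition is treated as FALSE
by_index  <- d[d$Faults == 2, ]          # NA in the condition yields an all-NA row
by_which  <- d[which(d$Faults == 2), ]   # which() keeps only the TRUE positions

print(nrow(by_subset))
print(nrow(by_index))
print(nrow(by_which))
```

If your CSV can contain blank fault counts, subset or the which() form is probably the safer choice.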

Update Class to remove the last word (and dot):

ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class)
ds2
#                            Class Faults
# 1  org.apache.tools.ant.taskdefs      2
# 2  org.apache.tools.ant.taskdefs      2
# 4  org.apache.tools.ant.taskdefs      2
# 5  org.apache.tools.ant.taskdefs      2
# 6  org.apache.tools.ant.taskdefs      2
# 7  org.apache.tools.ant.taskdefs      2
# 8  org.apache.tools.ant.taskdefs      2
# 9  org.apache.tools.ant.taskdefs      2
# 11          org.apache.tools.ant      2
# 12          org.apache.tools.ant      2
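For reference, the pattern reads as: a literal dot (\\.), followed by any run of non-dot characters ([^.]*), anchored at the end of the string ($). A small sketch of how it behaves, wrapped in a helper whose name (strip_last) is purely illustrative:

```r
# strip_last is a hypothetical helper name, not part of any package
strip_last <- function(x) sub("\\.[^.]*$", "", x)

with_pkg    <- strip_last("org.apache.tools.ant.Exec")
without_pkg <- strip_last("Exec")   # no dot, so nothing matches and the input is returned as-is

print(with_pkg)
print(without_pkg)
```

Note that a class name with no package is returned unchanged rather than becoming an empty string, so you may want to handle that case separately if it can occur in your data.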

Note: this can also be done with sub instead of gsub. I reach for gsub by default because most of my uses involve patterns that repeat within a string. The major (only?) difference between the two is that:

'sub' and 'gsub' perform replacement of the first and all matches respectively

(from ?sub).
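A quick sketch of that difference, using one class name as a sample string. With an unanchored pattern the two functions diverge; with the end-anchored pattern used above there is at most one match per string, so they agree:

```r
x <- "org.apache.tools.ant.Exec"

first_only <- sub("\\.", "/", x)    # replaces only the first dot
all_dots   <- gsub("\\.", "/", x)   # replaces every dot

# The pattern ending in $ can match at most once, so sub and gsub give the same result
same_result <- identical(sub("\\.[^.]*$", "", x), gsub("\\.[^.]*$", "", x))

print(first_only)
print(all_dots)
print(same_result)
```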

I know of no tool that does both the filtering and changing in a single command (though perhaps data.table does, I don't know).
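That said, the two steps can be written as one nested base-R expression, which may be close enough to "a single command" for some purposes. This is only a sketch on a made-up four-row data frame (small is hypothetical); it is still two function calls, just combined:

```r
# Hypothetical cut-down version of the data, for illustration only
small <- data.frame(Class  = c("org.apache.tools.ant.taskdefs.Definer",
                               "org.apache.tools.ant.taskdefs.Delete",
                               "org.apache.tools.ant.taskdefs.Ear",
                               "org.apache.tools.ant.Exec"),
                    Faults = c(2L, 1L, 2L, 2L),
                    stringsAsFactors = FALSE)

# subset() filters the rows, transform() then rewrites Class on the result
res <- transform(subset(small, Faults == 2),
                 Class = sub("\\.[^.]*$", "", Class))
print(res)
```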

Similar to @egnha's solution (that uses magrittr), here's one using dplyr, which many people allege is very easy to read and adapt (at the potential cost of performance):

library(dplyr)
ds2 <- ds1 %>%
  filter(Faults == 2) %>%
  mutate(Class = gsub("\\.[^.]*$", "", Class))

Since I mentioned performance, here's a comparison:

library(microbenchmark)
microbenchmark(indexing = { ds2 <- ds1[ds1$Faults == 2,]; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
               subset   = { ds2 <- subset(ds1, Faults == 2) ; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
               dplyr    = { ds1 %>% filter(Faults == 2) %>% mutate(Class = gsub("\\.[^.]*$", "", Class)) })
# Unit: microseconds
#      expr      min        lq      mean    median        uq      max neval
#  indexing   71.841   87.7045  109.4496  104.2975  120.7075  269.493   100
#    subset  102.473  115.6020  147.0108  139.1230  165.5620  287.726   100
#     dplyr 1067.030 1156.3745 1323.1174 1225.4805 1351.2920 4270.308   100

For the record, dplyr used in this way is not usually this slow compared with the other methods. It is rarely faster, but it is seldom an order of magnitude slower; with only 12 rows, its fixed per-call overhead likely dominates the timing.
