pjbudden pjbudden - 3 months ago 8
R Question

Subset specific text from a free-text character variable in R when the position is random

I have a dataframe which looks a little like

uniqueID <- sample(1:10, 3, replace=F)
faultDetails <- c("fault is blah blah blah and serial is F1AB3CDE123", "Serial No: f1ds6hrc813, fault is blah blah blah", "fault: blah blah blah serial number: F2GE8FDW159")
myData <- cbind.data.frame(uniqueID, faultDetails)


I need to extract the serial number from each entry and store in a new variable, however the position within the field is not fixed. The only constant is that the serial number is of the format F, number, letter, letter, number, letter, letter, letter, number, number, number. Also, the case of the text changes from entry to entry (oh the joys of free-text!).

I'd like to do this as part of my dplyr pipeline. Any help is appreciated.

Answer

The case isn't much of a concern with the right regular expression. I would approach it like this:

library(dplyr)
library(stringr)

uniqueID <- sample(1:10, 3, replace=F)
faultDetails <- c("fault is blah blah blah and serial is F1AB3CDE123", "Serial No: f1ds6hrc813, fault is blah blah blah", "fault: blah blah blah serial number: F2GE8FDW159")
myData <- cbind.data.frame(uniqueID, faultDetails) %>%
  mutate(serial_number = str_extract(faultDetails,
                                     "(F|f)\\d{1}\\w{2}\\d{1}\\w{3}\\d{3}"))

myData

  uniqueID                                      faultDetails serial_number
1        7 fault is blah blah blah and serial is F1AB3CDE123   F1AB3CDE123
2        9   Serial No: f1ds6hrc813, fault is blah blah blah   f1ds6hrc813
3        3  fault: blah blah blah serial number: F2GE8FDW159   F2GE8FDW159

The \w meta-character looks for "word characters"; here it serves to look for letters (although it will also detect underscores). The \w{2} notation means I want a sequence of two such characters. The \d meta-character looks for digits, with the number in braces being the length of sequence to look for.

Starting the regular expression with (F|f) means it may start with either F or f.