Shalini Baranwal Shalini Baranwal - 4 months ago 8
R Question

Regex for extracting all words between word and character

i know basic of regex performing with R. But here i have a file like :


**[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981

[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767**


I wanted to extract timestamp alongwith all the SERVICE_ID in that line.

So, my expected output is:


[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981

[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134


The code which I tried was only extracting one SERVICE_ID.

library(qdapRegex)

a <- readLines("C:\\MY_FOLDER\\vinita\\sample.txt")

testi <- rm_between(a,"SERVICE_ID",",",extract = T)

Answer

We replace the 2 or more , with " " to get 'str2', then using regex lookarounds, we match one or more space (\\s+) that follows the ]) followed by characters (.*) till the end of the string, replace it with "" so that we can extract the [2016-04..,03] part. From the 'str2', we extract the substrings "SERVICE_ID=" followed by numbers (\\d+) into a list, paste them together and finally paste it with the 'str3'.

library(stringr)
str2 <- gsub(",{2,}", " ", str1)
str3 <- sub("(?<=\\])\\s+.*", "", str2, perl = TRUE)
paste(str3, sapply(str_extract_all(str2, "SERVICE_ID=\\d+"), paste, collapse=" "))
#[1] "[2016-04-28 14:00:06,603] SERVICE_ID=441 SERVICE_ID=541 SERVICE_ID=9981"
#[2] "[2016-04-28 14:00:06,608] SERVICE_ID=00234 SERVICE_ID=11134" 

data

 str1 <- c("[2016-04-28 14:00:06,603],,,,,SERVICE_ID=441,DEBUG,DBSEntryServlet,DBSEntryServlet: delegateToRequestManager:: SERVICE_ID=541,SERVICE_ID=9981",
"[2016-04-28 14:00:06,608],,,,,,DEBUG,DBSEntryServlet,10.91.39.143:60801 SERVICE_ID=00234,SERVICE_ID=11134,IMD=6767")