haimen - 2 months ago
R Question

Parsing out particular text in a big text column in a Dataframe - R

Suppose I have the following data,



For example, I want to extract the text between abc/ and the next &. More precisely, I want the text between the first occurrence of abc/ and the first occurrence of & that follows it.

My output should be as follows,


text                               parsed_out
abc/1234&                          1234
qwertyabc/5555&                    5555
a&sdfghabc/ppp&plksa&              ppp
z&xabc/lkjh&poiuw&                 lkjh
lkjqwefasrjabc/855698&plkjdhweb    855698

The following is my attempt:

data1 = within(data, FOO<-data.frame(do.call('rbind', strsplit(as.character(text), 'abc/', fixed=TRUE))))

data2 = within(data1, FOO1<-data.frame(do.call('rbind', strsplit(as.character(FOO$X1), '&', fixed=TRUE))))

This uses too much memory because the text file has 8 million rows, and data2 ends up with many columns because each string contains several '&' characters. Can anybody help me extract the text between these two delimiters into a single column, in a memory-efficient way?

x = "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"

Here, the function should look for http://google.com/ and extract everything up to the first &. The output should be needstobeparsedout.

new_x = "\"http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&\""

Why is it not working with this link?



I actually wanted to parse out a few parts of the URL. For example, I want the text between "http://www.google.com/" and the first occurrence of "&".


sub(".*?https?://(?:www\\.)?google\\.com/([^&]+).*", "\\1", x)


The pattern matches:

  • (optionally add a ^ in front to match the start of string position)
  • .*? - 0+ chars, as few as possible, from the start up to the first occurrence of what follows
  • https?:// - either https:// or http://, followed by
  • (?:www\\.)? - 1 or 0 occurrences (optional) of the sequence www.
  • google\\.com/ - the literal text google.com/
  • ([^&]+) - 1 or more chars other than & (Capture group 1)
  • .* - any 0+ chars (up to the end of string).
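The pieces listed above can be checked against the two strings from the question. This is a minimal sketch rather than part of the original answer: it reuses the same pattern, adds `perl = TRUE` (an assumption made here so that the lazy and greedy quantifiers follow PCRE semantics), and writes the second string with its closing quote escaped.

```r
# Same pattern as in the answer; perl = TRUE is an added assumption (PCRE engine).
pattern <- ".*?https?://(?:www\\.)?google\\.com/([^&]+).*"

x <- "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"
new_x <- "\"http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&\""

# The whole string is matched and replaced by capture group 1.
res_x   <- sub(pattern, "\\1", x, perl = TRUE)
res_new <- sub(pattern, "\\1", new_x, perl = TRUE)

print(res_x)    # "needstobeparsedout"
print(res_new)  # "search?q=erykah+badu+with+hiatus+kaiyote,+august+3"
```

The optional `(?:www\\.)?` group is what lets the second URL match: a pattern that expects `google.com/` directly after `http://` cannot match `http://www.google.com/`.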

In the replacement pattern, \1 (written "\\1" in an R string literal) refers to the text captured in Group 1.
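The same idea carries over to the abc/ ... & case from the question. The sketch below is an illustration under assumptions, not the answerer's code: the sample rows are copied from the expected-output table, `perl = TRUE` is added, and `extract_between` is a hypothetical helper name. Because `sub()` is vectorised, it returns one character vector of the same length as the column, so no intermediate data frame of split pieces is built.

```r
# Sample rows taken from the question's expected-output table.
data <- data.frame(
  text = c("abc/1234&", "qwertyabc/5555&", "a&sdfghabc/ppp&plksa&",
           "z&xabc/lkjh&poiuw&", "lkjqwefasrjabc/855698&plkjdhweb"),
  stringsAsFactors = FALSE
)

# One vectorised pass that adds a single new character column.
data$parsed_out <- sub(".*?abc/([^&]+).*", "\\1", data$text, perl = TRUE)

# sub() returns the input unchanged when the pattern does not match.
# Hypothetical helper that returns NA for such rows instead.
extract_between <- function(text, pattern = "(?<=abc/)[^&]+") {
  m <- regexpr(pattern, text, perl = TRUE)      # first match position, -1 if none
  hit <- !is.na(m) & as.vector(m) != -1L
  out <- rep(NA_character_, length(text))
  out[hit] <- regmatches(text, m)               # regmatches() keeps matched elements only
  out
}

print(data)
print(extract_between(c("qwertyabc/5555&", "no delimiter here")))
```

Which variant fits better depends on the data: the `sub()` call is shorter, while the `regexpr()`/`regmatches()` version makes rows without abc/ easy to spot afterwards.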