Selva Selva - 3 months ago 10
R Question

Regex to replace wiki citation in R

What regex would remove the citations from a Wikipedia article?

Example Input:

text <- "[76][note 7] just like traditional Hinduism regards the Vedas "


Expected Output:

"just like traditional Hinduism regards the Vedas"


I tried:

> text <- "[76][note 7] just like traditional Hinduism regards the Vedas "
> library(stringr)
> str_replace_all(text, "\\[ \\d+ \\]", "")
[1] "[76][note 7] just like traditional Hinduism regards the Vedas "

Answer

This should do the trick:

trimws(sub("\\[.*\\]", "",text))

Result:

[1] "just like traditional Hinduism regards the Vedas"

This pattern looks for an opening bracket (\\[), a closing bracket (\\]), and everything in between (.*).

By default, .* is greedy: it matches as much as possible, running past any intermediate closing and opening brackets until it reaches the last closing bracket in the string. The whole match is then replaced by an empty string.

Finally, the trimws function removes the whitespace at the start and end of the result.
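
To illustrate the greedy behaviour described above, here is a small sketch in base R. The input `mid` is a made-up sentence (not from the question) with a second citation in the middle, which is where a greedy pattern can remove more than intended:

```r
# Hypothetical input: one citation at the start, one in the middle.
mid <- "[76] just like traditional Hinduism [34] regards the Vedas"

# Greedy: .* runs from the first "[" to the last "]", so the words
# between the two citations are removed along with them.
greedy <- trimws(sub("\\[.*\\]", "", mid))

# Show exactly what the greedy pattern matched.
matched <- regmatches(mid, regexpr("\\[.*\\]", mid))

print(greedy)
print(matched)
```

So the greedy pattern is only safe when all the citations sit together, as in the original example.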

Edit: Erasing citations throughout the sentence

Should there be citations at several points in the sentence, the pattern and function change (gsub replaces every match, whereas sub replaces only the first):

trimws(gsub(" ?\\[.*?\\]", "", text))

For example, if it were applied to each of these sentences:

text1 <- "[76][note 7] just like traditional Hinduism [34] regards the Vedas "
text2 <- "[76][note 7] just like traditional Hinduism[34] regards the Vedas "

The respective results would be:

[1] "just like traditional Hinduism regards the Vedas"
[1] "just like traditional Hinduism regards the Vedas"

Pattern changes:

.*? changes the quantifier from greedy to lazy: it matches as few characters as possible, stopping at the first closing bracket.

The leading " ?" (a space followed by a question mark) matches an optional space before the opening bracket.
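
The lazy approach can be wrapped in a small function, sketched below. The name `strip_citations` is hypothetical, not something from the original answer. This version uses an optional leading space and no required trailing space, so the words on either side of a mid-sentence citation stay separated by a single space:

```r
# Hypothetical helper: remove every bracketed citation from each
# element of a character vector.
strip_citations <- function(x) {
  # " ?"   optional space before the bracket
  # "\\["  literal opening bracket
  # ".*?"  lazy: as few characters as possible
  # "\\]"  literal closing bracket
  trimws(gsub(" ?\\[.*?\\]", "", x))
}

text1 <- "[76][note 7] just like traditional Hinduism [34] regards the Vedas "
text2 <- "[76][note 7] just like traditional Hinduism[34] regards the Vedas "

# gsub is vectorised, so both sentences can be cleaned in one call.
cleaned <- strip_citations(c(text1, text2))
print(cleaned)
```

Because gsub works on whole character vectors, the same call should also work on a column of text, for example a vector of paragraphs from an article.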
