Jianli Cheng - 3 months ago
Python Question

Python regex - find patterns in a file and put them in a list

I am trying to use a regex to find all matches of a pattern in a BibTeX file. The file looks like this:

bib_file = """
@article{Fu_2007_ssr,
doi = {10.1016/j.surfrep.2007.07.001}
}

@article{Shibuya_2007_apl,
doi = {10.1063/1.2816907}
}
"""


My goal is to find every match that runs from @article to the closing } and put these matches into a list, so that the final list looks like this:

['@article{Fu_2007_ssr,\n doi = {10.1016/j.surfrep.2007.07.001}\n }',
'@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n }']


Currently, I have my code:

rx_sequence = re.compile(r'(@article(.*)}\n)', re.DOTALL)
article = rx_sequence.search(bib_file).group(1)


But article is a single string. How can I find each match and append it to a list?

Answer

You can match all these articles with

r"(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)"

Use it with the re.DOTALL flag so that . matches any character, including a newline.

Pattern details:

  • (@article.*?\n[ \t]*}[ \t]*) - Group 1 capturing a sequence of:
    • @article - a literal text @article
    • .*? - any zero or more chars, as few as possible, up to the first...
    • \n[ \t]*}[ \t]* - a newline, followed by 0+ spaces/tabs, a } and again 0+ spaces/tabs, and...
  • (?:\n|$) - either a newline (\n) or end of string ($).
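The same breakdown can be written directly into the pattern with re.VERBOSE, which ignores whitespace outside character classes and allows # comments. This is only a sketch of an alternative spelling; the names verbose, compact and s are chosen here for illustration:

```python
import re

# Same pattern as above, annotated piece by piece.
# In VERBOSE mode the space inside [ \t] is kept because it sits in a character class.
verbose = re.compile(r"""
    (                   # group 1: one whole entry
      @article          # literal text
      .*?               # as few chars as possible (newlines too, via DOTALL)
      \n[ \t]*}[ \t]*   # a line that holds only the closing brace
    )
    (?:\n|$)            # then a newline or the end of the string
""", re.DOTALL | re.VERBOSE)

compact = re.compile(r'(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)', re.DOTALL)

s = "@article{A,\ndoi = {1}\n}\n\n@article{B,\n doi = {2}\n}"
print(verbose.findall(s) == compact.findall(s))  # True
```

Because the pattern has exactly one capturing group, findall returns the text of that group for each match rather than the full match.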

Python demo:

import re
p = re.compile(r'(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)', re.DOTALL)
s = "@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}\n\n@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}"
print(p.findall(s))
# => ['@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}',
#     '@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}']
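Since the question is about a file, here is a minimal sketch of how the same pattern might be wrapped for reuse. The helper names find_articles and find_articles_in_file are hypothetical, and the sketch assumes the file is small enough to read into memory and is UTF-8 encoded:

```python
import re
from pathlib import Path

ARTICLE_RE = re.compile(r'(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)', re.DOTALL)

def find_articles(text):
    """Return every @article entry found in text as a list of strings."""
    return ARTICLE_RE.findall(text)

def find_articles_in_file(path, encoding="utf-8"):
    """Read a whole .bib file (hypothetical path) and extract its @article entries."""
    return find_articles(Path(path).read_text(encoding=encoding))

# The string from the question, including its leading and trailing newlines.
bib_file = """
@article{Fu_2007_ssr,
doi = {10.1016/j.surfrep.2007.07.001}
}

@article{Shibuya_2007_apl,
doi = {10.1063/1.2816907}
}
"""

articles = find_articles(bib_file)
print(len(articles))  # 2
```

findall already returns a list, so no explicit loop with append is needed.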

Note that unrolling the pattern as

@article.*(?:\n(?![ \t]*}[ \t]*(?:\n|$)).*)*\s*}

will make it more robust and usually faster, because it consumes the entry line by line instead of expanding a lazy .*? one character at a time. This regex does not require the re.DOTALL flag.
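A short sketch of the unrolled pattern in use, on the same sample string as the demo above. The variable names are illustrative. Note that this pattern has no capturing group, so findall returns the full matches:

```python
import re

# No re.DOTALL here: each .* stops at the end of a line, and the
# negative lookahead refuses to step onto a line that holds only "}".
unrolled = re.compile(r'@article.*(?:\n(?![ \t]*}[ \t]*(?:\n|$)).*)*\s*}')

s = ("@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}\n\n"
     "@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}")

matches = unrolled.findall(s)
print(matches)
# => ['@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}',
#     '@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}']
```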