Liam Liam - 5 months ago 24
LaTeX Question

How to extract content between a prefix and a suffix?

I want to extract text from {inside} the curly brackets. The differences between those texts are the prefixes, such as \section{ or \subsection{, which I want to use to categorize everything accordingly. Every end needs to be set by the next closing curly bracket.

file = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

# These are some Latex commands to be considered:

heading_1 = "\\\\section{"
heading_2 = "\\\\subsection{"

# This is my attempt.

for letter in file:
    print("The current letter: " + letter + "\n")

I want to process a Latex file by using Python to convert it for my database.


If you just want the (section-level, title) pairs for the whole file, you can use a simple regex:

import re

codewords = [
    'section',
    'subsection',
    # add others here if you want to
]

regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords)))

Sample usage:

In [15]: text = r'''
    ...: \section{First section}
    ...: \subsection{Subsection one}
    ...: Some text
    ...: \subsection{Subsection two}
    ...: Other text
    ...: \subsection{Subsection three}
    ...: Some other text
    ...: Also some more text \texttt{other stuff}
    ...: \section{Second section}
    ...: \section{Third section}
    ...: \subsection{Last subsection}
    ...: '''

In [16]: regex.findall(text)
[('section', 'First section'),
 ('subsection', 'Subsection one'),
 ('subsection', 'Subsection two'),
 ('subsection', 'Subsection three'),
 ('section', 'Second section'),
 ('section', 'Third section'),
 ('subsection', 'Last subsection')]

By changing the contents of the codewords list you'll be able to match more kinds of commands.
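As a small illustration of extending the list, the sketch below adds 'chapter' and 'subsubsection' and rebuilds the same pattern. Both extra entries and the sample text are made up for this example; they are not part of the original question.

```python
import re

# Hypothetical extended list; 'chapter' and 'subsubsection' are illustrative additions.
codewords = ['chapter', 'section', 'subsection', 'subsubsection']

# Same pattern as above: a backslash, one of the codewords, then the text up to the next '}'.
regex = re.compile(
    r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords))
)

# Made-up sample text (raw string so the backslashes are kept as-is).
sample = r'''
\chapter{Intro}
\section{Background}
\subsubsection{A detail}
'''

matches = regex.findall(sample)
print(matches)
```

The order of the entries should not matter here, because the opening brace has to follow the command name directly, so 'section' cannot match in the middle of 'subsection'.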

To apply this to a file simply read() it first:

with open('myfile.tex') as f:
    results = regex.findall(f.read())

If you have the guarantee that each of those commands fits entirely on a single line, then you can be more memory efficient and do:

with open('myfile.tex') as f:
    results = []
    for line in f:
        results.extend(regex.findall(line))

Or if you want to be a bit more fancy:

from itertools import chain

with open('myfile.tex') as f:
    # map() is lazy in Python 3, so build the list while the file is still open
    results = list(chain.from_iterable(map(regex.findall, f)))

Note however that if you have something like:

\section{A very 
    long title}

the line-by-line approach will miss it, whereas the solution using read() will get that section too.
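The difference can be seen with a minimal sketch. It assumes only the two commands from the question, uses io.StringIO as a stand-in for an open file, and the sample text is made up. The last step, collapsing whitespace inside the title, is an optional extra and not part of the approach above.

```python
import io
import re

regex = re.compile(r'\\(section|subsection)\{([^}]+)\}')

# Made-up sample with a title broken across two lines.
source = '\\section{A very\n    long title}\n\\subsection{Short}\n'

# Line by line: the broken title is never seen whole, so it is missed.
per_line = []
for line in io.StringIO(source):  # StringIO stands in for an open file
    per_line.extend(regex.findall(line))

# Whole text at once: [^}]+ also matches newlines, so the title is found.
whole = regex.findall(source)

# Optionally collapse the line break and indentation inside each title.
cleaned = [(level, ' '.join(title.split())) for level, title in whole]

print(per_line)
print(whole)
print(cleaned)
```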

In any case, be aware that the slightest change in format will break this kind of solution. For a safer alternative you'll have to look for a proper LaTeX parser.
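One example of such a format change is a title that contains braces of its own, such as \section{A \textbf{bold} title}: the simple pattern stops at the first closing brace and returns A \textbf{bold. The sketch below works around that one case by counting braces. It is not a LaTeX parser: the function name extract_titles is made up here, and it ignores escaped braces, comments, and optional arguments.

```python
import re

# Matches only the command prefix; the argument is read by counting braces.
prefix = re.compile(r'\\(section|subsection)\{')

def extract_titles(text):
    """Return (command, title) pairs, allowing nested {...} inside the title."""
    results = []
    pos = 0
    while True:
        match = prefix.search(text, pos)
        if match is None:
            break
        depth = 1          # we are just past the opening brace
        i = match.end()
        while i < len(text) and depth > 0:
            if text[i] == '{':
                depth += 1
            elif text[i] == '}':
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop instead of guessing
        # i is one past the matching closing brace
        results.append((match.group(1), text[match.end():i - 1]))
        pos = i
    return results

sample = r'\section{A \textbf{bold} title} text \subsection{Plain}'
print(extract_titles(sample))
```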

If you want to group together the subsections "contained" in a given section, you can do so after obtaining the result with the above solution, using something like itertools.groupby:

from itertools import groupby, count, chain

results = regex.findall(text)

def make_key(counter):
    def key(match):
        nonlocal counter
        val = next(counter)
        if match[0] == 'section':
            val = next(counter)
        counter = chain([val], counter)
        return val
    return key

organized_result = {}

for key, group in groupby(results, key=make_key(count())):
    _, section_name = next(group)
    organized_result[section_name] = section = []
    for _, subsection_name in group:
        section.append(subsection_name)

And the final result will be:

In [12]: organized_result
{'First section': ['Subsection one', 'Subsection two', 'Subsection three'],
 'Second section': [],
 'Third section': ['Last subsection']}

Which matches the structure of the text at the beginning of the post.
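If the groupby key function feels too indirect, the same two-level grouping can probably be done with a plain loop. This is an alternative sketch, not the approach shown above; it assumes only 'section' and 'subsection' are matched and uses a shortened, made-up sample.

```python
import re

regex = re.compile(r'\\(section|subsection)\{([^}]+)\}')

text = r'''
\section{First section}
\subsection{Subsection one}
\subsection{Subsection two}
\section{Second section}
\section{Third section}
\subsection{Last subsection}
'''

organized = {}
current = None  # list of subsections for the most recent section

for level, title in regex.findall(text):
    if level == 'section':
        # start a new, empty list of subsections for this section
        current = organized[title] = []
    elif current is not None:
        current.append(title)
    # subsections appearing before any section are ignored in this sketch

print(organized)
```

As with the groupby version, two sections with the same title would overwrite each other in the dictionary.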

If you want to make this extensible using the codewords list, things will get quite a bit more complex.
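One possible direction, sketched under the assumption that the codewords list is ordered from the highest-level command to the lowest, is to keep a stack of the currently open headings and nest each match under the nearest higher-level one. The function name build_tree and the {'title', 'children'} node layout are made up for this illustration.

```python
import re

# Assumed ordering: earlier entries are higher-level commands.
codewords = ['section', 'subsection', 'subsubsection']
rank = {word: depth for depth, word in enumerate(codewords)}

regex = re.compile(
    r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords))
)

def build_tree(matches):
    """Nest (command, title) pairs into a list of {'title', 'children'} dicts."""
    root = []
    stack = [(-1, root)]  # (rank, children list) of the currently open headings
    for command, title in matches:
        level = rank[command]
        # Close every open heading at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        node = {'title': title, 'children': []}
        stack[-1][1].append(node)
        stack.append((level, node['children']))
    return root

sample = r'''
\section{One}
\subsection{One A}
\subsubsection{One A i}
\subsection{One B}
\section{Two}
'''

tree = build_tree(regex.findall(sample))
print(tree)
```

A heading that skips a level (for instance a subsubsection directly under a section) is simply attached to the nearest open heading above it.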