Vojtech R. - 7 months ago
Python Question

Parsing srt subtitles

I want to parse srt subtitles:

1
00:00:12,815 --> 00:00:14,509
Chlapi, jak to jde s
těma pracovníma světlama?.

2
00:00:14,815 --> 00:00:16,498
Trochu je zesilujeme.

3
00:00:16,934 --> 00:00:17,814
Jo, sleduj.


I want to parse every item into a structure, using one of these regexes:

A:

RE_ITEM = re.compile(r'(?P<index>\d+).'
                     r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
                     r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
                     r'(?P<text>.*?)', re.DOTALL)


B:

RE_ITEM = re.compile(r'(?P<index>\d+).'
                     r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
                     r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
                     r'(?P<text>.*)', re.DOTALL)


And this code:

result = []
for i in Subtitles.RE_ITEM.finditer(text):
    result.append((i.group('index'), i.group('start'),
                   i.group('end'), i.group('text')))


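With regex B I get only one item in the list (the greedy .* swallows everything up to the end of the file), and with regex A every item has an empty 'text' (the non-greedy .*? is the last thing in the pattern, so it matches as little as possible, which is nothing).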
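A small reproduction of both behaviours, in case it helps to see them side by side. The names HEAD, RE_A, RE_B and sample are made up for this sketch, and sample is just the first two cues from above:

```python
import re

# Shared part of both patterns (index, start and end timestamps).
HEAD = (r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).')

RE_A = re.compile(HEAD + r'(?P<text>.*?)', re.DOTALL)  # non-greedy
RE_B = re.compile(HEAD + r'(?P<text>.*)', re.DOTALL)   # greedy

sample = (
    "1\n"
    "00:00:12,815 --> 00:00:14,509\n"
    "Chlapi, jak to jde s\n"
    "těma pracovníma světlama?.\n"
    "\n"
    "2\n"
    "00:00:14,815 --> 00:00:16,498\n"
    "Trochu je zesilujeme.\n"
)

# A: one match per cue, but nothing forces .*? to consume any text.
texts_a = [m.group('text') for m in RE_A.finditer(sample)]

# B: a single match whose text runs to the end of the input.
texts_b = [m.group('text') for m in RE_B.finditer(sample)]

print(texts_a)
print(len(texts_b))
```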

How can I fix this?

Thanks

Answer

The text is followed by an empty line or by the end of the file, so you can keep the non-greedy group and tell it where to stop:

r' .... (?P<text>.*?)(\n\n|$)'
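
Spelled out in full, that could look roughly like the sketch below. It reuses the question's pattern and only appends the terminator; parse_srt and SAMPLE are names invented for this example, and it assumes the file uses plain \n line endings (a file with \r\n would need normalizing first).

```python
import re

RE_ITEM = re.compile(
    r'(?P<index>\d+).'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
    # Non-greedy text, stopped by a blank line or the end of the input.
    r'(?P<text>.*?)(\n\n|$)', re.DOTALL)


def parse_srt(text):
    """Return a list of (index, start, end, text) tuples, all strings."""
    return [(m.group('index'), m.group('start'),
             m.group('end'), m.group('text'))
            for m in RE_ITEM.finditer(text)]


SAMPLE = (
    "1\n"
    "00:00:12,815 --> 00:00:14,509\n"
    "Chlapi, jak to jde s\n"
    "těma pracovníma světlama?.\n"
    "\n"
    "2\n"
    "00:00:14,815 --> 00:00:16,498\n"
    "Trochu je zesilujeme.\n"
    "\n"
    "3\n"
    "00:00:16,934 --> 00:00:17,814\n"
    "Jo, sleduj.\n"
)

items = parse_srt(SAMPLE)
for item in items:
    print(item)
```

On the sample from the question this gives three tuples, and the first one keeps both lines of its text joined by a newline. Because re.MULTILINE is not set, $ only matches at the very end of the input (or just before a final newline), so it does not cut multi-line text short.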