Marco Neves - 2 months ago
Python Question

Python Regex matching and replacing

I have a PDF file with its contents formatted as follows:


00:12 There once lived a man...

00:18 who was thought to have...


and the list goes on following the same pattern. Now I'm trying to write a regex program that will read the file, remove all of the time stamps, and replace the line breaks with spaces. In other words, I want to make one big paragraph out of it.

This is what I came up with for the regular expression:

transcript.replace(transcript.matches("^[0-9:]+$"),"")


That should get rid of any numbers and colons, meaning the time stamps. Now I'm not sure how to replace the line breaks. Would I do something like this?

transcript.replace(transcript.matches("^[\n]+$"), " ")


Any help would be appreciated. Thanks!

CJC
Answer

Couldn't you just check for a blank line, skip (or delete) those lines and use your transcript code to handle the timestamps?

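import re

cleaned = []
for line in file:  # file: an open text file holding the transcript
    if line.strip() == "":  # a blank line is read as "\n", not ""
        continue  # skip blank lines
    # str has no matches(); re.sub strips the leading timestamp instead
    cleaned.append(re.sub(r"^[0-9:]+\s*", "", line))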

This may return a block of text with the following appearance:

There once lived a man...

who was thought to have...

You would still need to join these lines into one continuous paragraph. Do the three dots appear at the end of each line as in your quoted text?
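
If the text has already been extracted from the PDF into a single string, one way to do both steps at once might look like the sketch below. It assumes every timestamp sits at the start of a line in mm:ss (or hh:mm:ss) form; the function name flatten_transcript and the sample string are made up for illustration and are not from the original post.

```python
import re

def flatten_transcript(text):
    """Remove leading timestamps and join all lines into one paragraph."""
    # With re.MULTILINE, ^ matches at the start of every line, so this drops
    # a timestamp such as "00:12" (and the spaces after it) from each line.
    without_stamps = re.sub(
        r"^\d{1,2}:\d{2}(?::\d{2})?[ \t]*", "", text, flags=re.MULTILINE
    )
    # split() with no arguments splits on any run of whitespace, including
    # blank lines, so joining with one space gives a single paragraph.
    return " ".join(without_stamps.split())

# Hypothetical sample mirroring the format shown in the question.
sample = "00:12 There once lived a man...\n\n00:18 who was thought to have...\n"
print(flatten_transcript(sample))
# prints: There once lived a man... who was thought to have...
```

Because the pattern is tied to the start of a line, a time mentioned in the middle of a sentence would be left alone. If the real file formats its timestamps differently, the pattern would need adjusting.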