Marco Neves - 14 days ago
Python Question

Python Regex matching and replacing

I have a pdf file with its contents formatted as follows:


00:12 There once lived a man...

00:18 who was thought to have...


and the list goes on following the same pattern. Now I'm trying to write a regex-based program that will read the file, remove all of the timestamps, and replace the line breaks with spaces. In other words, I want to make one big paragraph out of it.

This is what I came up with for the regular expression:

transcript.replace(transcript.matches("^[0-9:]+$"),"")


and that should get rid of any numbers and colons, meaning the timestamps. Now I'm not sure how to replace the line breaks. Would I do something like this?

transcript.replace(transcript.matches("^[\n]+$"), " ")


Any help would be appreciated. Thanks!

CJC
Answer

Couldn't you just check for a blank line, skip (or delete) those lines and use your transcript code to handle the timestamps?

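import re

# Assumes the PDF text was already extracted to a plain-text file (the name is a placeholder).
with open("transcript.txt") as file:
    for line in file:
        if line.strip() == "":  # a blank line is read as "\n", so strip before comparing
            continue  # skip blank lines
        # remove a leading timestamp such as "00:12 " and print what is left
        print(re.sub(r"^[0-9:]+\s*", "", line.rstrip("\n")))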

This should give a block of text that looks something like this:

There once lived a man...

who was thought to have...

You would still need to join those lines into one continuous paragraph. Do the three dots appear at the end of each line as in your quoted text?
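
For what it's worth, the snippet in the question would not run in Python as written: strings have no matches method, and str.replace works on literal text rather than patterns. The pattern "^[0-9:]+$" would also only match a line made up entirely of digits and colons. Below is a minimal sketch of the whole cleanup with the re module, assuming the transcript text has already been extracted from the PDF into a string. The function name flatten_transcript and the sample text are made up for illustration.

```python
import re

def flatten_transcript(text):
    """Remove leading timestamps and join the remaining lines with spaces."""
    # With re.MULTILINE, ^ matches at the start of every line, so this drops
    # a stamp such as "00:12" (or "1:02:45") plus the spaces that follow it.
    without_stamps = re.sub(r"^\d{1,2}(?::\d{2})+[ \t]*", "", text, flags=re.MULTILINE)
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines and blank lines collapse into single spaces when rejoined.
    return " ".join(without_stamps.split())

sample = "00:12 There once lived a man...\n\n00:18 who was thought to have...\n"
print(flatten_transcript(sample))
# There once lived a man... who was thought to have...
```

The timestamp pattern is anchored to the start of a line on purpose, so numbers that appear later in a sentence are left alone. If your timestamps have a different shape, the pattern would need adjusting.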
