maximegir - 3 years ago
Python Question

Parsing transcripts with regular expression

I have a text whose format resembles this sample:


PAUL: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor.

LEONARD: Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.

EVIL NINJA [on the roof]: In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.

PAUL [SCREAMING]: Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus.


I also have a regular expression to parse the transcripts into dialogs:

[A-Z]+([:]|[ ]{1}[[][A-Z]*[]])


I am trying to capture all the speakers, so that the regular expression matches

"PAUL:",
"LEONARD [some context]:"


As you can see, I have not been able to capture all of the speakers. For example, this one is not matched:


EVIL NINJA [on the roof]:


How can I capture the above as well? Is regex even the right way to go for this?

Edit: All the speakers' names are in caps and end with a colon. All of the transcripts I'm dealing with are in this format.

Answer

The problem with your regex is that it allows only capital letters, both in the name and inside the brackets. It permits no whitespace and no lowercase letters, so it doesn't match "EVIL NINJA" or "on the roof".
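To make this concrete, here is a small check of the question's pattern against the four speaker labels from the sample. It is only a sketch: the `labels` list and `results` dict are names made up for illustration, and the opening bracket is written as `\[` instead of `[[]` (equivalent, but it avoids a FutureWarning on recent Python versions).

```python
import re

# The question's pattern; "[[]" is written as "\[" (same meaning)
# to avoid a FutureWarning about nested sets on recent Python versions.
original = r'[A-Z]+([:]|[ ]{1}\[[A-Z]*[]])'

# Speaker labels copied from the sample transcript.
labels = [
    'PAUL:',
    'LEONARD:',
    'EVIL NINJA [on the roof]:',
    'PAUL [SCREAMING]:',
]

results = {}
for label in labels:
    m = re.search(original, label)
    # Store the matched text, or None when nothing matches.
    results[label] = m.group(0) if m else None
    print(repr(label), '->', repr(results[label]))
```

The "EVIL NINJA [on the roof]:" label gives no match at all. Note also that for "PAUL [SCREAMING]:" the match stops at the closing bracket, because the pattern only expects a colon when there is no bracketed context.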

But yes, regex is absolutely the right way to do this. You can try this:

([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:
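If the pattern is hard to read, it can be spelled out piece by piece with `re.VERBOSE`. This is the same pattern, just reformatted; the variable name `speaker` and the test string are only for illustration. In verbose mode a literal space outside a character class has to be written as `[ ]`.

```python
import re

# Same pattern, written with re.VERBOSE so each part can carry a comment.
speaker = re.compile(r'''
    ([A-Z][A-Z ]*)      # group 1: the name, capitals and spaces, starting with a capital
    (?:                 # optional non-capturing wrapper around the context
        [ ]\[           #   one space, then an opening bracket
        ([\w ]+)        #   group 2: the context, word characters and spaces
        \]              #   a closing bracket
    )?
    :                   # the colon that ends the label
''', re.VERBOSE)

m = speaker.search('EVIL NINJA [on the roof]: In enim justo')
print(m.group(1), '|', m.group(2))
```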

Usage:

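import re

# `text` is assumed to hold the whole transcript as a single string.
regex = r'([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:'

for match in re.finditer(regex, text):
    print('person:', match.group(1))    # speaker name
    print('context:', match.group(2))   # bracketed context, or None
    print()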

Output:

person: PAUL
context: None

person: LEONARD
context: None

person: EVIL NINJA
context: on the roof

person: PAUL
context: SCREAMING
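
Since the goal is to parse the transcript into dialogs, one possible next step is to use the match positions to slice out what each speaker says. The following is a minimal sketch, not a definitive implementation: `parse_dialogs`, `LABEL` and the shortened `sample` text are hypothetical names and data, and the `^` anchor with `re.MULTILINE` is an addition to the pattern above that assumes every speaker label starts a line, as in the sample.

```python
import re

# Assumption: every speaker label starts at the beginning of a line.
# The ^ anchor (with re.MULTILINE) is an addition to the pattern above,
# meant to avoid matching capitals followed by a colon mid-sentence.
LABEL = re.compile(r'^([A-Z][A-Z ]*)(?: \[([\w ]+)\])?:', re.MULTILINE)

def parse_dialogs(text):
    """Return a list of (speaker, context, speech) tuples."""
    matches = list(LABEL.finditer(text))
    dialogs = []
    for i, m in enumerate(matches):
        # The speech runs from the end of this label to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        dialogs.append((m.group(1), m.group(2), text[m.end():end].strip()))
    return dialogs

# Shortened version of the sample transcript, for illustration only.
sample = (
    "PAUL: Lorem ipsum dolor sit amet.\n\n"
    "EVIL NINJA [on the roof]: In enim justo, rhoncus ut.\n\n"
    "PAUL [SCREAMING]: Aliquam lorem ante.\n"
)

for speaker, context, speech in parse_dialogs(sample):
    print(speaker, context, speech, sep=' | ')
```

If labels can also appear in the middle of a line in your transcripts, drop the `^` anchor and the `re.MULTILINE` flag.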