Cody Cody - 1 month ago 5
HTML Question

Extracting message from emails, but returning sloppy text passages

So I've created a method that strips HTML tags, style/script elements, and newline characters from an email page's source code:

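import re
from bs4 import BeautifulSoup

def extract_message(url):
    # url is the path to a saved email page on disk
    with open(url) as markup:
        soup = BeautifulSoup(markup, "html.parser")
    # Drop script and style elements so only visible text remains
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    text_clean = re.sub(r"\n", " ", text)
    text_clean_more = text_clean.replace(u'\xa0', u' ')
    a = text_clean_more.find('From:')
    print(text_clean_more[a:])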


Then I have it print everything from the first occurrence of 'From:' to the end of the email. However, after it goes through this process, I get back a very spaced-out and generally sloppy passage of text, such as:

[image: screenshot of the resulting output]

My goal is to print out a clean passage of text. Is there any way I could do this? I've been racking my brain over this for several hours now and haven't come up with anything sensible. Just looking for a push in the right direction, thanks.

Answer

Use the email module to extract message bodies instead of hacking them apart by hand, and use the textwrap module to format the message text into nice paragraphs. Assuming rawtext holds the message body as a single string, this will probably work:

import textwrap
paras = rawtext.split("\n\n")  # Split into paragraphs, if any
formatted = "\n\n".join(textwrap.fill(p) for p in paras)

But take a look at the textwrap documentation for keyword options you can specify.
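
To make the suggestion above concrete, here is a minimal sketch that combines the two modules. It assumes Python 3.6 or later and that you have the raw message (headers plus body) as bytes, rather than an HTML page saved from a webmail view. The helper names extract_body and tidy and the sample message are made up for illustration. Note also that textwrap.fill does not collapse runs of spaces by itself, so the sketch normalizes whitespace in each paragraph before wrapping, on the assumption that stray spacing is what makes the output look sloppy.

```python
import email
import textwrap
from email import policy


def extract_body(raw_bytes):
    """Return the plain-text body of a raw message (hypothetical helper)."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() returns the best matching part; ask for text/plain only
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def tidy(rawtext, width=70):
    """Collapse stray whitespace and re-wrap each paragraph (hypothetical helper)."""
    # Normalize line endings so paragraphs are separated by "\n\n"
    text = "\n".join(rawtext.splitlines())
    paras = [p for p in text.split("\n\n") if p.strip()]
    # " ".join(p.split()) squeezes runs of spaces and newlines into single spaces
    return "\n\n".join(
        textwrap.fill(" ".join(p.split()), width=width) for p in paras
    )


# Made-up sample message, for illustration only
sample = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Hello\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"This   is the first\r\nparagraph of   the message.\r\n"
    b"\r\n"
    b"This is the second paragraph.\r\n"
)

body = extract_body(sample)
clean = tidy(body, width=40)
print(clean)
```

If the message only has an HTML part, get_body(preferencelist=("plain",)) returns None and this sketch yields an empty string; in that case you would still need something like BeautifulSoup to reduce the HTML part to text before passing it to tidy.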