arno_v - 7 months ago
Python Question

Python html2text adds random \n

When using the html2text Python package to convert HTML to Markdown, it adds '\n' to the text. I also see this behaviour when trying the demo at http://www.aaronsw.com/2002/html2text/

Is there any way to turn this off? Of course I can remove them myself, but there might be occurrences of '\n' in the original text which I don't want to remove.

html2text('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.')

u'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod\ntempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo\nconsequat. Duis aute irure dolor in reprehenderit in voluptate velit esse\ncillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non\nproident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n\n'

Answer

Looking at the source of html2text.py, the '\n' characters come from line wrapping, and it looks like you can disable the wrapping behavior by setting BODY_WIDTH to 0. Something like this:

import html2text
html2text.BODY_WIDTH = 0
text = html2text.html2text('...')

Of course, setting BODY_WIDTH this way changes the module's behavior globally. If I needed this functionality, I'd probably patch the module to add a parameter to html2text() that controls wrapping per call, and send the patch back to the author.
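
Short of patching the library, one way to limit the global side effect is to override the setting only for the duration of a call and restore it afterwards. The sketch below is a generic helper, not part of html2text: `temporary_attr` is a hypothetical name, and the demonstration uses a stand-in object (with a placeholder width of 78) instead of the real module so that it runs without html2text installed. It is not thread-safe, because the value is still shared while the block is active.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attr(obj, name, value):
    """Temporarily set obj.<name> to value, restoring the old value on exit."""
    old = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Restore the previous value even if the body raises.
        setattr(obj, name, old)


# Stand-in for the html2text module (assumption: the real module exposes
# a module-level BODY_WIDTH, as the answer above describes).
fake_module = SimpleNamespace(BODY_WIDTH=78)

with temporary_attr(fake_module, "BODY_WIDTH", 0):
    inside = fake_module.BODY_WIDTH  # overridden while the block is active

after = fake_module.BODY_WIDTH  # original value restored
```

With the real package this would read `with temporary_attr(html2text, "BODY_WIDTH", 0): text = html2text.html2text(html)`, assuming the installed version reads the module-level BODY_WIDTH at call time, as in the version the answer refers to.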
