I'm trying to extract plain text from a website using Python. My code is something like this (a slightly modified version of what I found here):
import requests
from bs4 import BeautifulSoup
url = "http://www.thelatinlibrary.com/vergil/aen1.shtml"
r = requests.get(url)
k = r.content
file = open('C:\\Users\\Anirudh\\Desktop\\NEW2.txt','w')
soup = BeautifulSoup(k)
for script in soup(["Script","Style"]):
    script.exctract()
text = soup.get_text()
file.write(text)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 8 of the file C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py. To get rid of this warning, change code that looks like this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html.parser")
Traceback (most recent call last):
File "C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py", line 12, in <module>
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
UnicodeEncodeError: 'charmap' codec can't encode character '\x97' in position 2130: character maps to <undefined>
Process finished with exit code 1
The "error" is a warning and is of no consequence. Silence it by naming the parser explicitly:
soup = BeautifulSoup(k, 'html.parser')
There also seems to be a typo: in script.exctract(), the word extract is misspelt. It should be script.extract().
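A minimal sketch of the corrected parsing step, assuming a small inline HTML string (the hypothetical `html` below) in place of the downloaded page. The tag names are written in lowercase here: as far as I know, html.parser lowercases tag names, so "Script" and "Style" would probably match nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for r.content
html = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Arma virumque cano</p></body></html>"
)

# Naming the parser explicitly avoids the UserWarning
soup = BeautifulSoup(html, "html.parser")

# Remove <script> and <style> elements so their contents stay out of the text
for tag in soup(["script", "style"]):
    tag.extract()

text = soup.get_text()
print(text)
```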
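The actual error seems to come from two things together: r.content is a bytestring, so Beautiful Soup has to guess its encoding, and the file is opened in text mode, so Python encodes the text with the platform default codec (cp1252 here, according to the traceback). That codec has no mapping for the character '\x97'. The source contains an em dash, and handling this character is the problem.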
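The failure can be reproduced without a file or a network request. This is only a sketch of one way such a character can arise, on the assumption that the page's em dash is stored as the cp1252 byte 0x97 and was decoded as Latin-1 somewhere along the way; the variable names are illustrative.

```python
# Byte 0x97 is an em dash in cp1252
raw = b"\x97"
em_dash = raw.decode("cp1252")        # U+2014, a real em dash
mis_decoded = raw.decode("latin-1")   # U+0097, a C1 control character

# cp1252 can encode a real em dash back to the same byte ...
round_trip_ok = em_dash.encode("cp1252") == raw

# ... but it has no mapping for U+0097, which is the error in the traceback
try:
    mis_decoded.encode("cp1252")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)
```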
You can encode the output yourself with soup.encode("utf-8"), but this means hardcoding the encoding into your script (which is bad). Alternatively, open the file in binary mode with open(..., 'wb') and write the encoded bytes, or convert the content to a string before passing it to Beautiful Soup, using the correct encoding for that page, with k = str(r.content, "utf-8").
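A sketch of the two file-writing options, assuming UTF-8 is the encoding you want on disk. The sample `text` and the temporary output paths are made up for illustration; passing `encoding=` to `open` in text mode is a third variant that keeps the file in text mode without relying on the platform default.

```python
import os
import tempfile

# Sample text containing an em dash (U+2014), standing in for soup.get_text()
text = "Arma virumque cano \u2014 Troiae qui primus ab oris"

# Hypothetical output location; a temp directory keeps the sketch self-contained
out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, "text_mode.txt")
bytes_path = os.path.join(out_dir, "binary_mode.txt")

# Option 1: text mode with an explicit encoding instead of the platform default
with open(text_path, "w", encoding="utf-8") as f:
    f.write(text)

# Option 2: binary mode, encoding the string yourself before writing
with open(bytes_path, "wb") as f:
    f.write(text.encode("utf-8"))

# Both files should contain the same bytes
with open(text_path, "rb") as f1, open(bytes_path, "rb") as f2:
    same = f1.read() == f2.read()
print(same)
```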