bjornasm bjornasm - 3 months ago 28
Python Question

Encoding error when reading URL with urllib

When I try to scrape a Wikipedia page with a special character in its URL, using urllib.request in Python, I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 23: ordinal not in range(128)


The code:

# -*- coding: utf-8 -*-
import urllib.request as ur

url = "https://no.wikipedia.org/wiki/Jonas_Gahr_Støre"
r = ur.urlopen(url).read()


How can I use urllib.request with utf-8 encoding?

Answer

Apparently, urllib can only send ASCII requests, and converting your URL to ASCII raises an error on the special character. Replacing ø with %C3%B8, its percent-encoded UTF-8 form and the proper way to write this character in an HTTP URL, seems to do the trick. I couldn't find a way to make urlopen do this automatically like your browser does, but urllib.parse.quote in the standard library can produce the percent-encoding for you.

Example:

>>> f="https://no.wikipedia.org/wiki/Jonas_Gahr_St%C3%B8re"
>>> import urllib.request
>>> g=urllib.request.urlopen(f)
>>> text=g.read()
>>> text[:100]
b'<!DOCTYPE html>\n<html class="client-nojs" lang="nb" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

The answer above doesn't work because it encodes after the request has been processed, whereas the error is raised while the request is still being built and sent.
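
The manual replacement of ø with %C3%B8 can probably be automated before the request is made. Below is a minimal sketch using urllib.parse from the standard library. The helper name to_ascii_url is hypothetical (it is not part of urllib), and the sketch assumes the host name itself is plain ASCII, as it is for no.wikipedia.org; a non-ASCII host would need separate IDNA handling.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper for illustration; assumes an ASCII host name.
    """
    parts = urlsplit(url)
    # quote() encodes text as UTF-8 by default. '%' is kept in `safe` so
    # that a URL which is already percent-encoded is not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = to_ascii_url("https://no.wikipedia.org/wiki/Jonas_Gahr_Støre")
print(url)  # https://no.wikipedia.org/wiki/Jonas_Gahr_St%C3%B8re
```

The resulting string contains only ASCII characters, so it can be passed to urllib.request.urlopen(url) in place of the original URL.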