ncw ncw - 11 months ago 90
HTTP Question

File extension from MIME type with ;charset=UTF-8

I have a Python web crawler which is downloading files with different extensions. To get the extension from the HTTP header content type, I am using the Python library mimetypes.

http_header = session.head(url, headers={'Accept-Encoding': 'identity'})
extension = mimetypes.guess_extension(http_header.headers['content-type'])

Everything is working fine, except when the HTTP header content type contains ;charset=UTF-8. E.g. mimetypes.guess_extension() is returning None for the following examples:

content-type: text/plain;charset=UTF-8 # extension should be .txt OR
content-type: text/x-c;charset=UTF-8 # extension should be .java

Check with mimetypes:

>>> import mimetypes
>>> print(mimetypes.guess_extension('text/plain;charset=UTF-8'))
None

Question: How do I handle this and get the correct extension from content-types ending with ;charset=UTF-8?

I guess it is not a good solution to catch such exceptions with an if statement since I never know if the whitelist is complete or whether I am missing some content-type.

Answer

One simple way to deal with that is to split the MIME string on ";" and keep only the first element, which is the media type without its parameters.

The following code will return the expected result for both conditions.

http_header = session.head(url, headers={'Accept-Encoding': 'identity'})
# Drop parameters such as ";charset=UTF-8" before guessing the extension
extension = mimetypes.guess_extension(http_header.headers['content-type'].split(";")[0])
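
If the header may also contain stray whitespace or mixed case, the same idea can be wrapped in a small helper. This is only a sketch: the function name extension_from_content_type and its default parameter are made up for illustration and are not part of mimetypes.

```python
import mimetypes


def extension_from_content_type(content_type, default=None):
    """Guess a file extension from a raw Content-Type header value.

    Hypothetical helper: strips parameters such as ";charset=UTF-8",
    trims whitespace and lowercases the media type before guessing.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    # guess_extension() returns None for unknown types; fall back to default
    return mimetypes.guess_extension(media_type) or default


# Usage with a header value like the ones in the question
print(extension_from_content_type("text/plain;charset=UTF-8"))
```

Passing a default (for example ".bin") avoids having to check for None at every call site when the server sends a type that mimetypes does not know.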

Remember it is a guess. You can't expect much from it for broad definitions such as plain text. It seems like mimetypes.guess_extension() just takes the first element of the list returned by mimetypes.guess_all_extensions(). This is also the reason guessing the extension for text/plain can return .h when .txt is the obvious choice.
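
If the first candidate is not the one you want, one option is to check a small table of preferred extensions before falling back to the guess. The sketch below assumes a hand-maintained table; PREFERRED_EXTENSIONS and pick_extension are hypothetical names, and which extension comes first in the candidate list may differ between Python versions and systems.

```python
import mimetypes

# Hypothetical override table; extend it with whatever your crawler needs.
PREFERRED_EXTENSIONS = {"text/plain": ".txt"}


def pick_extension(media_type):
    """Return a configured extension if there is one, else the first guess."""
    media_type = media_type.strip().lower()
    if media_type in PREFERRED_EXTENSIONS:
        return PREFERRED_EXTENSIONS[media_type]
    # All extensions mimetypes knows for this type (empty list if unknown)
    candidates = mimetypes.guess_all_extensions(media_type)
    return candidates[0] if candidates else None


# Inspect the candidates to see why a plain guess can be surprising
print(mimetypes.guess_all_extensions("text/plain"))
print(pick_extension("text/plain"))
```

Only the types that actually cause trouble need an entry in the table, so this does not turn into the complete whitelist the question wants to avoid.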