houcros houcros - 3 months ago 61
Python Question

What is the encoding of the body of Gmail message? How to decode it?

I am using the Python API for Gmail. I am querying for some messages and retrieving them correctly, but the body of each message looks like total nonsense, even when the MIME type is said to be text/plain or text/html.

I have been searching all over the API docs, but they keep saying it's a string, when it obviously must be encoded somehow... I thought it could be base64, but trying to decode it with Python's base64 module gives me TypeError: Incorrect padding, so either it's not base64 or I'm decoding it badly.

I'd love to provide a good example, but since I'm handling sensitive information I'll have to obfuscate it a bit...

{
  "payload": {
    "mimeType": "multipart/mixed",
    "filename": "",
    "headers": [
      ...
    ],
    "body": {
      "size": 0
    },
    "parts": [
      {
        "mimeType": "multipart/alternative",
        "filename": "",
        "headers": [
          {
            "name": "Content-Type",
            "value": "multipart/alternative; boundary=001a1140b160adc309053bd7ec57"
          }
        ],
        "body": {
          "size": 0
        },
        "parts": [
          {
            "partId": "0.0",
            "mimeType": "text/plain",
            "filename": "",
            "headers": [
              {
                "name": "Content-Type",
                "value": "text/plain; charset=UTF-8"
              },
              {
                "name": "Content-Transfer-Encoding",
                "value": "quoted-printable"
              }
            ],
            "body": {
              "size": 4067,
              "data": "LS0tLS0tLS0tLSBGb3J3YXJkZWQgbWVzc2FnZSAtLS0tLS0tLS0tDQpGcm9tOiBMaW5rZWRJbiA8am9iLWFwcHNAbGlua2VkaW4uY29tPg0KRGF0ZTogU2F0LCBTZXAgMywgMjAxNiBhdCA5OjMwIEFNDQpTdWJqZWN0OiBBcHBsaWNhdGlvbiBmb3IgU2VuaW9yIEJhY2tlbmQgRGV2ZWxvcG..."
            }


The field that I'm talking about is payload.parts[0].parts[0].body.data. I have truncated it at a random point, so I doubt it is decodable like that, but you get the point... What is that encoding?

Also, it wouldn't hurt to know where in the docs they explicitly say it's base64 (unless it's the standard encoding for MIME?).

UPDATE: So in the end there was some bad luck involved. I have 5 mails like this, and it turns out that the first one is malformed, for some unknown reason. After moving on to the other ones, I am able to decode all of them with the approaches suggested in the answers. Thank you all!

Answer

This is base64.

Your truncated message is:

---------- Forwarded message ----------
From: LinkedIn <job-apps@linkedin.com>
Date: Sat, Sep 3, 2016 at 9:30 AM
Subject: Application for Senior Backend Develop

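I had to remove the last 3 characters from your truncated message because I was getting the same padding error as you: base64 text is read in groups of four characters, so a string cut at an arbitrary point usually fails to decode. You probably have some garbage in the message you're trying to decode.

Here's some sample code: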

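import base64

# The sample from the question, trimmed so its length is a multiple of 4.
body = "LS0tLS0tLS0tLSBGb3J3YXJkZWQgbWVzc2FnZSAtLS0tLS0tLS0tDQpGcm9tOiBMaW5rZWRJbiA8am9iLWFwcHNAbGlua2VkaW4uY29tPg0KRGF0ZTogU2F0LCBTZXAgMywgMjAxNiBhdCA5OjMwIEFNDQpTdWJqZWN0OiBBcHBsaWNhdGlvbiBmb3IgU2VuaW9yIEJhY2tlbmQgRGV2ZWxv"

# b64decode returns bytes; decode them as UTF-8 to get readable text.
result = base64.b64decode(body).decode("utf-8")

print(result)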
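If the padding error comes from missing "=" characters rather than from real garbage, re-adding the padding before decoding may be enough. The sketch below is only an illustration: decode_body_data is a hypothetical helper name, the text is assumed to be UTF-8 (as the Content-Type header in the question states), and it uses the URL-safe alphabet ("-" and "_" instead of "+" and "/"), which is the variant the Gmail API reference describes for body.data.

```python
import base64


def decode_body_data(data: str) -> str:
    """Decode a base64url string, re-adding any '=' padding that was stripped."""
    # base64 text is processed in groups of 4 characters, so pad up to the
    # next multiple of 4 before decoding.
    padded = data + "=" * (-len(data) % 4)
    # Assumption: the decoded bytes are UTF-8 text; undecodable bytes are replaced.
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")


# Round trip on made-up data: encode, strip the padding, decode again.
sample = base64.urlsafe_b64encode("Subject: héllo?".encode("utf-8")).decode("ascii")
print(decode_body_data(sample.rstrip("=")))
```

A string whose length is one more than a multiple of 4 cannot be valid base64 at all, so this helper still raises an error for that case; that would point to genuinely damaged data.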

UPDATE

Here's a snippet for getting and decoding the message body. The decoding part was taken from the Gmail API documentation:

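import base64

# Assumes `service` is an authorized Gmail API client and `msg_id` is a message ID.
message = service.users().messages().get(userId='me', id=msg_id, format='full').execute()

# payload.body.data is only present on single-part messages; in a multipart
# message the data sits in a nested part, e.g. payload.parts[0].parts[0].body.data.
data = message['payload']['body']['data']
msg_str = base64.urlsafe_b64decode(data).decode('utf-8')

print(msg_str)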

Reference doc: https://developers.google.com/gmail/api/v1/reference/users/messages/get#python
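For a multipart message like the one in the question, the top-level payload.body is empty and the text lives in a nested part. A minimal sketch of walking the parts tree follows; iter_parts and get_text_body are hypothetical helper names, the payload dict is a made-up stand-in for message['payload'], and UTF-8 text is assumed.

```python
import base64


def iter_parts(payload):
    """Yield the payload itself and every nested part, depth first."""
    yield payload
    for part in payload.get("parts", []):
        yield from iter_parts(part)


def get_text_body(payload, mime_type="text/plain"):
    """Return the decoded text of the first part with this MIME type, or None."""
    for part in iter_parts(payload):
        data = part.get("body", {}).get("data")
        if part.get("mimeType") == mime_type and data:
            # Re-add any stripped padding, then decode the URL-safe base64.
            padded = data + "=" * (-len(data) % 4)
            # Assumption: the part is UTF-8 text.
            return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
    return None


# Made-up stand-in for message["payload"], shaped like the question's example.
payload = {
    "mimeType": "multipart/mixed",
    "body": {"size": 0},
    "parts": [
        {
            "mimeType": "multipart/alternative",
            "body": {"size": 0},
            "parts": [
                {"mimeType": "text/plain", "body": {"size": 4, "data": "LS0tLQ=="}},
            ],
        }
    ],
}
print(get_text_body(payload))
```

Searching by MIME type rather than hard-coding payload.parts[0].parts[0] keeps the lookup working when a message nests its parts differently.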