Riadh Belkebir - 16 days ago
Python Question

Arabic encoding error when using requests restful client

I have the following RESTful client in Python:

import requests

# Sentence to send to the web service
s = 'وإليك ما يقوله إثنان من هؤلاء'
resp = requests.post('http://localhost:8080/MyApp/webresources/production/sendSentence', json={'sentence': s})


The code above calls a web service implemented in Java, which returns the same sentence that the requests client sent.

This is the Java web service:

@POST
@Consumes("application/json")
@Produces("text/html; charset=UTF-8")
@Path("/sendSentence")
public String sendSentence(@Context HttpServletRequest requestContext, String valentryJson) throws Exception {
    try {
        if (valentryJson != null) {
            JSONObject jsonObject;
            jsonObject = new JSONObject(valentryJson);
            String sentence = jsonObject.getString("sentence");

            return sentence;
        }
    } catch (JSONException ex) {
        // Malformed JSON is ignored; fall through and return an empty string
    }
    return "";
}


The problem is the encoding. When I inspect the response content, this is the result:

>>> resp.content

'\xd9\x88\xd8\xa5\xd9\x84\xd9\x8a\xd9\x83 \xd9\x85\xd8\xa7 \xd9\x8a\xd9\x82\xd9\x88\xd9\x84\xd9\x87 \xd8\xa5\xd8\xab\xd9\x86\xd8\xa7\xd9\x86 \xd9\x85\xd9\x86 \xd9\x87\xd8\xa4\xd9\x84\xd8\xa7\xd8\xa1'


Or when I use print:

>>> print resp.content

ظˆط¥ظ„ظٹظƒ ظ…ط§ ظٹظ‚ظˆظ„ظ‡ ط¥ط«ظ†ط§ظ† ظ…ظ† ظ‡ط¤ظ„ط§ط،

Answer

Your Java webservice produces HTML, UTF-8 encoded:

@Produces("text/html; charset=UTF-8")

but you took the raw bytes returned without decoding:

>>> resp.content

resp.content gives you bytes, not Unicode text. You could use the resp.text attribute instead, which uses the charset parameter of the Content-Type header to decode the data:

>>> resp.text
u'\u0648\u0625\u0644\u064a\u0643 \u0645\u0627 \u064a\u0642\u0648\u0644\u0647 \u0625\u062b\u0646\u0627\u0646 \u0645\u0646 \u0647\u0624\u0644\u0627\u0621'
>>> print resp.text
وإليك ما يقوله إثنان من هؤلاء
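
As a minimal sketch of what resp.text does in this case, the bytes from the question can be decoded by hand, with no server involved. The names raw, text and garbled are illustrative. The idea that the console was using the Windows-1256 (cp1256) code page is an assumption, based only on the observation that decoding the same bytes with that codec matches the garbled print output shown in the question:

```python
# -*- coding: utf-8 -*-
# Raw body bytes as shown by resp.content in the question (first two words only).
raw = b'\xd9\x88\xd8\xa5\xd9\x84\xd9\x8a\xd9\x83 \xd9\x85\xd8\xa7'

# Decoding with the charset the service declares (UTF-8) yields the Arabic text.
text = raw.decode('utf-8')
print(text == u'\u0648\u0625\u0644\u064a\u0643 \u0645\u0627')  # True

# Assumption: the terminal interpreted the bytes as cp1256 (Windows Arabic).
# Each two-byte UTF-8 sequence then shows up as two unrelated characters.
garbled = raw.decode('cp1256')
print(len(text))     # 8 characters
print(len(garbled))  # 15 characters
```

So the bytes themselves are fine; only the step that turns them into text was missing.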

Be careful, however: if no charset parameter is present but the Content-Type header indicates a text/... content type (such as text/html), then requests follows the HTTP RFCs and decodes the data as Latin-1. This will silently work but may not be the correct codec. For HTML data, use an HTML parser instead: pass in the bytestring and leave it to the parser to work out which codec is correct (HTML often records the right encoding in a <meta> tag). See "retrieve links from web page using python and BeautifulSoup".
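
The Latin-1 pitfall can be illustrated without a server. The sketch below assumes a UTF-8 body that was decoded as Latin-1 because the Content-Type header carried no charset; the names raw, wrong and fixed are illustrative. With requests itself, if you already know the real encoding, setting resp.encoding = 'utf-8' before reading resp.text should avoid the wrong decode in the first place.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for the first word of the sentence in the question.
raw = b'\xd9\x88\xd8\xa5\xd9\x84\xd9\x8a\xd9\x83'

# Latin-1 maps every byte to a code point, so this never raises an error,
# but the result is not the Arabic text.
wrong = raw.decode('latin-1')

# Latin-1 round-trips losslessly, so the original bytes can be recovered
# and then decoded with the correct codec.
fixed = wrong.encode('latin-1').decode('utf-8')

print(len(wrong))  # 10: one character per byte
print(fixed == u'\u0648\u0625\u0644\u064a\u0643')  # True
```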