Nic Strong Nic Strong - 7 months ago 17
Java Question

International characters in filename in mutipart formdata

I am using Apache HTTP components (4.1-alpha2) to upload a files to dropbox. This is done using multipart form data. What is the correct way to encode filenames in in a multipart form that contain international (non-ascii) characters?

If I use there standard API, the server returns an HTTP status Forbidden. If I modify the upload code so the file name is urlencoded:

MultipartEntity entity = new MultipartEntity(HttpMultipartMode.BROWSER_COMPATIBLE);
FileBody bin = new FileBody(file_obj, URLEncoder.encode(file_obj.getName(), HTTP.UTF_8), HTTP.UTF_8, HTTP.OCTET_STREAM_TYPE );
entity.addPart("file", bin);
req.setEntity(entity);


The file is uploaded, but I end up with a filename that is still encoded. E.g. %D1%82%D0%B5%D1%81%D1%82.txt

Answer

To solve this issue specifically for the dropbox server I had to encode the filename in utf8. To do this I had to declare my multipart entity as follows:

MultipartEntity entity = new MultipartEntity(HttpMultipartMode.BROWSER_COMPATIBLE, null, Charset.forName(HTTP.UTF_8));

I was getting the forbidden because of the OAuth signed entity not matching the actual entity sent (it was being URL encoded).

For those interested on what the standards have to say on this I did some reading of RFCs. If the standard is strictly adhered then all headers should be encoded 7bit, this would make utf8 encoding of the filename illegal. However RFC2388 () states:

The original local file name may be supplied as well, either as a "filename" parameter either of the "content-disposition: form-data" header or, in the case of multiple files, in a "content-disposition: file" header of the subpart. The sending application MAY supply a file name; if the file name of the sender's operating system is not in US-ASCII, the file name might be approximated, or encoded using the method of RFC 2231.

Many posts mention using either rfc2231 or rfc2047 for encoding headers in non US-ASCII in 7bit. However rfc2047 explicitly states in section 5.3 encoded words MUST NOT be used on a Content-Disposition field. This would only leave rfc2231, this however is an extension and cannot be relied upon being implemented in all servers. The reality is most of the major browsers send non-US-ASCII characters in UTF-8 (hence the HttpMultipartMode.BROWSER_COMPATIBLE mode in Apache HTTP client), and because of this most web servers will support this. Another thing to note is that if you use HttpMultipartMode.STRICT on the multipart entity, the library will actually substitute non-ASCII for question mark (?) in the filename.S