user3202825 - 1 year ago
Perl Question

How to retrieve LinkedIn profiles via the API for accented URLs?

I'm trying to get info from the LinkedIn API, but I run into issues when the URLs contain any kind of accented characters.

For non-accented URLs the call to the API works fine and I can retrieve data without problems, but when I try with accented URLs I get an error.

I have tried escaping the URL but it doesn't work:


'' . uri_escape_utf8('xxxxx');


'' . uri_escape('xxxxx');

no escaping:


double escape:

uri_escape_utf8('' . uri_escape_utf8('xxxxx'));

Answer


I'm pretty sure the problem is that you don't have use utf8 at the top of your program. This code correctly encodes the i-diaeresis as %C3%AF and the e-acute as %C3%A9:

use utf8;
use strict;
use warnings qw/ all FATAL /;
use feature 'say';

use URI::Escape qw/ uri_escape_utf8 /;

say uri_escape_utf8('ïs-thévoz-b070838');


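Whereas without use utf8, Perl sees the UTF-8-encoded bytes of the source file instead of characters, so the i-diaeresis arrives as the two bytes "\xC3\xAF" and the e-acute as "\xC3\xA9".

uri_escape_utf8 then treats each of those bytes as a character and encodes it again, turning "\xC3\xAF" into %C3%83%C2%AF and "\xC3\xA9" into %C3%83%C2%A9, so the LinkedIn server gets confused.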
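
The difference can be reproduced without depending on how the source file is saved, by writing both strings with explicit escapes. This is only a minimal sketch using the same profile slug; the variable names are mine and purely illustrative.

```perl
use strict;
use warnings;

use URI::Escape qw/ uri_escape_utf8 /;

# Character string: what Perl sees when "use utf8" is in effect.
# \x{EF} is i-diaeresis and \x{E9} is e-acute, one character each.
my $chars = "\x{EF}s-th\x{E9}voz-b070838";

# Byte string: what Perl sees without "use utf8" when the source
# file is saved as UTF-8. Each accented letter is two separate bytes.
my $bytes = "\xC3\xAFs-th\xC3\xA9voz-b070838";

my $good = uri_escape_utf8($chars);   # UTF-8 encoded once
my $bad  = uri_escape_utf8($bytes);   # UTF-8 encoded twice

print "$good\n";   # %C3%AFs-th%C3%A9voz-b070838
print "$bad\n";    # %C3%83%C2%AFs-th%C3%83%C2%A9voz-b070838
```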

URLs use only eight-bit octets and there is no assumed encoding for Unicode characters

RFC 3986 is the current standard for Uniform Resource Identifiers (URIs). Section 2, Characters, explains that the only characters allowed in a URL are the reserved delimiters !, #, $, &, ', (, ), *, +, ,, /, :, ;, =, ?, @, [, ] together with the unreserved characters, which can be used to build identifiers and which match the regex pattern [0-9A-Za-z._~-].
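
As a rough illustration of the unreserved set, the pattern can be used to pick out the characters of a slug that cannot appear literally in a URL. This is just a sketch: the slug is the one from the example above, and $unreserved and @needs_escaping are names I made up.

```perl
use strict;
use warnings;

# The unreserved set from RFC 3986, as a one-character pattern
my $unreserved = qr/[0-9A-Za-z._~-]/;

# Character string with i-diaeresis (U+00EF) and e-acute (U+00E9)
my $slug = "\x{EF}s-th\x{E9}voz-b070838";

# Collect every character that falls outside the unreserved set
my @needs_escaping = grep { !/\A$unreserved\z/ } split //, $slug;

# Report the offending characters as code points
printf "U+%04X\n", ord for @needs_escaping;   # U+00EF, then U+00E9
```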

You can get around this restriction by using the percent sign % followed by two hex digits to represent any octet without its special meaning. However, each escape represents a single octet, not a character, so a multi-byte character needs one escape per octet, and nothing in the URL says which encoding those octets are in.
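
To make the "no implied encoding" point concrete, here is a small sketch that percent-encodes the same character after converting it to octets in two different encodings. The choice of ISO-8859-1 as the second encoding is just for illustration.

```perl
use strict;
use warnings;

use Encode qw/ encode /;
use URI::Escape qw/ uri_escape /;

# One character: e-acute (U+00E9)
my $char = "\x{E9}";

# The same character turned into octets in two encodings,
# then percent-encoded octet by octet
my $as_utf8   = uri_escape( encode('UTF-8',      $char) );
my $as_latin1 = uri_escape( encode('ISO-8859-1', $char) );

print "$as_utf8\n";     # %C3%A9
print "$as_latin1\n";   # %E9
```

Both results are valid percent-encoded URL text, and the server has to know which one to expect.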

If you are using URI::Escape then uri_escape_utf8 will correctly encode any character string as UTF-8 and return a combination of unreserved and percent-encoded characters, but the server must be expecting a UTF-8-encoded URL.

The most likely problems are

  • Your original string is already encoded and contains encoded bytes instead of characters, so uri_escape_utf8 is encoding an encoded string

  • The LinkedIn API doesn't expect UTF-8-encoded URLs
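
If the first of those is the cause, there are two ways I would try to avoid the double encoding. The sketch below assumes the slug reached the program as raw UTF-8 bytes (for example from a file or the command line read without a decoding layer); $raw and the other variable names are hypothetical.

```perl
use strict;
use warnings;

use Encode qw/ decode /;
use URI::Escape qw/ uri_escape uri_escape_utf8 /;

# Assumed input: UTF-8 bytes rather than characters
my $raw = "\xC3\xAFs-th\xC3\xA9voz-b070838";

# Option 1: decode the bytes to characters once, then let
# uri_escape_utf8 perform the single UTF-8 encoding step
my $via_decode = uri_escape_utf8( decode('UTF-8', $raw) );

# Option 2: the bytes are already UTF-8, so percent-encode them
# as they are with plain uri_escape
my $via_bytes = uri_escape($raw);

print "$via_decode\n";   # %C3%AFs-th%C3%A9voz-b070838
print "$via_bytes\n";    # %C3%AFs-th%C3%A9voz-b070838
```

Decoding at the point of input is usually the tidier choice, because the rest of the program can then work with characters throughout.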