user3202825 - 6 months ago
Perl Question

How to retrieve LinkedIn profiles via the API for URLs with accented characters?

I'm trying to get profile information from the LinkedIn API, but I run into problems when the profile URLs contain accented characters.

For non-accented URLs the API call works fine and I can retrieve data without problems, but with accented URLs I get an error.

I have tried escaping the URL in several ways, but none of them works:

uri_escape_utf8:

'https://api.linkedin.com/v1/people/url=' . uri_escape_utf8('xxxxx');


uri_escape:

'https://api.linkedin.com/v1/people/url=' . uri_escape('xxxxx');


no escaping:

'https://api.linkedin.com/v1/people/url=xxxxx';


double escape:

uri_escape_utf8('https://api.linkedin.com/v1/people/url=' . uri_escape_utf8('xxxxx'));

Answer

Update

I'm pretty sure the problem is that you don't have use utf8 at the top of your program. This code correctly encodes the i-diaeresis as %C3%AF and the e-acute as %C3%A9

use utf8;
use strict;
use warnings qw/ all FATAL /;
use feature 'say';

use URI::Escape qw/ uri_escape_utf8 /;

say uri_escape_utf8('http://linkedin.com/in/anaïs-thévoz-b070838');

output

http%3A%2F%2Flinkedin.com%2Fin%2Fana%C3%AFs-th%C3%A9voz-b070838

Without use utf8, Perl sees the string literal as UTF-8-encoded bytes instead of characters, like this

"http://linkedin.com/in/ana\xC3\xAFs-th\xC3\xA9voz-b070838"

and uri_escape_utf8 double-encodes "\xC3\xAF" as %C3%83%C2%AF and "\xC3\xA9" as %C3%83%C2%A9 like this

output

http%3A%2F%2Flinkedin.com%2Fin%2Fana%C3%83%C2%AFs-th%C3%83%C2%A9voz-b070838

so the LinkedIn server gets confused
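The difference can be reproduced without depending on how the source file is saved. The following is a minimal sketch that uses \x escapes instead of literal accented characters; the variable names are only illustrative, and the shortened path is an assumption for brevity.

```perl
use strict;
use warnings;

use Encode qw/ encode /;
use URI::Escape qw/ uri_escape uri_escape_utf8 /;

# Character string: \x{EF} is i-diaeresis, \x{E9} is e-acute
my $chars = "ana\x{EF}s-th\x{E9}voz";

# The same text as UTF-8-encoded bytes, which is what Perl holds
# when the literal is read without "use utf8"
my $bytes = encode('UTF-8', $chars);

# Characters in: each accented character is encoded once
print uri_escape_utf8($chars), "\n";    # ana%C3%AFs-th%C3%A9voz

# Bytes in: every byte is treated as a character and encoded again
print uri_escape_utf8($bytes), "\n";    # ana%C3%83%C2%AFs-th%C3%83%C2%A9voz

# Plain uri_escape on the bytes escapes them as they are
print uri_escape($bytes), "\n";         # ana%C3%AFs-th%C3%A9voz
```

So if the string really is bytes, plain uri_escape should give the same result as uri_escape_utf8 on the character string.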



URLs use only eight-bit octets and there is no assumed encoding for Unicode characters

RFC 3986 is the current standard for Uniform Resource Identifiers (URIs), and Section 2 -- Characters -- explains that the only characters allowed in a URL are the special delimiters !, #, $, &, ', (, ), *, +, ,, /, :, ;, =, ?, @, [, ] in addition to the unreserved characters that can be used to build identifiers which match the regex pattern [0-9A-Za-z._~-]

You can get around this restriction by using the percent sign % followed by two hex digits to represent any octet without its special meaning. However, this escapes single octets, not characters: a multi-byte character has to be turned into octets first, and there is no implied encoding for doing so within a URL.
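To illustrate why the encoding matters, here is a small sketch that percent-encodes the same character under two different encodings. The helper name percent_encode_octets is hypothetical and written by hand here only to make the octets visible.

```perl
use strict;
use warnings;

use Encode qw/ encode /;

# Hypothetical helper: encode a character string into octets using the
# given encoding, then percent-encode every octet
sub percent_encode_octets {
    my ($encoding, $chars) = @_;
    my $octets = encode($encoding, $chars);
    return join '', map { sprintf('%%%02X', ord($_)) } split //, $octets;
}

my $i_diaeresis = "\x{EF}";

print percent_encode_octets('UTF-8',      $i_diaeresis), "\n";   # %C3%AF
print percent_encode_octets('ISO-8859-1', $i_diaeresis), "\n";   # %EF
```

Both results are valid percent-encoded octets, so the server has to know which encoding the client used in order to recover the character.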

If you are using URI::Escape then uri_escape_utf8 will encode any character string as UTF-8 and return it as a combination of unreserved and percent-encoded characters, but the server must be expecting a UTF-8-encoded URL.

The most likely problems are

  • Your original string is already encoded and contains encoded bytes instead of characters, so uri_escape_utf8 is encoding an encoded string

  • The LinkedIn API doesn't expect UTF-8-encoded URLs
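For the first case, one possible approach is to decode the bytes into characters before escaping. This is only a sketch: it assumes the profile URL arrives as UTF-8 bytes (for example from a file or database read without an encoding layer), the helper name profile_api_url is hypothetical, and the endpoint is the one from the question. It does nothing for the second case.

```perl
use strict;
use warnings;

use Encode qw/ decode /;
use URI::Escape qw/ uri_escape_utf8 /;

# Hypothetical helper: takes a profile URL as raw UTF-8 bytes and
# returns the API URL with the profile URL escaped exactly once
sub profile_api_url {
    my ($profile_bytes) = @_;

    # Turn bytes into characters; die if the input is not valid UTF-8
    my $profile_chars = decode(
        'UTF-8', $profile_bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC()
    );

    return 'https://api.linkedin.com/v1/people/url='
        . uri_escape_utf8($profile_chars);
}

# Input given as bytes, as in the double-encoding example
print profile_api_url("http://linkedin.com/in/ana\xC3\xAFs-th\xC3\xA9voz-b070838"), "\n";
# https://api.linkedin.com/v1/people/url=http%3A%2F%2Flinkedin.com%2Fin%2Fana%C3%AFs-th%C3%A9voz-b070838
```

If the string is already a character string (for example a literal in a file with use utf8), the decode step should be skipped, otherwise it will fail or mangle the text.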