Binarus - 7 months ago
Perl Question

Perl questions regarding unpack() and the v flag in printf()

I am trying to accomplish the following:

For an arbitrary Perl string (whether or not it is internally encoded in UTF-8, and whether or not it has the UTF-8 flag set), scan the string from left to right, and for every character, print the Unicode code point for that character in hex format. To make myself absolutely clear: I do not want to print UTF-8 byte sequences or something; I just would like to print the Unicode code point for every character in the string.

At first, I have come up with the following solution:

#!/usr/bin/perl -w

use warnings;
use utf8;
use feature 'unicode_strings';

binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDIN, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');

$Text = "\x{3B1}\x{3C9}";
print $Text."\n";
printf "%vX\n", $Text;

# Prints the following to the console (the console is UTF8):
# αω
# 3B1.3C9


Then I have seen some examples, but without reasonable explanations, which made me doubt that my solution is correct, and now I have got questions regarding my own solution as well as the examples.

1) Perl's documentation about the v flag in (...)printf says:

"This flag tells Perl to interpret the supplied string as a vector of integers, one for each character in the string. [...]"

It does not say what it exactly means by "a vector of integers", though. When looking at the output of my example, it seems that those integers are the Unicode code points, but I would like to have this confirmed by somebody who knows for sure.

Hence the question:

1) Can we be sure that every integer which is pulled from the string that way is the respective character's Unicode code point (and not some other byte sequence)?

Secondly, regarding an example which I have found (slightly modified; I can't remember where I got it from, maybe from the Perl docs):

#!/usr/bin/perl -w

use warnings;
use utf8;
use feature 'unicode_strings';

binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDIN, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');

$Text = "\x{3B1}\x{3C9}";
print $Text."\n";
printf "%vX\n", $Text for unpack('C0A*', $Text);

# Prints the following to the console (the console is UTF8):
# αω
# 3B1.3C9


Being a C and assembly guy, I just don't get why somebody would write the printf statement as shown in the example. According to my understanding, the respective line is syntactically equivalent to:

for $_ (unpack('C0A*', $Text)) {
printf "%vX\n", $Text;
}


As far as I have understood, unpack() takes $Text, unpacks it (whatever that means in detail) and returns a list which in this case has one element, namely the unpacked string. Then $_ runs through that list with one element (without being used anywhere), hence the block (i.e. the printf()) is executed once. In summary, the only action which is done by the above snippet is executing printf "%vX\n", $Text; one time.

Hence the question:

2) What could be the reason for wrapping this into a for loop like shown in the example?

Final questions:

3) If the answer to question 1) is "yes", why do most examples I have seen use unpack() after all?

4) In the three-line snippet above, the parentheses which surround the unpack() call are necessary (omitting them leads to syntax errors). In contrast, in the example, the unpack() call does not need to be enclosed in parentheses (though adding them does no harm). Could anybody explain the reason?

Edit / Update in reply to ikegami's answer below:

Of course, I know that strings are sequences of integers. But

a) There are many different encodings for those integers, and the bytes which are in a certain string's memory area depend on the encoding, i.e. if I have two strings which contain exactly the same character sequence, but I store them in memory using different encodings, the byte sequences at the strings' memory locations are different.

b) I strongly suppose that (besides Unicode) there are many other systems / standards which map characters to integers / code points. For example, the Unicode code point 0x3B1 is the Greek letter α, but in some other system, it may be the German letter Ö.

Under these circumstances, the question makes perfect sense IMHO, but I possibly should be more precise and reword it:

If I have a string $Text which only contains characters which are Unicode code points, and if I then execute printf "%vX\n", $Text; will it print the Unicode code point in hex for every character under all circumstances, notably (but not limited to):


  • regardless of Perl's actual internal encoding of the string

  • regardless of the string's UTF-8 flag

  • whether or not use feature 'unicode_strings' is active



If the answer is yes, what sense do all the examples make which are using unpack(), notably the example above? By the way, I now have remembered where I got that one from: the original form is in Perl's pack() documentation, in the section about the C0 and U0 mode. Since they are using unpack(), there must be a good reason for doing so.

Answer
sprintf('%vX', $s)

is equivalent to

join('.', map { sprintf('%X', ord($_)) } split(//, $s))
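To see the equivalence for yourself, here is a minimal sketch. The helper name hex_code_points is mine (hypothetical, not a built-in); it just wraps the join/map expression above so both forms can be compared on the string from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: spells out what '%vX' does, one ord() per character.
sub hex_code_points {
    my ($s) = @_;
    return join('.', map { sprintf('%X', ord($_)) } split(//, $s));
}

my $text = "\x{3B1}\x{3C9}";   # the two characters from the question

print sprintf('%vX', $text), "\n";     # 3B1.3C9
print hex_code_points($text), "\n";    # 3B1.3C9
```

Both lines print only ASCII hex digits, so no output layer is needed for this check.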

It does not say what it exactly means by "a vector of integers"

A string is a vector of integers. Each character (element) of a string is an integer.

# A vector of integers stored in a string.
my $integers = join '', map chr, 1, 100, 1000;

# The same vector of integers stored in an array.
# (The parentheses are required; without them only 1 is assigned.)
my @integers = (1, 100, 1000);

# Grab the integer at index $i from the string.
my $from_string = ord(substr($integers, $i, 1));

# Grab the integer at index $i from the array.
my $from_array = $integers[$i];

Coming from a C and assembler background, this should be natural for you. One difference is that C strings have 8-bit characters, while Perl strings have 32-bit or 64-bit characters.
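A small sketch of that difference, using only core Perl (the variable names are mine). It also touches on the internal-encoding concern from the question: utf8::upgrade changes how the string is stored, not which integers it contains.

```perl
use strict;
use warnings;

# One-character strings whose single element is the given integer.
my $small = chr(0xE9);
my $large = chr(0x1F600);   # far beyond what fits in 8 bits

printf "%X %X\n", ord($small), ord($large);   # E9 1F600

# Switch a copy to the other internal storage format.
my $copy = $small;
utf8::upgrade($copy);

# The two strings still compare equal, and '%vX' sees the same integer.
print(($copy eq $small ? 'same' : 'different'), "\n");   # same
printf "%vX\n", $copy;                                   # E9
```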

Can we be sure that every integer which is pulled from the string that way is the respective character's Unicode code point (and not some other byte sequence)?

The question doesn't make sense.

A character (string element) doesn't have a Unicode code point (UCP).

A character can be a UCP, but it can also be any other kind of integer. It's all in how you use the string.

$s =~ /\w/        # $s is expected to contain UCPs.
decode_utf8($s)   # $s is expected to contain UTF-8 bytes.
unpack('N', $s)   # $s is expected to contain the bytes of a packed BE uint32.

In this case, if each character of the string is a UCP, then sprintf '%vX' will print those UCPs in hex.
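As a rough illustration of "it's all in how you use the string", the sketch below feeds '%vX' the same text twice: once as a string of code points, once as the UTF-8 bytes produced by the core Encode module. The variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "\x{3B1}\x{3C9}";          # two elements: the code points themselves
my $bytes = encode('UTF-8', $chars);   # four elements: one per UTF-8 byte

printf "%d %d\n", length($chars), length($bytes);   # 2 4
printf "%vX\n", $chars;                             # 3B1.3C9
printf "%vX\n", $bytes;                             # CE.B1.CF.89
```

In both cases '%vX' simply prints the integers it finds; whether those are code points depends on what you put into the string.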

I just don't get why somebody would write the printf statement like shown in the example.

Neither do I. for can be used as a topicalizer, meaning

for ($s) {
   s/^\s+//;
   s/\s+\z//;
}

is equivalent to

$s =~ s/^\s+//;
$s =~ s/\s+\z//;

But it's not used that way here.
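For comparison, here is a sketch of the same loop with $_ passed to printf, so that the topic is actually used. I am assuming this is closer to what the pack() documentation's C0/U0 example intends; check that section rather than taking my word for it. Expected output is deliberately left out, since what the U0 line yields is exactly what that documentation section explains.

```perl
use strict;
use warnings;

my $text = "\x{3B1}\x{3C9}";

# unpack returns a one-element list here; 'for' aliases $_ to that element,
# so printf formats the unpacked value rather than the original variable.
printf "%vX\n", $_ for unpack('C0A*', $text);   # template starts in C0 (character) mode
printf "%vX\n", $_ for unpack('U0A*', $text);   # template starts in U0 (UTF-8 byte) mode
```

Written this way, the loop matters: the value being printed is whatever unpack() produced, which can differ from $text depending on the mode the template selects.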

In the three line snippet above, the parentheses which surround the unpack() are necessary (leaving them away leads to syntax errors). In contrast, in the example, the unpack() does not need to be enclosed in parentheses

You mention you come from a C background. Perl is just like C in this respect. Specifically,

  • The conditional or loop expression of flow control statements must be in parens. In Perl, the syntax for a foreach loop is for (EXPR) BLOCK [ continue BLOCK ].
  • A STATEMENT can be an EXPR.

For example,

while (f()) { }   # Allowed in C and Perl.
while f() { }     # Not allowed in C or Perl.

f();              # Allowed in C and Perl.
(((((f())))));    # Allowed in C and Perl.
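The one-line form in the question is a statement modifier (STATEMENT for LIST in perlsyn), which has no C counterpart and takes its list without parentheses. A minimal sketch with made-up data, contrasting it with the block form:

```perl
use strict;
use warnings;

my @words = ('alpha', 'omega');
my @out;

# Block form: the list must be in parentheses.
for my $w (@words) {
    push @out, uc($w);
}

# Statement-modifier form: parentheses around the list are optional.
push @out, lc($_) for @words;
push @out, length($_) for (@words);

print "@out\n";   # ALPHA OMEGA alpha omega 5 5
```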