Christopher Wirt - 4 years ago
Perl Question

A script to strip ranges of UTF-8 Characters out of a file

My problem is that I have a data file containing UTF-8, most of which is valid and must be kept, but some of which is random "garbage" UTF-8, namely characters whose first byte is in the range 0xf0 - 0xff. An example of the hex for the bad data can be seen below:

f4 80 80 ab f4 80 80 b6 f4 80 80
a5 f4 80 80 a6 f4 80 80 83 f4 80 80 b6 f4 80 81
84 f4 80 81 98 f4 80 81 87 f4 80 81 8c f4

I'm trying to write a Perl script that will search for and replace characters whose first byte is in the range 0xf0 - 0xff. On this website the codepage is listed as private use.

My existing attempts either do nothing, or have only been able to remove the first byte of a multi-byte character, such as

perl -CSD -pi.orig -e 's/[\x{f4}-\x{ff}]/?/g'

Running perl v5.12.5

I'm not much of a perl expert, nor a utf-8 expert. I'm also open to doing this in ruby/python/C++(98)/whatever as long as it's relatively portable on a linux box.

Here's a link to a snippet of the garbage data.

Answer Source

OK, let's untangle a few things first.

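UTF-8 characters whose first byte is in the range 0xf0-0xf7 are four bytes long. In the original UTF-8 design, first bytes 0xf8-0xfb introduced five-byte sequences and 0xfc-0xfd six-byte ones, but RFC 3629 limits UTF-8 to four bytes and to code points up to U+10FFFF, so valid data only ever uses 0xf0-0xf4 as a first byte in this range. The prefix 0xfx doesn't map to any single code page, and not exclusively to the private use areas: the lead byte 0xf0, for instance, also begins ordinary supplementary-plane characters such as emoji.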
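
To make the relationship between the first byte and the sequence length concrete, here is a minimal Python sketch. It assumes the modern definition of UTF-8 (RFC 3629), which caps sequences at four bytes, so first bytes of 0xf5 and above are simply reported as invalid. The function name is made up for illustration.

```python
def utf8_sequence_length(lead_byte):
    """Return the sequence length implied by a UTF-8 lead byte, or None.

    Assumes RFC 3629 rules (1 to 4 bytes); older 5- and 6-byte forms
    are treated as invalid.
    """
    if lead_byte <= 0x7F:
        return 1          # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2          # 110xxxxx (0xC0 and 0xC1 would be overlong)
    if 0xE0 <= lead_byte <= 0xEF:
        return 3          # 1110xxxx
    if 0xF0 <= lead_byte <= 0xF4:
        return 4          # 11110xxx, capped so the result stays <= U+10FFFF
    return None           # continuation byte (0x80-0xBF) or invalid lead


print(utf8_sequence_length(0xF4))  # 4
print(utf8_sequence_length(0xF8))  # None
```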

Characters that take 4 or more bytes to represent in UTF-8 are outside the Basic Multilingual Plane, but that's different from being invalid or private use. It just means their code points are greater than U+FFFF (decimal value 65,535).
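
As a quick check, the first three complete sequences from the hex dump in the question can be decoded to see which code points they actually are. This is only a sketch in Python 3; the variable names are mine, and only the bytes are taken from the question.

```python
# First three complete 4-byte sequences from the question's hex dump.
sample = bytes.fromhex("f4 80 80 ab f4 80 80 b6 f4 80 80 a5")

text = sample.decode("utf-8")             # raises UnicodeDecodeError on invalid input
code_points = [ord(ch) for ch in text]    # one integer per decoded character

for cp in code_points:
    print("U+%06X" % cp, "outside BMP:", cp > 0xFFFF)
```

This prints U+10002B, U+100036 and U+100025, all of them above U+FFFF. They also happen to sit in the supplementary private use range U+100000..U+10FFFF, which is why the data looks like garbage even though it is well-formed UTF-8.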

If you want to exclude characters outside the BMP, you should be searching for those matching the regex [\x{10000}-\x{10FFFF}], which uses Perl's \x{...} escape syntax to specify characters by their hexadecimal code point value.

But that eliminates over 94% of possible Unicode characters. Are you sure that's what you want?
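
The percentage is simple arithmetic on the size of the code space. A short Python check, counting every code point from U+0000 to U+10FFFF (surrogates included) as "possible":

```python
total = 0x10FFFF + 1                    # 1,114,112 code points in all
outside_bmp = 0x10FFFF - 0x10000 + 1    # 1,048,576 code points above U+FFFF

share = outside_bmp / total             # 16 of the 17 planes
print(f"{share:.1%}")  # 94.1%
```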

If you only want to eliminate private use characters - some of which are inside the BMP - just exclude those ranges specifically. With Perl or Python or any other UTF-8-aware language, you don't have to worry about the bytes; just use code points.

As Wikipedia will tell you, the three Private Use Areas are in these code point ranges:

  • U+E000..U+F8FF
  • U+F0000..U+FFFFF
  • U+100000..U+10FFFF

So the corresponding regex is this, using Perl syntax:

[\x{e000}-\x{f8ff}\x{f0000}-\x{fffff}\x{100000}-\x{10ffff}]

You might want to put that in a variable for easier use:

my $pua = qr([\x{e000}-\x{f8ff}\x{f0000}-\x{fffff}\x{100000}-\x{10ffff}]);

Ruby uses \u{...} instead of \x{...}:

pua = %r([\u{e000}-\u{f8ff}\u{f0000}-\u{fffff}\u{100000}-\u{10ffff}])

This Python equivalent works if you have Python 3, or a Python 2 compiled in wide mode:

import re
pua = re.compile(u'[\ue000-\uf8ff\U000f0000-\U000fffff\U00100000-\U0010ffff]')
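
Putting the pieces together, a rough Python 3 sketch of the whole clean-up might look like the following. The helper names strip_private_use and strip_file are hypothetical, and the sketch assumes the input is well-formed UTF-8; a truncated sequence, like the lone f4 at the end of the dump in the question, would raise UnicodeDecodeError unless you pass an errors= argument to open().

```python
import re

# The three Private Use Areas, expressed as code point ranges.
PUA = re.compile("[\ue000-\uf8ff\U000f0000-\U000fffff\U00100000-\U0010ffff]")


def strip_private_use(text, replacement=""):
    """Return text with every private use character replaced."""
    return PUA.sub(replacement, text)


def strip_file(src_path, dst_path, replacement=""):
    """Copy src_path to dst_path, dropping private use characters.

    newline="" keeps the original line endings untouched.
    """
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(strip_private_use(line, replacement))


# Small in-memory demonstration: the accented letter survives,
# the private use characters do not.
cleaned = strip_private_use("ok\U0010002b\U00100036 caf\u00e9 \ue000!")
print(cleaned == "ok caf\u00e9 !")  # True
```

Pass replacement="?" if you would rather mark the removed characters than delete them, as the original s/.../?/g attempt did.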