John McLane John McLane -4 years ago 151
Perl Question

Sort hosts file by TLD domain, not by suffix (.com.whatever) or prefix (subdomains)

I need to sort a super huge half-mil entries list of domain names in a host file. But I want to sort them not by subdomains or by the .com .net.

So this:

www.apple3.com
this.apple4.com
that.apple1.uk
and.that.apple2.com.br


Will turn into this:

that.apple1.uk
and.that.apple2.com.br
www.apple3.com
this.apple4.com


No ips yet on this list (as all will resolve to 0.0.0.0, I'll add that later) just the domains are in the list. I'm using a Mac with brew, so I can use any GNU or nix tool for it. Thanks in advance!

Answer Source

The key to efficient sorting in Perl is to create a function that transforms the value to sort into a representative string that can be sorted lexicographically.

For example, if you want to sort dates, you could transform them into the yyyymmdd format. In this case, we'll reorder the parts of the domain such that

foo.bar.apple1.co.uk

becomes

# The sorted order of apple1.com and apple1.net will vary.
apple1.bar.foo

or

# The sorted order of apple1.com and apple1.net will be consistent.
apple1.bar.foo<NUL>uk.co

We want the numbers to be sort naturally, (1, 2, 10 rather than 1, 10, 2). Our key function could handle that, but we'll take the easy way out by replacing sort with natkeysort from Sort::Key::Natural. As a bonus, natkeysort allows us to easily integrate our key function!

The hard part is to identify the suffix. There's no rule for that, just constantly changing definitions. As such, we'll use a module to identify the suffix.


Implementing the key function using Domain::PublicSuffix:

use feature qw( state );

use Domain::PublicSuffix qw( );

sub get_sort_key {
   my ($host) = @_;
   $host =~ s/\.\z//;
   state $dps = Domain::PublicSuffix->new();
   $dps->get_root_domain($host)
      or die "$host: ".$dps->error();

   my @name   = split /\./, substr($host, 0, -length($dps->suffix())-1);
   my @suffix = split /\./, $dps->suffix();
   return join('.', reverse @name)."\0".join('.', reverse @suffix);
}

Implement the key function using IO::Socket::SSL::PublicSuffix:

use feature qw( state );

use IO::Socket::SSL::PublicSuffix qw( );

sub get_sort_key {
   my ($host) = @_;
   my @host = split(/\./, $host);
   state $ps = IO::Socket::SSL::PublicSuffix->default();
   my ($name, $suffix) = $ps->public_suffix(\@host);
   return join('.', reverse @$name)."\0".join('.', reverse @$suffix);
}

The above functions are used as follows:

use feature qw( say );

use Sort::Key::Natural qw( natkeysort );

my @hosts = (
   'www.apple3.net',
   'www.apple3.com',
   'this.apple4.com',
   'that.apple4.com',
   'www.apple10.com',
   'that.apple1.uk',
   'and.that.apple2.com.br',
);

my @sorted_hosts = natkeysort { get_sort_key($_) } @hosts;

say for @sorted_hosts;

Output:

that.apple1.uk
and.that.apple2.com.br
www.apple3.com
www.apple3.net
that.apple4.com
this.apple4.com
www.apple10.com

IO::Socket::SSL::PublicSuffix is supposedly updated more often than Domain::PublicSuffix (and Mozilla::PublicSuffix), but it's part of a larger distro.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download