Glen Solsberry Glen Solsberry - 1 year ago 45
Perl Question

How do I determine the longest similar portion of several strings?

As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.

Example:


  • file:///home/gms8994/Music/t.A.T.u./

  • file:///home/gms8994/Music/nina%20sky/

  • file:///home/gms8994/Music/A%20Perfect%20Circle/



Ideally, I'd get back
file:///home/gms8994/Music/
, because that's the longest portion that's common for all 3 strings.

Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.

From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.

Answer Source

Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and @str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.

Perl can be fast:

use strict;
use warnings;

package LCP;

sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i          = 0;
    my $first      = shift;
    my $min_length = length($first);
    foreach (@_) {
    	$min_length = length($_) if length($_) < $min_length;
    }
INDEX: foreach my $ch ( split //, $first ) {
    	last INDEX unless $i < $min_length;
    	foreach my $string (@_) {
    		last INDEX if substr($string, $i, 1) ne $ch;
    	}
    }
    continue { $i++ }
    return substr $first, 0, $i;
}

# Roy's implementation
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while (! /^\Q$prefix\E/);
        }
    return $prefix;
}

1;

Test suite:

#!/usr/bin/env perl

use strict;
use warnings;

Test::LCP->runtests;

package Test::LCP;

use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);

sub test_use : Test(startup => 1) {
    use_ok('LCP');
}

sub test_lcp : Test(6) {
    is( LCP::LCP(),      '',    'Without parameters' );
    is( LCP::LCP('abc'), 'abc', 'One parameter' );
    is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
    is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
    	'abcd', 'Some common prefix' );
    my @str = map { chomp; $_ } <DATA>;
    is( LCP::LCP(@str),
    	'file:///home/gms8994/Music/', 'Test data prefix' );
    is( LCP::LCP2(@str),
    	'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
    my $t = countit( 1, sub{LCP::LCP(@str)} );
    diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
    $t = countit( 1, sub{LCP::LCP2(@str)} );
    diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}

__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/

Test suite result:

1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 16746.73/s (n=17919)

That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download