bms9nmh bms9nmh - 3 months ago 11
Perl Question

Incorporate year matching sub-routine into script and apply condition to result

I have a script which reads a csv file line by line and compares the title in field 2 to the titles in another csv file. If 5 or more words match, then it prints out the lines from each file which match this criterion. Here is the script:

#!/bin/perl

#subroutine for discovering year

sub find_year {
my( $str ) = @_;
my ($year) = $str =~ /\b((?:19|20)\d\d)\b/; # undef when the title has no year
return $year;
}

#####CREATE CSV2 DATA

my @csv2 = ();

open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
my @csv2years;

for ( @csv2 ) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/define the data which is the title
$csv2hash{$_} = $title; # Indicate that title data will input into csv2hash.
}

###### CREATE CSV1 DATA

open CSV1, "<csv1" or die;

while (<CSV1>) {
chomp; #removes new lines

my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ creates variable of title
my %words;

$words{$_}++ for split /\s+/, $title; #/ get words

## Collect unique words into an array- the @ means an array

my @titlewords = keys(%words);

# Add exception words which shouldn't be matched.

my @new;
foreach my $t (@titlewords){
push(@new, $t) if $t !~ /^(rare|vol|volume|issue|double|magazine|mag)$/i;
}


###### The comparison algorithm

@titlewords = @new;

my $desired = 5; # Desired matching number of words
my $matched = 0;

foreach my $csv2 (keys %csv2hash) {
my $count = 0;
my $value = $csv2hash{$csv2};

foreach my $word (@titlewords) {
my @matches = ( $value=~/\b$word\b/ig );
my $numIncsv2 = scalar(@matches);

@matches = ( $title=~/\b$word\b/ig );

my $numIncsv1 = scalar(@matches);

++$count if $value =~ /\b$word\b/i;

if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
$count = $desired+1;
last;
}
}

if ($count >= $desired) {
print "$csv2\n";
++$matched;
}
}
print "$_\n\n" if $matched;
}


As you can see, I've created a find_year subroutine which can be used to discover whether the title contains a year in the 20th or 21st century (19xx or 20xx). A few days ago I asked a question about assigning a result to a set of conditions which involve matching a year, and Borodin provided a great answer here.

Perl- What function am I looking for? Assigning multiple rules to a specified outcome

I want the same conditions to apply now, only this time the script will be comparing dates in the titles of the csv files rather than standard input and a data list (as in the previous question).

What I now want to do is include this logic as a function in my word matching script, so that if the conditions from my previous question give a Pass, the script performs the word matching (i.e. 5 words match). If they give a Fail, it skips comparing the lines and moves on to the next one (i.e. doesn't bother with the 5 matching word element of the script). The Pass and Fail results don't have to be printed out; I am just using these words to describe the rules of the year comparison condition in my previous question.

example for csv1:

14564564,1987 the door to the other doors,546456,47878787
456456445,Mullholland Drive is the best film ever 1959,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545
454654564, 1939 hello good world you are great ,45456456, 54564654


example for csv2:

154465454,the other door was the door to 1949,546456,478787870
2156485754,Mullholland Drive is the best film ever 1939,45454545,45454545
87894454,Twin Peaks forget that stuff 1984,45454545,45454545
2145678787, 1939 good lord you are great ,787425454,878777874


Current result before the year-matching subroutine is incorporated:

2156485754,Mullholland Drive is the best film ever 1939,45454545,45454545
456456445,Mullholland Drive is the best film ever 1959,45454545,45454545

87894454,Twin Peaks forget that stuff 1984,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545

2145678787, 1939 good lord you are great ,787425454,878777874
454654564, 1939 hello good world you are great ,45456456, 54564654


Desired result after the year-matching subroutine is incorporated:

87894454,Twin Peaks forget that stuff 1984,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545

2145678787, 1939 good lord you are great ,787425454,878777874
454654564, 1939 hello good world you are great ,45456456, 54564654


I can get my head around Borodin's answer to my previous question, but as the script I'm working on is difficult to read (in my noob opinion, anyway!), I'm having trouble working out how to incorporate this new function into it.

Answer

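I reviewed the algorithm and replaced the repeated loops over csv2 with a hash keyed by word, where each entry holds the csv2 row numbers that contain that word. A separate preliminary check of the years is no longer required, because the year comparison is part of the final condition.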
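The word-to-rows hash is easiest to see on a toy example. The following is only a minimal sketch with two in-memory titles standing in for csv2; the names %index and shared_words are made up for illustration and do not appear in the script below.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the csv2 titles (array index = row number).
my @titles = (
    'the other door was the door to 1949',
    'Twin Peaks forget that stuff 1984',
);

# Build the index: $index{word}{row} = occurrences of that word in that row.
my %index;
for my $row (0 .. $#titles) {
    $index{ lc $_ }{$row}++ for split ' ', $titles[$row];
}

# For a query title, count how many of its distinct words each row shares.
sub shared_words {
    my ($title) = @_;
    my (%seen, %rows);
    for my $word (grep { !$seen{$_}++ } map { lc($_) } split ' ', $title) {
        next unless $index{$word};               # word not present in any row
        $rows{$_}++ for keys %{ $index{$word} };
    }
    return \%rows;
}

my $rows = shared_words('The Twin Peaks forget that stuff');
print "row $_ shares $rows->{$_} word(s)\n" for sort { $a <=> $b } keys %$rows;
```

Each csv1 title then costs one hash lookup per word instead of a regex scan over every csv2 row.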

#!/usr/bin/perl
#use Data::Dumper;
#####CREATE CSV2 DATA
open CSV2, "<csv2" or die "Cannot open csv2: $!";
my @csv2=<CSV2>;
close CSV2;
my %words2; # $words2{lower_case_word}->{csv2_row_number}->word_count
my $i=0; # csv2 row number
my %c2year; # Years of csv2 row numbers
for(@csv2) {
   chomp;
   for(split /\s+/,(split /,\s*/)[1]) { # list words in title
    $words2{lc($_)}{$i}++;
    $c2year{$i}=$_ if(/^(19|20)\d\d$/);
   }
   $i++;
}
#print Dumper(\%words2);

###### READ CSV1 DATA
my $desired = 5;      # Desired matching number of words

open CSV1, "<csv1" or die "Cannot open csv1: $!";
while (<CSV1>) {
   chomp;       #removes new lines
   my %rows=(); # $rows{csv2_row_number} => number_of_matched_words_in_row
   my $matched = 0;
   my ($title) = (split /,\s*/)[1]; #/ creates variable of title
   my %words;
   my $year=0;
####### get words and filter it
   $words{lc($_)}++ for
       grep {
         $year=$_ if(/^(19|20)\d\d$/); # Years present in word list
         !/^(rare|vol|volume|issue|double|magazine|mag)$/i
       }
       split /\s+/, $title; #/
###### The comparison algorithm
   for(keys(%words)) {
    # my $word=$_; # <-- if need count words
    if($words2{$_}) {
     for(keys(%{$words2{$_}})) {
      $rows{$_}++; # <-- OR $rows{$_}+=$words{$word} OR/AND +=$words2{$word}{$_}
     }
    }
   }
#    print Dumper(\%rows);
   for(keys(%rows)) {
      if ( ($rows{$_} >= $desired)
          && (!$year || !$c2year{$_} || $year==$c2year{$_} )
         ) {
        print "$year<=>$c2year{$_} csv2: ",$csv2[$_],"\n";
        ++$matched;
      }
   }
 print "csv1: $_\n\n" if $matched;
}

Uncomment use Data::Dumper and the print Dumper(...) lines to inspect the hashes.
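If you would rather keep the year rule in a subroutine of its own, as the question asks, something along these lines might work. It is a sketch under an assumption: the rule is taken to be the one used in the condition above, i.e. a pair fails only when both titles contain a year and the years differ. The name years_compatible is hypothetical. Note also that find_year matches a year anywhere in the title, whereas the script above only treats a whole whitespace-separated word as a year.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first 19xx/20xx year found in a title, or undef if there is none.
sub find_year {
    my ($str) = @_;
    my ($year) = $str =~ /\b((?:19|20)\d\d)\b/;
    return $year;
}

# Hypothetical helper: true unless both titles have a year and the years differ.
sub years_compatible {
    my ($title1, $title2) = @_;
    my ($y1, $y2) = (find_year($title1), find_year($title2));
    return 1 unless defined $y1 && defined $y2;   # a missing year never fails
    return $y1 == $y2 ? 1 : 0;
}

# Title pairs taken from the example data in the question (csv1, csv2).
for my $pair (
    [ 'Mullholland Drive is the best film ever 1959',
      'Mullholland Drive is the best film ever 1939' ],
    [ 'The Twin Peaks forget that stuff',
      'Twin Peaks forget that stuff 1984' ],
    [ '1939 hello good world you are great',
      '1939 good lord you are great' ],
) {
    my $verdict = years_compatible(@$pair) ? 'pass' : 'fail';
    print "$verdict: $pair->[0] | $pair->[1]\n";
}
```

With these sample titles the Mullholland Drive pair fails and the other two pass, which matches the desired result in the question. In the word matching script the call would go before the word counting, so that a failing pair is skipped without doing any comparison.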
