user1769925 - 1 month ago
Perl Question

Perl "out of memory" with large text file

I have a problem with the following code under the latest release of Strawberry Perl for Windows. I want to read in all text files in a directory and process their contents. I don't currently see a way to process them line by line, because some of the changes I want to make span newlines. The processing mostly involves removing large chunks of the files. (My example code below uses just one regex, but ideally I would run a couple of similar regexes that each cut material out of the file.)

I am running this script on a large number of files (>10,000), and it always fails with an "Out of memory!" message on one particular file that is larger than 400 MB. However, when I write a program that processes ONLY that one file, the code works fine.

The machine has 8 GB RAM, so I would think that physical RAM is not the issue.

I read through other posts on memory issues, but did not find anything that would help me achieve my goal.

Can anyone suggest what I would need to change to make the program work, i.e., make it more memory efficient or somehow sidestep the issue?

use strict;
use warnings;
use Path::Iterator::Rule;
use utf8;

use open ':std', ':encoding(utf-8)';

my $doc_rule = Path::Iterator::Rule->new;
$doc_rule->name('*.txt'); # only process text files
$doc_rule->max_depth(3); # don't recurse deeper than 3 levels
my $doc_it = $doc_rule->iter("C:\\Temp\\"); # backslashes must be escaped inside double quotes
while ( my $file = $doc_it->() ) { # go through all documents found
    print "Stripping $file\n";

    # read in file
    open (FH, "<", $file) or die "Can't open $file for read: $!";
    my @lines;
    while (<FH>) { push (@lines, $_) }; # slurp entire file
    close FH or die "Cannot close $file: $!";

    my $lines = join("", @lines); # put entire file into one string

    $lines =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//gs; #perform the processing

    # write out file
    open (FH, ">", $file) or die "Can't open $file for write: $!";
    print FH $lines; # dump entire file
    close FH or die "Cannot close $file: $!";
}

Answer

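Handle the file line by line. Your script holds each file in memory at least twice (once in @lines and once in the joined $lines string), which is a lot for a file of more than 400 MB. The block you remove starts and ends on recognizable lines, so you can stream the input to a temporary file instead: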

while ( my $file = $doc_it->() ) { # go through all documents found
    print "Stripping $file\n";

    open (my $infh, "<", $file) or die "Can't open $file for read: $!";
    open (my $outfh, ">", $file . ".tmp") or die "Can't open $file.tmp for write: $!";

    while (<$infh>) {
       if ( /<DOCUMENT>/ ) {
           # append the next line to test for TYPE
           # (the "// ''" avoids a warning if <DOCUMENT> is the last line)
           $_ .= <$infh> // '';
           if (/<TYPE>EX-/) {
              # document type is excluded, now loop through 
              # $infh until the closing tag is found.
              while (<$infh>) { last if m|</DOCUMENT>|; }

              # jump back to the <$infh> loop to resume
              # processing on the next line after </DOCUMENT>
              next;
           }
           # if we've made it this far, the document was not excluded
           # fall through to print both lines
       }
       print $outfh $_;
    }

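    close $outfh or die "Cannot close $file.tmp: $!";
    close $infh or die "Cannot close $file: $!";
    # remove the original first, then move the stripped copy into its place
    unlink $file or die "Cannot remove $file: $!";
    rename "$file.tmp", $file or die "Cannot rename $file.tmp to $file: $!";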
}
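
If you do want to keep running multi-line regexes like the one in the question, one possible variation is to read the file one document block at a time by setting the input record separator $/ to the closing tag. The following is only a minimal sketch under some assumptions: strip_exhibits is a hypothetical helper name (it is not part of the code above), the files are UTF-8, and "</DOCUMENT>" only ever appears as the closing tag of a block. Memory use should then be bounded by roughly the largest single block rather than by the whole file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): copy $in_path to
# $out_path, dropping every <DOCUMENT> block whose <TYPE> starts with "EX-".
sub strip_exhibits {
    my ($in_path, $out_path) = @_;

    open(my $in, '<:encoding(UTF-8)', $in_path)
        or die "Can't open $in_path for read: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "Can't open $out_path for write: $!";

    # Read one record per closing tag instead of one record per newline.
    # Assumption: "</DOCUMENT>" occurs only as the end of a block.
    local $/ = "</DOCUMENT>";

    while (my $record = <$in>) {
        # Same pattern as in the question, applied to a single record.
        $record =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//gs;
        print {$out} $record;
    }

    close($out) or die "Cannot close $out_path: $!";
    close($in)  or die "Cannot close $in_path: $!";
    return;
}
```

Inside the directory loop this could be called as strip_exhibits($file, "$file.tmp"), followed by the same unlink and rename steps as above. Further regexes can be added inside the while loop, as long as none of them needs to match across a closing </DOCUMENT> tag.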