Jane Doe - 1 year ago
Perl Question

Perl script find and replace not working?

I am trying to create a script in Perl to replace text in all HTML files in a given directory. However, it is not working. Could anyone explain what I'm doing wrong?

my @files = glob "ACM_CCS/*.html";

foreach my $file (@files) {
    open(FILE, $file) || die "File not found";
    my @lines = <FILE>;

    my @newlines;
    foreach (@lines) {
        $_ =~ s/Authors Here/Authors introduced this subject for the first time in this paper./g;
        #$_ =~ s/Authors Elsewhere/Authors introduced this subject in a previous paper./g;
        #$_ =~ s/D4-/D4: Is the supporting evidence described or cited?/g;
        push(@newlines, $_);
    }

    open(FILE, $file) || die "File not found";
    print FILE @newlines;
}

For example, I'd want to replace "D4-" with "D4: Is the...", etc. Thanks, I'd appreciate any tips.

Answer

You are using the two-argument version of open. If $file does not start with "<", ">", or ">>", the file is opened as a read filehandle, and you cannot write to a read filehandle. That is why the print after your second open never changes the file. To solve this, use the three-argument version of open, which states the mode explicitly:

open my $in, "<", $file or die "could not open $file: $!";    # read
open my $out, ">", $file or die "could not open $file: $!";   # write (truncates $file)
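Keep in mind that ">" empties the file as soon as it is opened, so when you read and write the same file, the reading has to be finished first. A minimal sketch of that ordering, assuming the files are small enough to hold in memory (replace_in_file is a hypothetical helper name, not something from the question):

```perl
use strict;
use warnings;

# Hypothetical helper: read a file, apply one substitution, write it back.
# Returns the number of replacements made.
sub replace_in_file {
    my ($file, $from, $to) = @_;

    # Read everything first; the file is still intact at this point.
    open my $in, "<", $file or die "could not open $file for reading: $!";
    my @lines = <$in>;
    close $in;

    # \Q...\E makes $from match as literal text rather than as a regex.
    my $changed = 0;
    for my $line (@lines) {
        $changed += ($line =~ s/\Q$from\E/$to/g) || 0;
    }

    # Only now open for writing: ">" truncates the file immediately.
    open my $out, ">", $file or die "could not open $file for writing: $!";
    print {$out} @lines;
    close $out or die "could not close $file: $!";

    return $changed;
}
```

You could then call it once per file, for example replace_in_file($file, "D4-", "D4: Is the supporting evidence described or cited?") inside the foreach over glob "ACM_CCS/*.html".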

Also note the use of lexical filehandles ($in) instead of bareword filehandles (FILE). Lexical filehandles have several benefits over bareword filehandles:

  1. They are lexically scoped instead of global
  2. They close when they go out of scope instead of at the end of the program
  3. They are easier to pass to functions (i.e., you don't have to use a typeglob reference).

You use them just like you would use a bareword filehandle.
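To illustrate points 2 and 3, here is a small sketch. Both sub names are made up for the example; the only thing it relies on is that a lexical filehandle is an ordinary scalar:

```perl
use strict;
use warnings;

# Hypothetical helper: takes an already-open filehandle as a plain argument.
sub count_lines {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
    }
    return $count;
}

# Hypothetical helper: opens the file and hands the handle to count_lines.
sub lines_in_file {
    my ($file) = @_;
    open my $in, "<", $file or die "could not open $file: $!";
    return count_lines($in);
    # $in goes out of scope when this sub returns, so Perl closes the file.
}
```

With a bareword handle such as FILE you would have to pass \*FILE instead, and the handle would stay open until you closed it yourself or the program ended.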

Other things you might want to consider:

  1. use the strict pragma
  2. use the warnings pragma
  3. work on files a line or chunk at a time rather than reading them in all at once
  4. use an HTML parser instead of regex
  5. use named variables instead of the default variable ($_)
  6. if you are using the default variable, don't include it where it is already going to be used (e.g., s/foo/bar/; instead of $_ =~ s/foo/bar/;)
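Points 1, 2, 3 and 5 together might look something like the sketch below. It is only one way to do it: replace_streaming and the ".tmp" scratch name are assumptions made for the example, and it writes to a second file because you cannot stream through a file while truncating it.

```perl
use strict;      # point 1
use warnings;    # point 2

# Hypothetical helper: run one substitution over $file, one line at a time.
sub replace_streaming {
    my ($file, $from, $to) = @_;
    my $tmp = "$file.tmp";   # assumed scratch name; an existing file with this name is overwritten

    open my $in,  "<", $file or die "could not open $file: $!";
    open my $out, ">", $tmp  or die "could not open $tmp: $!";

    # Point 3: only one line is held in memory at a time.
    # Point 5: a named variable ($line) instead of $_.
    while (my $line = <$in>) {
        $line =~ s/\Q$from\E/$to/g;   # \Q...\E treats $from as literal text
        print {$out} $line;
    }

    close $in;
    close $out or die "could not close $tmp: $!";

    # Replace the original only after the new copy is completely written.
    rename $tmp, $file or die "could not rename $tmp to $file: $!";
    return;
}
```

The trade-off is that a line-at-a-time loop cannot match text that spans a line break, which is part of why point 4 matters.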

Number 4 may be very important for what you are doing. If you are not certain of the format these HTML files are in, you could easily miss things. For instance, "Authors Here" and "Authors\nHere" mean the same thing to HTML, but your regex will miss the latter. You might want to take a look at XML::Twig (I know it says XML, but it handles HTML as well). It is a very easy-to-use XML/HTML parser.
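Here is a small demonstration of the whitespace problem, using a made-up one-line HTML string. Matching \s+ instead of a literal space is only a stopgap, not a substitute for a parser: it works only when the text is in a single string (not read line by line), and it still misses cases such as markup between the two words.

```perl
use strict;
use warnings;

# Made-up sample: renders the same as "Authors Here" in a browser.
my $html = "<p>Authors\nHere</p>";

# A literal space in the pattern does not match the newline, so this counts 0 matches.
my $strict_hits = () = $html =~ /Authors Here/g;

# \s+ accepts any run of whitespace, including a line break.
(my $fixed = $html) =~ s/Authors\s+Here/Authors introduced this subject for the first time in this paper./g;
```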