user2570205 - 3 months ago
Perl Question

How to get the list of files that are not in another directory in Perl

I have to fix a Perl script, which does the following:

# Get the list of files in the staging directory; skip all beginning with '.'
opendir ERR_STAGING_DIR, "$ERR_STAGING" or die "$PID: Cannot open directory $ERR_STAGING";
@allfiles = grep !/^$ERR_STAGING\/\./, map "$ERR_STAGING/$_", readdir(ERR_STAGING_DIR);
closedir(ERR_STAGING_DIR);


I have two directories: STAGING and ERROR. STAGING contains files like ABC_201608100000.fin, and the error directory (ERR_STAGING_DIR) contains ABC_201608100000.fin.bc_lerr.xml. The Perl script runs as a daemon process which constantly looks for files in ERR_STAGING_DIR and processes the error files.

However, my requirement is to not process an error file while its counterpart ABC_201608100000.fin still exists in STAGING.

Question:

Is there a way I can filter the @allfiles array and select only the files whose counterparts don't exist in the STAGING directory?

WHAT I HAVE TRIED:

I tried to programmatically ignore the files that exist in the STAGING dir, but it is not working.

# Move file from the staging directory to the processing directory.
@splitf = split(/.bc_lerr.xml/, basename($file));
my $finFile = $STAGING . "/" . $splitf[0];
print LOG "$PID: Staging File $finFile \n";

foreach $file (@sorted_allfiles) {
    if ( -e $finFile ) {
        print LOG "$PID: Staging File still exist.. moving to next $finFile \n";
        next;
    }
    # DO THE PROCESSING.
}

Answer

The questions of timing aside, I assume that a snapshot of files may be processed without worrying about new files showing up. I take it that @allfiles has all file names from the ERROR directory.

Remove a file name from the front of the array at each iteration. Check for the corresponding file in STAGING and if it's not there process away, otherwise push it on the back of the array and skip.

while (@allfiles) 
{
     my $errfile = shift @allfiles;

     my ($file) = $errfile =~ /(.*)\.bc_lerr\.xml$/;

     if (-e "$STAGING/$file")
     {
          push @allfiles, $errfile;
          sleep 1;                    # more time for existing files to clear
          next;
     }
     # process the error file
}

If the processing is faster than it takes for existing files in STAGING to clear, we would exhaust all processable files and then continuously re-run the file tests. There is no reason for such waste of resources; hence the sleep, to give STAGING files some more time to go away. Note that if even one file in STAGING fails to go away, this loop will keep checking it forever, so you want to add a guard against that.
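One simple guard is a per-file deferral counter: give up on an error file once it has been re-queued too many times. This is a self-contained sketch, not the script's actual code; the temp directory, the sample file names, and the $MAX_RETRIES tunable are all stand-ins for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Stand-in setup; in the real daemon $STAGING and @allfiles already exist.
my $STAGING  = tempdir(CLEANUP => 1);
my @allfiles = ('ABC_201608100000.fin.bc_lerr.xml');

# Simulate a STAGING file that never goes away.
open my $fh, '>', "$STAGING/ABC_201608100000.fin" or die $!;
close $fh;

my $MAX_RETRIES = 3;   # made-up tunable
my %retries;           # error-file name => times deferred

while (@allfiles) {
    my $errfile = shift @allfiles;
    my ($file)  = $errfile =~ /(.*)\.bc_lerr\.xml$/;

    if (-e "$STAGING/$file") {
        if (++$retries{$errfile} >= $MAX_RETRIES) {
            warn "giving up on $errfile after $MAX_RETRIES deferrals\n";
            next;             # drop it instead of re-queueing forever
        }
        push @allfiles, $errfile;
        # sleep 1;            # re-enable in the real daemon
        next;
    }
    print "processing $errfile\n";    # real work goes here
}
```

With the stuck file above, the loop terminates after three deferrals instead of spinning indefinitely.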

Another way would be to process the error files with a foreach, and add those that should be skipped to a separate array. That can then be attempted separately, perhaps with a suitable wait.
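A minimal sketch of that foreach approach, with a temp directory and sample file names made up for the demonstration:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Stand-in setup; in the real script $STAGING and @allfiles already exist.
my $STAGING  = tempdir(CLEANUP => 1);
my @allfiles = ('A.fin.bc_lerr.xml', 'B.fin.bc_lerr.xml');

# A.fin is still sitting in STAGING, so its error file must be skipped.
open my $fh, '>', "$STAGING/A.fin" or die $!;
close $fh;

my (@processed, @deferred);

for my $errfile (@allfiles) {
    my ($file) = $errfile =~ /(.*)\.bc_lerr\.xml$/;
    if (-e "$STAGING/$file") {
        push @deferred, $errfile;    # counterpart still in STAGING
        next;
    }
    push @processed, $errfile;       # real processing goes here
}

print "processed: @processed\n";     # processed: B.fin.bc_lerr.xml
print "deferred:  @deferred\n";      # deferred:  A.fin.bc_lerr.xml
```

The @deferred array can then be retried after a wait, or handed to the daemon's next cycle.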

How suitable this is depends on details of the whole process. For how long do STAGING files hang around, and is this typical or exceptional? How often do new files show up? How many files are there typically?


If you only wish to filter out the error files that have their counterparts in STAGING

my @errfiles_nostaging = grep { 
    my ($file) = $_ =~ /(.*)\.bc_lerr\.xml$/;
    not -e "$STAGING/$file";
} @allfiles;

The output array contains the files from @allfiles which have no corresponding file in $STAGING and can be readily processed. This would be suitable if the error files are processed very fast in comparison to how long the $STAGING files stay around.

The filter can be written in one statement as well. For example

grep { not -e "$STAGING/" . s/\.bc_lerr\.xml$//r }
# or
grep { not -e "$STAGING/" . (split /\.bc_lerr\.xml$/, $_)[0] }

The first example uses the non-destructive /r modifier, available since 5.14. It changes the substitution to return the changed string and not change the original one. See it in perlrequick and in perlop.
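A quick self-contained check of the /r variant; the temp directory and file names are invented for the demonstration:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $STAGING = tempdir(CLEANUP => 1);

# A.fin is still in STAGING; B.fin is not.
open my $fh, '>', "$STAGING/A.fin" or die $!;
close $fh;

my @allfiles = ('A.fin.bc_lerr.xml', 'B.fin.bc_lerr.xml');

# s///r leaves $_ untouched and returns the stripped name for the -e test.
my @errfiles_nostaging =
    grep { not -e "$STAGING/" . s/\.bc_lerr\.xml$//r } @allfiles;

print "@errfiles_nostaging\n";    # B.fin.bc_lerr.xml
```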