Dagang Dagang - 1 month ago 6
Bash Question

Shell: find files in a list

I have a list containing about 1000 file names. I want to path of these files under a base directory. There're many subdirs in the dir, and more than 1,000,000 files in the subdirs. The following command will do find for 1000 times:

cat filelist.txt | while read f; do find /dir -name $f; done


Is there faster or better way to do it?

Answer

If filelist.txt has a single filename per line:

find /dir | grep -f <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt)

(The -f option means that grep searches for all the patterns in the given file.)

Explanation of <(sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt):

The <( ... ) is called a process subsitution, and is a little similar to $( ... ). The situation is equivalent to (but using the process substitution is neater and possibly a little faster):

sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt > processed_filelist.txt
find /dir | grep -f processed_filelist.txt

The call to sed runs the commands s@^@/@, s/$/$/ and s/\([\.[\*]\|\]\)/\\\1/g on each line of filelist.txt and prints them out. These commands convert the filenames into a format that will work better with grep.

  • s@^@/@ means put a / at the before each filename. (The ^ means "start of line" in a regex)
  • s/$/$/ means put a $ at the end of each filename. (The first $ means "end of line", the second is just a literal $ which is then interpreted by grep to mean "end of line").

The combination of these two rules means that grep will only look for matches like .../<filename>, so that a.txt doesn't match ./a.txt.backup or ./abba.txt.

s/\([\.[\*]\|\]\)/\\\1/g puts a \ before each occurrence of . [ ] or *. Grep uses regexes and those characters are considered special, but we want them to be plain so we need to escape them (if we didn't escape them, then a file name like a.txt would match files like abtxt).

As an example:

$ cat filelist.txt
file1.txt
file2.txt
blah[2012].txt
blah[2011].txt
lastfile

$ sed 's@^@/@; s/$/$/; s/\([\.[\*]\|\]\)/\\\1/g' filelist.txt
/file1\.txt$
/file2\.txt$
/blah\[2012\]\.txt$
/blah\[2011\]\.txt$
/lastfile$

Grep then uses each line of that output as a pattern when it is searching the output of find.

Comments