vivasra vivasra - 3 months ago 24
Perl Question

Finding and replacing from a large file

Masking problem: I need to locate and mask (i.e., replace with, say, "XXX") certain terms (words or expressions) in a single large text file (input.txt, 100+ MB). The 10K+ terms to mask are stored in a single file (to_mask.txt). How can I do this efficiently?

I was thinking of doing this in two steps. First, locate the lines that actually contain the terms:

grep -Ff to_mask.txt -o -n input.txt


Next, go through the output and do the actual replacing (term -> "XXX").

This seems a bit tedious. Can it be done in a smarter way?

Any combination of basic commands (grep, sed, awk, a Perl one-liner) is welcome!

Answer

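With sed and process substitution:

sed -f <(sed 's~^~s\~~;s~$~\~XXX\~g~' to_mask.txt) input.txt

The trailing g makes each generated command replace every occurrence of a term on a line, not only the first. To edit the file in place, add the -i option:

sed -i -f <(sed 's~^~s\~~;s~$~\~XXX\~g~' to_mask.txt) input.txt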

Explanation:

Each string in to_mask.txt is turned into a sed substitution command that replaces that string with XXX.

Using process substitution, the output of the inner sed is passed as a script file to the -f option of the outer sed, which applies it to input.txt.
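To make the two stages visible, here is a minimal sketch that writes the generated script to a file instead of using process substitution. The sample terms, the sample input, and the file names mask.sed and masked.txt are made up for illustration. The g flag at the end of each generated command makes sed replace every occurrence on a line rather than only the first.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data, not taken from the question.
printf '%s\n' 'alice' 'secret project' > to_mask.txt
printf '%s\n' 'alice met bob, then alice left' 'the secret project is late' > input.txt

# Inner sed: prepend "s~" and append "~XXX~g" to every term.
sed 's~^~s\~~;s~$~\~XXX\~g~' to_mask.txt > mask.sed
cat mask.sed
# s~alice~XXX~g
# s~secret project~XXX~g

# Outer sed: run the generated script against the input.
sed -f mask.sed input.txt > masked.txt
cat masked.txt
# XXX met bob, then XXX left
# the XXX is late
```

Keeping the generated script in a file also makes it easy to inspect before running it on the real input.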

~ is used here as the sed delimiter; it can be replaced by any other character that does not appear in to_mask.txt.
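One way to choose the delimiter is to test a few candidates against the term list. The sketch below is only an illustration: the sample terms and the candidate list are assumptions, and it only picks a character without building the sed command.

```shell
# Hypothetical term list in a scratch directory; one term contains '~'.
workdir=$(mktemp -d)
printf '%s\n' 'alice' '~/private' > "$workdir/to_mask.txt"

# Candidate delimiters, tried in an arbitrary order of preference.
delim=''
for candidate in '~' '#' '@' '%'; do
    # -F matches the candidate literally; -q reports via exit status only.
    if ! grep -qF -- "$candidate" "$workdir/to_mask.txt"; then
        delim=$candidate
        break
    fi
done

if [ -n "$delim" ]; then
    echo "safe delimiter: $delim"
else
    echo "every candidate occurs in the term list" >&2
fi
```

Independently of the delimiter, sed treats each term as a basic regular expression, so characters such as . or * in to_mask.txt keep their regex meaning.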
