Masking problem: I need to locate and mask (i.e., replace with, say, "XXX") certain terms (words/expressions) in a single, large text file (input.txt, 100+ MB). The terms (10K+) that I need to locate are stored in a single file (to_mask.txt). How can I do this efficiently?
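For concreteness, a tiny hypothetical example (the terms and text below are made up; only the filenames come from the question). If to_mask.txt contains

foo
bar baz

and input.txt contains

a line with foo in it
nothing to mask here
bar baz appears twice: bar baz

then the expected output is

a line with XXX in it
nothing to mask here
XXX appears twice: XXX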
I was thinking of doing this in two steps: first, locate the lines that actually contain the terms:
grep -Ff to_mask.txt -o -n input.txt
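With the hypothetical files above, grep prints one line per match, each prefixed by its line number:

1:foo
3:bar baz
3:bar baz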
With sed and process substitution:
sed -f <(sed 's~^~s\~~;s~$~\~XXX\~~' to_mask.txt) input.txt
To edit the file in place, add the -i option:
sed -i -f <(sed 's~^~s\~~;s~$~\~XXX\~~' to_mask.txt) input.txt
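To see the script that the inner sed generates, it can be run on its own. With the hypothetical terms above,

sed 's~^~s\~~;s~$~\~XXX\~~' to_mask.txt

prints:

s~foo~XXX~
s~bar baz~XXX~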
Explanation:
All strings in to_mask.txt are turned into sed substitution commands that replace each string with XXX. Using process substitution, the output of this inner sed is passed as a script file to the -f option of the outer sed, which is applied to input.txt.
~ is used here as the sed delimiter; it can be replaced by any other character that does not appear in to_mask.txt.
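Note that sed treats each term as a basic regular expression, so a term containing characters such as . * [ ] ^ $ \ (or the ~ delimiter itself) would change the meaning of the generated command. A rough sketch that pre-escapes those characters before building the script (the extra first s command is my addition and assumes GNU sed):

sed -f <(sed 's/[][\.*^$~]/\\&/g; s~^~s\~~; s~$~\~XXX\~~' to_mask.txt) input.txt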