Aidas Keburys - 26 days ago
Linux Question

GREP in part of line using keywords file

I need to check txt files for multiple phrases and, if a line contains any of them, remove that line from the txt file.

Using inverse grep (grep -v) with a file containing the phrases that need to be removed works like a charm.

THE PROBLEM is that I need to search only part of each line, rather than the whole line.

I need to check only the part of the line up to the 10th comma character.
If grep finds a phrase after that point, I want to keep the line; if grep matches before that point, I want to remove the line.

I can't figure out how to use a regex alongside the phrases file. Any suggestions welcome.

#!/bin/bash

shopt -s globstar

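for f in /uploads/txt/original/**/*.txt ; do
    # keep only the lines that match none of the phrases in phrase.txt
    grep -i -v -w -f phrase.txt "$f" > tmp
    # quote "$f" so paths containing spaces survive the move
    mv tmp "$f"
done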

echo "Finished!"


EDIT

# Rule to set the flag if the line needs to be printed or not
{
    ok = 1
    # loop up to the tenth column
    for (i = 1; i <= 10; i++){
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}


Does this mean that every column is run against PATS?

Would it be possible to merge the 10 columns into one string and then check that string against all patterns, instead of checking each of the 10 columns against all patterns?

Answer

Input data /tmp/test

Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
foo,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, BAR,  Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, FOO,   Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, BAR,   Val11, Val12

Phrases /tmp/phrases

FOO
BAR

Awk Script with comments

#!/usr/bin/gawk -f

BEGIN {
    FS         = " *, *" # Field Separator regex to split words
    IGNORECASE = 1       # ignore case for regex match

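    # read the phrases file into an array
    # prepend '^' and append '$' to each phrase for an exact (whole-field) match
    # "> 0" ends the loop at end of file and also when the file cannot be read
    while ((getline a < "/tmp/phrases") > 0) PATS["^"a"$"]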
}

# Rule to set the flag if the line needs to be printed or not
{
    ok = 1
    # loop up to the tenth column
    for (i = 1; i <= 10; i++){
        # match against each pattern
        for (p in PATS) {
            if ($i ~ p) {
                ok = 0
            }
        }
    }
}

# Rule to actually print the line if the flag is set
ok {print}

# Debugging rule: prints the loaded patterns. Remove it from the actual code.
END { for (p in PATS) print p }

# One liner
#  gawk 'BEGIN{FS=" *, *";IGNORECASE=1;while((getline a < "/tmp/phrases")>0)PATS["^"a"$"]}{ok=1;for(i=1;i<=10;i++){for(p in PATS){if($i ~ p){ok=0}}}} ok {print}' /tmp/test

Output:

Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9, Col10, Col11, Col12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
FOO1,  Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, Val11, Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, FOO,   Val12
Val1, Val2, Val3, Val4, Val5, Val6, Val7, Val8, Val9, Val10, BAR,   Val12
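On the question of merging the first 10 columns into one string: that should be possible by joining the phrases into a single alternation and testing the joined fields once per line. The following is only a sketch under a few assumptions. filter_first_ten is a made-up helper name, tolower() stands in for gawk's IGNORECASE so that it runs in any POSIX awk, and the phrases are still treated as regular expressions, as in the script above.

```shell
#!/bin/sh
# Sketch: drop a line when any phrase fills a whole field among its first 10 fields.
# filter_first_ten is a hypothetical helper name, not part of the original script.
filter_first_ten() {
    # $1 = phrases file (one phrase per line), $2 = data file
    awk -v phrases="$1" '
    BEGIN {
        FS = " *, *"
        # Build one alternation such as "foo|bar" from the phrases file.
        while ((getline p < phrases) > 0) {
            if (p == "") continue
            alt = (alt == "" ? "" : alt "|") tolower(p)
        }
        close(phrases)
        # A phrase must fill a whole field: bounded by a comma or by the
        # start or end of the joined string.
        re = "(^|,)(" alt ")(,|$)"
    }
    {
        # Join the first 10 fields (fewer on short lines) into one string.
        n = (NF < 10 ? NF : 10)
        head = tolower($1)
        for (i = 2; i <= n; i++) head = head "," tolower($i)
        # Print the line unless the joined string contains a phrase.
        if (alt == "" || head !~ re) print
    }' "$2"
}
```

Called as filter_first_ten /tmp/phrases /tmp/test, this prints the same five lines as the output above. Whether it is faster than the nested loop depends on the awk implementation and the number of phrases, so it is worth timing on real data.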

Credit goes to this answer http://stackoverflow.com/a/14471194/2032943
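To tie this back to the loop in the question, the same per-field check could be run over every txt file in a directory tree. This is a rough sketch rather than a drop-in replacement: filter_tree is a hypothetical helper name, find is used instead of the bash globstar loop so that it runs under plain sh, and tolower() replaces IGNORECASE so that any POSIX awk will do. It assumes file names contain no newlines.

```shell
#!/bin/sh
# Sketch: run the field-limited filter over every .txt file under a directory.
# filter_tree is a hypothetical helper; the caller supplies both paths.
filter_tree() {
    # $1 = root directory to search, $2 = phrases file (one phrase per line)
    find "$1" -type f -name '*.txt' | while IFS= read -r f; do
        tmp=$(mktemp) || continue
        if awk -v phrases="$2" '
            BEGIN {
                FS = " *, *"
                # Anchor each phrase so that it must match a whole field.
                while ((getline a < phrases) > 0)
                    if (a != "") PATS["^" tolower(a) "$"]
                close(phrases)
            }
            {
                ok = 1
                # Check at most the first 10 fields and stop at the first hit.
                for (i = 1; i <= 10 && i <= NF && ok; i++)
                    for (p in PATS)
                        if (tolower($i) ~ p) { ok = 0; break }
                if (ok) print
            }' "$f" > "$tmp"
        then
            # Overwrite in place only when awk succeeded; writing through
            # cat keeps the permissions of the original file.
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}
```

For the layout in the question the call would be something like filter_tree /uploads/txt/original /tmp/phrases. Try it on a copy of the data first, since it overwrites the files.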