Nrithya M - 4 years ago 128
Bash Question

Unix script or parser to delete stop words in a file

I am looking for a parser or script to remove stop words from a file.

This is the sample file:

-1.1956528741743269|ellen brown|Ellen_Brown|-3.9166730593775214|WOULD ATTORNEY FROM|||||||||||||||||||||
-2.3889038197374015|rick santorum|Rick_Santorum||CRITICIZED|||||||||||||||||||||
-1.5485422793287602|thomas jefferson|Thomas_Jefferson|-1.7299349891097682||IS LETTER TO|||||||||||||||||||||
-1.229126527004769|lewis powell|Lewis_Powell_%28conspirator%29|-3.024385187632112|IS JUSTICE OF|||||||||||||||||||||
-2.2268355006701155|michael bloomberg|Michael_Bloomberg|-2.1242762129476493|WON MAYOR OF À|||||||||||||||||||||

This is the stop word list:


I only want to remove the stop words from each line, not delete the entire line. My current script also strips these words out of the middle of other words.

For example:

  • my line in file - "TOLD to stop using this line"

  • Stop word - "To"

  • Output - "LD sp using this line"
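The behaviour in the example comes from matching the stop word as a substring instead of as a whole word. A minimal sketch of whole-word removal, assuming words are separated by spaces and matching should ignore case (strip_word is a hypothetical helper name, not part of the original script):

```shell
# Hypothetical helper: strip_word WORD LINE
# Prints LINE with every whole-word, case-insensitive occurrence of WORD removed.
strip_word() {
    printf '%s\n' "$2" | awk -v stop="$1" '{
        n = split($0, w, " ")            # split the line into space-separated tokens
        out = ""
        for (i = 1; i <= n; i++) {
            if (tolower(w[i]) == tolower(stop)) continue   # drop exact token matches only
            sep = (out == "" ? "" : " ")
            out = out sep w[i]
        }
        print out
    }'
}

strip_word "To" "TOLD to stop using this line"
# prints: TOLD stop using this line
```

Because each token is compared as a whole, "TOLD" and "stop" are left intact and only the standalone "to" is dropped.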

My file/dataset contains 70k entries.

Answer

The code below removes each stop word, as a whole word only, from the beginning, end, or middle of the column whose number is passed in the fields variable.

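fields="col_num=1"   # pass the column you want to remove stop words from
str='FS=OFS="|"'     # str was not defined in the original; assumed here to set the "|" delimiter seen in the sample file

  # Read one stop word per line and rewrite the file once per word.
  while read -r word; do
     awk 'BEGIN{'"$str"';'"$fields"'} {gsub("^'"$word"'[ ]|[ ]'"$word"'$|^'"$word"'$",X,$col_num); gsub("[ ]'"$word"'[ ]", " ",$col_num); gsub(/^ /,X,$col_num); gsub(/ $/,X,$col_num); print}' file > file.temp
     mv file.temp file
  done < stop_words.txt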

Hope that helps!!
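The loop above rewrites the whole file once per stop word, which may be slow with 70k entries and a long stop word list. As an alternative, here is a minimal single-pass sketch. It assumes the "|" delimiter from the sample file, one stop word per line in stop_words.txt, and case-insensitive matching. The function name remove_stop_words and its argument order are hypothetical.

```shell
# Hypothetical helper: remove_stop_words STOPFILE DATAFILE [COLUMN]
# Prints DATAFILE with stop words removed from COLUMN (default 1).
remove_stop_words() {
    awk -F'|' -v OFS='|' -v col="${3:-1}" '
        # First file: remember every stop word, lowercased.
        NR == FNR { if ($0 != "") stop[tolower($0)] = 1; next }
        # Second file: rebuild the chosen column from the tokens that are not stop words.
        {
            n = split($col, w, " ")
            out = ""
            for (i = 1; i <= n; i++) {
                if (tolower(w[i]) in stop) continue
                sep = (out == "" ? "" : " ")
                out = out sep w[i]
            }
            $col = out
            print
        }
    ' "$1" "$2"
}

# Example use, for the fifth column of the sample file:
#   remove_stop_words stop_words.txt file 5 > file.temp && mv file.temp file
```

Comparing whole tokens against a lookup table avoids building a regular expression from each stop word, so words containing characters such as "." or "*" are treated literally. Runs of spaces inside the chosen column are collapsed to a single space as a side effect.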
