Probie - 1 month ago
Bash Question

Using Bash to cURL a website and grep for keywords

I'm trying to write a script that will do a few things in the following order:

  1. cURL websites from a list of urls contained within a "url_list.txt" (new-line delineated) file.

  2. For each website in the list, I want to grep that website looking for keywords contained within a "keywords.txt" (new-line delineated) file.

  3. I want to finish by printing to the terminal in the following format (or something similar):

    $URL (that contained match) : $keyword (that made the match)

It needs to be able to run in Ubuntu (GNU grep, etc.)

It does not need to be cURL and grep; as long as the functionality is there.

So far I've got:

keywords=$(cat ./keywords.txt)
urllist=$(cat ./url_list.txt)
for url in $urllist; do
content="$(curl -L -s "$url" | grep -iF "$keywords" /dev/null)"
echo "$content"

But for some reason, no matter what I try to tweak or change, it keeps failing to one degree or another.

How can I go about accomplishing this task?



Here's how I would do it:

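# Read the whole keyword list (one keyword per line) without calling cat.
keywords=$(<./keywords.txt)
while IFS= read -r url; do
    # -i: ignore case, -o: print only the matched text, -F: fixed strings
    curl -L -s "$url" | grep -ioF "$keywords" |
        while IFS= read -r keyword; do
            echo "$url: $keyword"
        done
done < ./url_list.txt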

What did I change:

  • I used $(<./keywords.txt) to read keywords.txt. This does not rely on an external program (cat in your original script).
  • I changed the for loop that iterates over the url list into a while loop. This guarantees that we use Θ(1) memory for the list (i.e. we don't have to load the entire url list into memory), and each line is read as one URL without word splitting.
  • I removed /dev/null from the grep command. When grep is given a file argument it searches that file and ignores its stdin, so it was searching /dev/null and finding nothing there. With no file arguments, grep filters its stdin (which happens to be the output of curl in this case).
  • I added the -o flag to grep so that it outputs only the matched keyword.
  • I removed the command substitution (subshell) where you were capturing the output of curl. Instead I run the command directly and feed its output to a while loop. This is necessary because we might get more than one keyword match per url.
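
To try the matching step without touching the network, here is a minimal sketch that wraps it in a function. The name match_keywords, the sample text, and the example URL are made up for illustration and are not part of the original script; the function only assumes that the page content arrives on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption, not from the original script).
# Reads page content on stdin and prints "<label>: <keyword>" for every match.
#   $1 = label to print (e.g. the URL)
#   $2 = newline-separated keyword list
match_keywords() {
    local label="$1" keywords="$2"
    # No file argument, so grep filters stdin.
    # -i: ignore case, -o: print only the matched text, -F: fixed strings
    grep -ioF -- "$keywords" |
        while IFS= read -r keyword; do
            printf '%s: %s\n' "$label" "$keyword"
        done
}

# Offline demonstration: a fixed string stands in for the output of curl.
printf 'Bash and cURL.\nMore bash here.\n' |
    match_keywords 'http://example.com' "$(printf 'bash\ncurl')"
```

In the real loop this would be called as curl -L -s "$url" | match_keywords "$url" "$keywords". Note that a keyword appearing several times on a page is printed once per occurrence, with the capitalization found on the page; piping through sort -u before the inner loop is one way to collapse exact duplicates, if that is wanted.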