snate - 1 year ago
Bash Question

Running Shell Script in parallel for each line in a file

I have a delimited (|) input file (TableInfo.txt) that has data as shown below


I have a shell script that parses each line and calls an executable, passing arguments from the line such as dbName and TableName. This process reads data from SQL Server and loads it into HDFS.

while IFS= read -r line;do
fields=($(printf "%s" "$line"|cut -d'|' --output-delimiter=' ' -f1-))
query=$(< ../sqoop/"${fields[1]}".sql)
sh ../ProcessName "${fields[0]}" "${fields[1]}" "$query"
done < ../TableInfo.txt

Right now my process runs sequentially for each line in the file, so the total time grows with the number of entries in the file.

Is there any way I can execute the process in parallel? I have heard about using xargs, GNU parallel, or ampersand with wait. I am not familiar with how to construct and use them. Any help is appreciated.

Note: I don't have GNU parallel installed on the Linux machine, so xargs looks like the only option, since I have heard there are some downsides to the ampersand-and-wait approach.

Answer Source

Put an & on the end of any line you want to move to the background. Replacing the silly (buggy) array-splitting method used in your code with read's own field-splitting, this looks something like:

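while IFS='|' read -r db table; do
    # "$(<file)" expands to the file's contents; the trailing & backgrounds the job
    ../ProcessName "$db" "$table" "$(<"../sqoop/${table}.sql")" &
done < ../TableInfo.txt
wait    # block until every background job has exited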
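
One downside of plain & is that a failed job goes unnoticed unless you collect its exit status. A minimal sketch of doing that follows, assuming a stand-in function process_one in place of ../ProcessName and made-up sample rows (both are invented for illustration, not taken from the question). It avoids arrays, so it should run under plain sh as well as bash:

```shell
# Hypothetical stand-in for ../ProcessName: fails for one table name.
process_one() {
  # $1 = db, $2 = table
  [ "$2" != "bad_table" ]
}

# Sample input invented for this sketch; the real file is ../TableInfo.txt.
input=$(mktemp)
printf '%s\n' 'sales|orders' 'sales|bad_table' 'hr|employees' > "$input"

started=0
launched=""
while IFS='|' read -r db table; do
  process_one "$db" "$table" &
  launched="$launched $!:$db.$table"   # remember the PID and a label for it
  started=$((started + 1))
done < "$input"

failed=""
for entry in $launched; do
  pid=${entry%%:*}
  label=${entry#*:}
  # "wait PID" returns the exit status of that particular job
  wait "$pid" || failed="${failed:+$failed }$label"
done
rm -f "$input"

echo "started=$started failed=${failed:-none}"
```

Note that this still starts every job at once; it only adds failure reporting. The labels here contain no whitespace, which is what lets the plain for loop split them safely.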

...FYI, re: what I meant about "buggy" --

fields=( $(foo) )

...performs not only string-splitting but also globbing on the output of foo; thus, a * in the output is replaced with a list of filenames in the current directory; a name such as foo[bar] can be replaced with files named foob, fooa or foor; the failglob shell option can cause such an expansion to result in a failure; the nullglob shell option can cause it to expand to nothing; etc.
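
To see the difference concretely, here is a small sketch run in a scratch directory. The file names and the sample line are invented for the demonstration. It uses set -- rather than an array so that it also runs under plain sh; in bash, fields=( $(...) ) expands its unquoted contents the same way.

```shell
orig_dir=$PWD
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt            # two files for the glob to match

line='mydb|*'                # hypothetical input line whose second field is *

# Unquoted command substitution: split on whitespace, then glob-expanded.
set -- $(printf '%s' "$line" | tr '|' ' ')
unsafe_count=$#              # mydb, a.txt, b.txt

# read does its own splitting on IFS and never globs the result.
IFS='|' read -r db table <<EOF
$line
EOF

cd "$orig_dir" || exit 1
rm -rf "$workdir"

echo "unquoted expansion produced $unsafe_count words; read kept table='$table'"
```

The unquoted form yields three words because the * matched both files, while read leaves the second field as a literal *.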

If you have GNU xargs, consider the following:

# assuming you have "nproc" to get the number of CPUs; otherwise, hardcode
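xargs -P "$(nproc)" -d $'\n' -n 1 bash -c '
  db=${1%|*}; table=${1##*|}
  # read the query inside the child shell; the parent never sets a "query" variable
  exec ../ProcessName "$db" "$table" "$(<"../sqoop/${table}.sql")"
  ' _ < ../TableInfo.txt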
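
If you want to try the xargs pattern before pointing it at the real loader, a dry run along these lines may help. It is only a sketch: the sample rows, the OUTDIR variable and the "bad_table" failure are all invented for illustration, and it assumes GNU xargs (for -d and -P). With GNU xargs, the overall exit status is 123 when any invocation exits with a status from 1 to 125, so failures are not silently lost.

```shell
input=$(mktemp)
outdir=$(mktemp -d)
# Sample rows invented for this sketch; the real file is ../TableInfo.txt.
printf '%s\n' 'sales|orders' 'sales|bad_table' 'hr|employees' > "$input"

status=0
# At most 2 jobs at a time; one input line per invocation, passed as $1.
OUTDIR="$outdir" xargs -P 2 -d '\n' -n 1 sh -c '
  db=${1%%|*}; table=${1#*|}
  # Stand-in for the real loader: pretend one table fails.
  if [ "$table" = "bad_table" ]; then exit 1; fi
  echo "loaded $db.$table" > "$OUTDIR/$db.$table.log"
' _ < "$input" || status=$?

# Count the log files written by the successful invocations.
set -- "$outdir"/*.log
loaded=$#
echo "xargs exit status: $status, tables loaded: $loaded"
rm -rf "$input" "$outdir"
```

Each invocation writes to its own file here, which sidesteps interleaved output from jobs running at the same time.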