bluelaser - 1 month ago
Bash Question

Issue with using While loop in bash script (to split a file into multiple files)

I need to read data from a file and insert into multiple files (with each file less than 3mb in size, file sizes can be different). The important thing is - records for an Agent shouldn't be split across multiple files. I am doing all of this in a While loop in a UNIX bash script.

Input.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3

Desired Output -

Output1.csv (less than 3MB)
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3

Output2.csv (less than 3MB)
Src,AgentNum,PhoneNum
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3


Bash Shell Script

#!/bin/bash
BaseFileName=$(basename $FileName | cut -d. -f1)
Header=`head -1 $FileName`
MaxFileSize=$(( 3 * 1024 * 1024 ))

sed 1d $FileName |
while read -r line
do
    echo $line >> ${BaseFileName}_${FileSeq}.csv

    MatchCount=`echo $line | grep -c -E '^.DWH'`

    if [[ $MatchCount -eq 1 ]]
    then
        FileSizeBytes=`du -b ${BaseFileName}_${FileSeq}.csv | cut -f1`
        if [[ $FileSizeBytes -gt $MaxFileSize ]]
        then
            #Add a header record to each file
            sed -i "1i ${Header}" ${BaseFileName}_${FileSeq}.csv
            FileSeq=$((FileSeq + 1))
        fi
    fi
done


It is almost working, except:
1) It does not split the records as expected (some records for an Agent end up split across multiple files).
2) It inserts the header record only into the first output file.
3) It is too slow: it took 3 minutes for a 10 MB file, and my real file is 3 GB.

Can somebody please suggest where I am going wrong?
Is there a better way to handle this?

Answer

Here's a rough attempt -- it's not as fast as a pure-awk solution would be, but it's much, much faster than what you already had:

#!/bin/bash

# two external parameters: input file name, and max size in bytes (default to 3MB)
InputFile=$1
MaxFileSize=${2:-$(( 3 * 1024 * 1024 ))}

BaseName=${InputFile%.*} # strip extension
Ext=${InputFile##*.}     # store extension
FileSeq=0                # start output file at sequence 0

# redirect stdin from the input file, stdout to the first output file
exec <"$InputFile" || exit
exec >"${BaseName}.${FileSeq}.${Ext}" || exit

# read the header; copy it to the first output file, and initialize CurFileSize
IFS= read -r Header || exit
printf '%s\n' "$Header" || exit
CurFileSize=$(( ${#Header} + 1 ))

# ...then loop over our inputs, and copy appropriately
while IFS= read -r line; do
  if [[ $line = DWH,* ]] && (( CurFileSize > MaxFileSize )); then
    (( FileSeq++ ))
    exec >"${BaseName}.${FileSeq}.${Ext}" || exit
    printf '%s\n' "$Header" || exit
    CurFileSize=$(( ${#Header} + 1 ))
  fi
  printf '%s\n' "$line" || exit
  (( CurFileSize += ${#line} + 1 ))
done

Noteworthy changes:

  • No external tools are called at all. No sed, no basename, no du, no grep. Any time you write $() or ``, there's a very nontrivial performance cost; these constructs shouldn't be used inside a tight loop unless impossible to avoid -- and when using ksh or bash extensions to the POSIX sh standard, it's rare that they are in fact impossible to avoid.
  • Redirections are invoked only when it's necessary to open a new output file. We don't use >>"$filename" every time we want to write a line, but instead use exec >"$filename" every time we need to start a new output file.
  • Quotes are always used during parameter expansions except in contexts where string-splitting or globbing is explicitly suppressed by other syntax. Failure to do this can corrupt your files (replacing * with a list of files in the current directory, for instance; replacing tabs with spaces; etc). When in doubt, quote more.
  • Using printf '%s\n' is better-defined by the POSIX standard than echo -- see the standard definition for echo, particularly the APPLICATION USAGE section.
  • We're doing error-handling explicitly. One could also use set -e, but there are substantial caveats to its use.
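To make the quoting point concrete, here is a tiny illustrative demo (my own, not part of the answer; the filenames are throwaway ones created just for the glob to hit):

```shell
#!/bin/sh
# Illustrative only: how an unquoted expansion mangles a record.
# Uses a throwaway directory so the glob result is predictable.
tmp=$(mktemp -d) && cd "$tmp" || exit
touch file_a file_b          # files for the glob to match
line='a *'                   # record text containing a space and a glob character
set -- $line                 # unquoted: word-split, then * expands against the directory
printf 'unquoted becomes %s words: %s\n' "$#" "$*"
set -- "$line"               # quoted: stays one word, byte-for-byte
printf 'quoted stays %s word: %s\n' "$#" "$*"
```

With the unquoted expansion the single record `a *` turns into three words (`a file_a file_b`); quoted, it survives intact. The same mechanism would corrupt CSV lines containing spaces or glob characters in the script above.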

Test procedure and output follow:

$ cat >input.csv <<'EOF'
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
EOF

$ ./splitCSV input.csv 100  ## split at first boundary after 100 bytes

$ cat input.0.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3

$ cat input.1.csv
Src,AgentNum,PhoneNum
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
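For completeness, here is one rough sketch of the "pure-awk solution" the answer alludes to. This is my own guess at it, not the answerer's code; the output names input.N.csv and the max size are assumptions mirroring the demo above. The rules are the same: start a new file only at a DWH,* boundary once the current file exceeds the limit, and repeat the header in every output file.

```shell
#!/bin/sh
# Sketch only: a single-pass awk version of the splitter, run in a
# throwaway directory with the sample input recreated from the question.
tmp=$(mktemp -d) && cd "$tmp" || exit
cat >input.csv <<'EOF'
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
EOF

awk -v max=100 '
  BEGIN { seq = 0 }
  NR == 1 {                      # remember the header, emit it to the first file
    header = $0
    fname = "input." seq ".csv"
    print header > fname
    size = length($0) + 1
    next
  }
  /^DWH,/ && size > max {        # agent boundary, and current file already too big
    close(fname)
    fname = "input." ++seq ".csv"
    print header > fname
    size = length(header) + 1
  }
  {
    print > fname
    size += length($0) + 1
  }
' input.csv
```

With max=100 this produces the same split as the bash version's test run above: input.0.csv holds Agent_1234 and Agent_5678, input.1.csv holds Agent_9999, and both start with the header. Since awk does its own buffered I/O and no subshells are forked, this should scale to the 3 GB input far better than any read loop.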