SharonG - 1 year ago
Bash Question

Shell Script to Parse Date

I am using a shell script to process a csv file with data in the following format:

yyyy-mm-dd, value

Each line has a different date and a different value.

I would like to parse each line into the following new format:

yyyy, weeknum, yyyy-mm-dd, value

Where yyyy is the 4-digit year from the date on that line, and weeknum is the week number for that day, month, and year.

I've worked out using the date command to get the weeknum, where I hard-coded the date to 2016-02-01 as an example:

echo $(date -j -f '%Y-%m-%d' '2016-02-01' '+%V')

But I'm just not sure how to incorporate this date command into something like sed where I can dynamically and globally insert the yyyy and weeknum values into each line based on the actual date value from that line in the file.

Any suggestions for how to proceed would be greatly appreciated!


Answer

This might do:

$ uname -sr
Darwin 15.4.0
$ cat inp
2016-01-01, 5
2016-01-09, 15
2016-02-01, 3.14
$ while IFS=", " read d v; do date -j -f '%Y-%m-%d' "$d" "+%Y, %V, %F, $v"; done < inp
2016, 53, 2016-01-01, 5
2016, 01, 2016-01-09, 15
2016, 05, 2016-02-01, 3.14

This puts everything, including the value, into the output format string passed to the date command, avoiding the need for subshells or temporary variables.
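The -j and -f flags above are specific to BSD date (as shipped with macOS). In case the same script has to run elsewhere, here is a minimal sketch that probes for the BSD flags once and otherwise falls back to -d, which GNU date understands. The function name week_line is made up for illustration, and the fallback assumes a date that accepts -d with a yyyy-mm-dd argument.

```sh
# Probe once: BSD date (macOS) accepts -j -f, GNU date rejects -j.
if date -j -f '%Y-%m-%d' '2016-02-01' '+%V' >/dev/null 2>&1; then
  # BSD flavour, same call as in the loop above.
  week_line() { date -j -f '%Y-%m-%d' "$1" "+%Y, %V, %F, $2"; }
else
  # Assumption: a date that parses yyyy-mm-dd via -d (GNU coreutils does).
  week_line() { date -d "$1" "+%Y, %V, %F, $2"; }
fi

# Usage: week_line <yyyy-mm-dd> <value>
week_line 2016-02-01 3.14
```

One thing to keep in mind: %V is the ISO 8601 week number, which is why 2016-01-01 reports week 53 (it belongs to the last ISO week of 2015). If you want the year that matches that week rather than the calendar year, %G prints it.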

Note the selection of quotes. While format strings are generally considered static, and usually placed in single quotes, if we want to include the variable $v in the format, we must use double quotes instead, allowing the variable expansion to take place. Note that if for some reason your input data in the CSV are "dirty", you could break your processing easily: a value containing a % character, for example, would be interpreted by date as a format specifier. This provides no input checking besides date's ability to parse the first field.
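If dirty input is a real possibility, one option is to check the shape of the first field before handing it to date. This is only a rough sketch: is_iso_date and convert are hypothetical names, the check covers the yyyy-mm-dd shape only (date still has to reject things like month 13), and the date call assumes BSD date as above.

```sh
# Hypothetical helper: succeeds only if the argument looks like yyyy-mm-dd.
is_iso_date() {
  case $1 in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Read "date, value" lines on stdin. Well-formed lines are converted,
# anything else is reported on stderr and skipped.
convert() {
  while IFS=", " read -r d v; do
    if is_iso_date "$d"; then
      date -j -f '%Y-%m-%d' "$d" "+%Y, %V, %F, $v"   # BSD date assumed
    else
      printf 'skipping line with bad date: %s\n' "$d" >&2
    fi
  done
}
```

You would then run it as: convert < inp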


If you were to install GNU awk (gawk) on your system using MacPorts, Homebrew, or similar, then the following would likely perform better:

gawk 'BEGIN{OFS=FS=", "} {split($1,a,"-"); print a[1],strftime("%V",mktime(gensub(/-/," ","g",$1) " 00 00 00")),$1,$2}' inp

I wrote this as a one-liner, but I'll break it into points for easier explanation.

  • BEGIN { OFS=FS=", " } - at the start of the script, sets both the input and the output field separator to a comma followed by a space.
  • { - the main part of this awk script has no "condition", so will be executed for every line of input.
  • split($1,a,"-") - split the first field into the array a[], separated by hyphens.
  • print a[1], - print output, starting with the year,
  • strftime("%V", - followed by a time format for the week-of-year,
  • mktime(gensub(/-/," ","g",$1) " 00 00 00")) - generated from a time parsed in mktime's "datespec" format,
  • ,$1,$2} - followed by the other two fields.
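For anyone who finds the one-liner hard to read, the same steps can be laid out across several lines with comments. This is just the command above reformatted and wrapped in a shell function; the function name add_weeks and the variable ts are mine, and it still assumes gawk is installed (mktime, strftime and gensub are gawk functions).

```sh
# Hypothetical wrapper around the gawk program above; needs gawk installed.
add_weeks() {
  gawk '
    BEGIN { OFS = FS = ", " }        # same separator for input and output
    {
      split($1, a, "-")              # a[1] = year, a[2] = month, a[3] = day
      # mktime expects "YYYY MM DD HH MM SS", so swap hyphens for spaces
      ts = mktime(gensub(/-/, " ", "g", $1) " 00 00 00")
      # year, ISO week number, original date, original value
      print a[1], strftime("%V", ts), $1, $2
    }
  ' "$@"
}
```

Usage is the same as before: add_weeks inp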

I haven't developed any performance metrics, but I'm certain the self-contained gawk option would run significantly faster than the bash-based option which spawns a date command for each line of input.
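If you want to measure this yourself, something along these lines should do. It is a rough sketch, not a proper benchmark: make_input and bench are hypothetical names, the generated file bench.inp just repeats one line (enough to show per-line overhead), and bench assumes both BSD date and gawk are available.

```sh
# Hypothetical generator: print N identical "date, value" lines.
make_input() {
  awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) print "2016-02-01, 3.14" }'
}

# Time both approaches against the same generated file.
bench() {
  make_input "${1:-10000}" > bench.inp
  # Shell loop: one date process per input line.
  time sh -c 'while IFS=", " read -r d v; do
      date -j -f "%Y-%m-%d" "$d" "+%Y, %V, %F, $v"
    done < bench.inp > /dev/null'
  # gawk: a single process for the whole file.
  time gawk 'BEGIN{OFS=FS=", "} {split($1,a,"-"); print a[1],strftime("%V",mktime(gensub(/-/," ","g",$1) " 00 00 00")),$1,$2}' bench.inp > /dev/null
}
```

Run it as: bench 10000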