Janaranjan Janaranjan - 14 days ago 5
Bash Question

comparing columns of files in unix?

I want to compare the filenames in Today.txt with those in Main.txt.
If there is a match, print all 6 columns of the matching line of Main.txt to a new file, say matched.txt.

For filenames in Today.txt that have no match in Main.txt, list the filename and time from Today.txt in a new file, say unmatched.txt.

NOTE: A plus sign (+) indicates that the file comes from the inprogress directory, so filenames in Main.txt are sometimes prefixed with "+".

Main.txt

date filename timestamp space count status
Nov 4 +CHCK01_20161104.txt 06:39 2.15M 17153 on_time
Nov 4 TRIPS11_20161104.txt 09:03 0.00M 24 On_Time
Nov 4 AR02_20161104.txt 09:31 0.00M 7 On_Time
Nov 4 AR01_20161104.txt 09:31 0.04M 433 On_Time


Today.txt

filename time
CHCK01_20161104.txt 06:03
CHCK05_20161104.txt 11:10
CHCK09_20161104.txt 21:46
AR01_20161104.txt 09:36
AR02_20161104.txt 09:36
ifs01_20161104.txt 21:16
TRIPS11_20161104.txt 09:16


Required Output:
matched.txt

Nov 4 +CHCK01_20161104.txt 06:39 2.15M 17153 on_time
Nov 4 TRIPS11_20161104.txt 09:03 0.00M 24 On_Time
Nov 4 AR02_20161104.txt 09:31 0.00M 7 On_Time
Nov 4 AR01_20161104.txt 09:31 0.04M 433 On_Time


unmatched.txt

CHCK05_20161104.txt 11:10
CHCK09_20161104.txt 21:46
ifs01_20161104.txt 21:16


The command below gives the correct output except when a filename is prefixed with a plus (+) sign.

awk 'FNR==1{next}
NR==FNR{a[$1]=$2; next}
$3 in a{print; delete a[$3]}
END{for(k in a) print k,a[k] > "unmatched"}' today main > matched


Thanks a lot in advance !

Answer

The problem is the condition $3 in a, which is evaluated while reading the main file: a filename such as +CHCK01_20161104.txt in $3 never equals the key CHCK01_20161104.txt stored from the today file. To match it, strip the leading + from $3 with gensub, which is available in GNU awk. The advantage of gensub over sub and gsub is that it returns the modified string instead of changing its target, so $3 and the printed line stay intact. Applied to your case:

$ awk 'FNR==1{next}
  NR==FNR{a[$1]=$2; next}
  gensub(/^\+/,"",1,$3) in a{print; delete a[gensub(/^\+/,"",1,$3)]}
      END{for(k in a) print k,a[k] > "unmatched"}' today main

Nov 4    +CHCK01_20161104.txt  06:39   2.15M  17153    on_time
Nov 4    TRIPS11_20161104.txt 09:03   0.00M  24       On_Time
Nov 4    AR02_20161104.txt    09:31   0.00M  7        On_Time
Nov 4    AR01_20161104.txt    09:31   0.04M  433      On_Time

This produces the 4 matched lines you need.

From the gawk manual page:

gensub(regexp, replacement, how [, target])
           gensub is a general substitution function. Like sub and gsub, it
           searches the target string target for matches of the regular
           expression regexp. Unlike sub and gsub, the modified string is
           returned as the result of the function, and the original target
           string is not changed. If how is a string beginning with `g' or
           `G', then it replaces all matches of regexp with replacement.

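So in our case, the how argument of 1 makes gensub replace only the first match of the regexp with an empty string. The count by itself does not restrict the match to the beginning of the field; that is the job of the ^ anchor. The regexp should therefore be written /^\+/ (with the + escaped so that it is taken literally), which avoids removing a + anywhere else in the field.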
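To see why changing $3 directly is a problem, here is a small sketch that uses plain sub, so it should not need GNU awk. The sample line is copied from Main.txt in the question; the shell variable names in_place and on_copy are made up for this illustration only.

```shell
line='Nov 4 +CHCK01_20161104.txt 06:39 2.15M 17153 on_time'

# sub applied to $3 itself rewrites the record, so the "+" disappears
# from the line that gets printed.
in_place=$(printf '%s\n' "$line" | awk '{ sub(/^\+/, "", $3); print }')

# sub applied to a copy leaves $0 alone; only the lookup key k is stripped.
on_copy=$(printf '%s\n' "$line" | awk '{ k = $3; sub(/^\+/, "", k); print k " | " $0 }')

printf '%s\n%s\n' "$in_place" "$on_copy"
```

The first result no longer carries the + that the required matched.txt keeps, while the second yields a clean key and the untouched line side by side.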

Alternatively, a neater approach, thanks to Ed Morton's suggestion: copy $3 into a variable and apply sub to the copy, which also works in any POSIX awk:

$ awk 'FNR==1{next} 
  NR==FNR{a[$1]=$2; next} 
  {k=$3; sub(/^\+/,"",k)} k in a{print; delete a[k]} 
      END{for(k in a) print k,a[k] > "unmatched"}' today main
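For anyone who wants to reproduce this end to end, the following is a minimal sketch that recreates the two sample files from the question and writes both output files. The file names Today.txt, Main.txt, matched.txt and unmatched.txt follow the question's wording, and the variable names seen, key and name are my own choice; it is best run in an empty scratch directory because it overwrites those files.

```shell
# Recreate the sample input from the question (run in a scratch directory).
cat > Main.txt <<'EOF'
date filename timestamp space count status
Nov 4 +CHCK01_20161104.txt 06:39 2.15M 17153 on_time
Nov 4 TRIPS11_20161104.txt 09:03 0.00M 24 On_Time
Nov 4 AR02_20161104.txt 09:31 0.00M 7 On_Time
Nov 4 AR01_20161104.txt 09:31 0.04M 433 On_Time
EOF

cat > Today.txt <<'EOF'
filename time
CHCK01_20161104.txt 06:03
CHCK05_20161104.txt 11:10
CHCK09_20161104.txt 21:46
AR01_20161104.txt 09:36
AR02_20161104.txt 09:36
ifs01_20161104.txt 21:16
TRIPS11_20161104.txt 09:16
EOF

awk '
  FNR == 1  { next }                          # skip the header line of each file
  NR == FNR { seen[$1] = $2; next }           # Today.txt: filename -> time
  { key = $3; sub(/^\+/, "", key) }           # Main.txt: strip a leading "+" from a copy of $3
  key in seen { print; delete seen[key] }     # matched: print the Main.txt line unchanged
  END { for (name in seen) print name, seen[name] > "unmatched.txt" }
' Today.txt Main.txt > matched.txt

# The order of "for (name in seen)" is unspecified, so sort for a stable listing.
sort unmatched.txt
```

With the sample data, matched.txt should hold the four Main.txt lines (including the one with the +), and unmatched.txt the three Today.txt entries that have no counterpart.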