clear.choi clear.choi - 3 months ago 17
Linux Question

awk combine sequence with substring key

I have two files for combining with tab delimited file.
That two files key might be different only for "Reads" number.

I want to compare that two files and combined based on substring key and match.

Forexample,

File1 Key : "Cluster0_Reads255"
File2 Key : "Cluster0_Reads50"
This case is same because "Cluster0_Reads" is identical.


This case I want to combine that two column with File1 Key name.
Please check below example case.

File 1.

A B
Cluster0_Reads255 500
Cluster1_Reads253 300
Cluster2_Reads100 200
Cluster3_Reads100 350


File 2.

A C
Cluster0_Reads50 GE
Cluster1_Reads200 GA
Cluster2_Reads100 GA


Result.

A B C
Cluster0_Reads255 500 GE
Cluster1_Reads253 300 GA
Cluster2_Reads100 200 GA
Cluster3_Reads100 350 -


I made one awk with exact match find and combine like below,

awk '
BEGIN { FS = OFS = "\t" }
{key = $1}
FNR == NR {result[key] = $0; next;}
(key in result) { updated[key]=1 ; for (i=2; i <= NF; i++) result[key] = result[key] FS $i }
END {
PROCINFO["sorted_in"] = "@ind_str_asc" # if using GNU awk
for (key in result) {
if(!(key in updated)) result[key] = result[key] FS "-"
if(!(length(key)==0)) print result[key]
}
}
' File1 File2 > File3


Is there any way to combine after substring?

Thank you.

Answer

Below kinda dirty awk script does the job, but i'm sure you would find a better one ere.

awk -v FS="\t" -v OFS="\t" '
NR==FNR{f1=$1;sub(/[0-9]*$/,"",f1);file1info[f1]=$0;next}
       {sub(/[0-9]*$/,"",$1);file2info[$1]=$2}
    END{
        for(i in file1info){
        print file1info[i],(i in file2info)?file2info[i]:"-";
        }
       }' File1 File2 | expand -t 20 | sort -nk1

Output

A                   B                   C          
Cluster0_Reads255   500                 GE
Cluster1_Reads253   300                 GA
Cluster2_Reads100   200                 GA
Cluster3_Reads100   350                 -

Edit

Finally managed to get a smaller faster one. The trick was reversing the files considering that file2's clusters always form a subset of file1's.

awk -v FS="\t" -v OFS="\t" '
NR==FNR{sub(/[0-9]*$/,"",$1);file2info[$1]=$2;next}
       {f1=$1;sub(/[0-9]*$/,"",f1);print $0,(f1 in file2info)?file2info[f1]:"-"}
       ' File2 File1   | expand -t 20 |sort -nk1

Output

A                   B                   C          
Cluster0_Reads255   500                 GE
Cluster1_Reads253   300                 GA
Cluster2_Reads100   200                 GA
Cluster3_Reads100   350                 -
Comments