dam4l10 - 6 months ago
Bash Question

Delete part of string in line (but not whole line) in subset of rows?

I have a tab-delimited text file with 4 columns and about a hundred million rows that looks like this:

chr1 10019 10020 rs775809821
chr2 10108 10109 rs376007522
chr3 10128 10128 rs796688738
chr4 10128 10128 rs796688738
chr5 10138 10139 rs368469931
chr6 10146 10147 rs779258992
chr7 10165 10165 rs796884232
chr8_KI270718v1_random 10149 10150 rs371194064
chr9_GL000221v1_random 10144 10145 rs144773400
chr10_KI270879v1_alt 10055 10055 rs768019142
chr11_KI270714v1_random 10107 10108 rs62651026


In the lines where the first column contains a "_", I want to delete the part of that column from the "_" onwards. So I want the output to look like:

chr1 10019 10020 rs775809821
chr2 10108 10109 rs376007522
chr3 10128 10128 rs796688738
chr4 10128 10128 rs796688738
chr5 10138 10139 rs368469931
chr6 10146 10147 rs779258992
chr7 10165 10165 rs796884232
chr8 10149 10150 rs371194064
chr9 10144 10145 rs144773400
chr10 10055 10055 rs768019142
chr11 10107 10108 rs62651026


I tried doing this with sed:

sed 's/_\S*\s*/ /' infile > outfile

but this only removed the "_" itself in the lines that contain the string I wanted to remove. So the output looked something like this:

chr1 10019 10020 rs775809821
chr2 10108 10109 rs376007522
chr3 10128 10128 rs796688738
chr4 10128 10128 rs796688738
chr5 10138 10139 rs368469931
chr6 10146 10147 rs779258992
chr7 10165 10165 rs796884232
chr8 KI270718v1_random 10149 10150 rs371194064
chr9 GL000221v1_random 10144 10145 rs144773400
chr10 KI270879v1_alt 10055 10055 rs768019142
chr11 KI270714v1_random 10107 10108 rs62651026


How can I delete only the part of column 1 from the "_" onwards, and only in the lines where column 1 has a string following "chr#"?

Answer

You can use awk and apply the substitution to the first tab-separated field only:

awk 'BEGIN{FS=OFS="\t"} $1 ~ /chr/{sub(/_.*$/, "", $1)} 1' file

Output:

chr1   10019  10020  rs775809821
chr2   10108  10109  rs376007522
chr3   10128  10128  rs796688738
chr4   10128  10128  rs796688738
chr5   10138  10139  rs368469931
chr6   10146  10147  rs779258992
chr7   10165  10165  rs796884232
chr8   10149  10150  rs371194064
chr9   10144  10145  rs144773400
chr10  10055  10055  rs768019142
chr11  10107  10108  rs62651026
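
For comparison, a sed-only variant might look like the sketch below. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do), and strip_chr_suffix is a hypothetical wrapper name used only for illustration. Anchoring the pattern at the start of the line means only the first column is touched, so an underscore in a later column is left alone.

```shell
# Hypothetical helper (not from the original answer): removes "_..." from a
# leading "chr..." field and leaves every other column unchanged.
# Assumes sed supports -E (extended regular expressions).
strip_chr_suffix() {
  # ^(chr[^_[:space:]]*)  captures "chr" plus everything up to the first "_"
  # _[^[:space:]]*        matches the "_" and the rest of the first column
  sed -E 's/^(chr[^_[:space:]]*)_[^[:space:]]*/\1/' "$@"
}

# Demo on two tab-separated sample rows taken from the question.
printf 'chr1\t10019\t10020\trs775809821\nchr8_KI270718v1_random\t10149\t10150\trs371194064\n' |
  strip_chr_suffix
```

On a real file this would be called as strip_chr_suffix infile > outfile. Lines whose first column has no "_" do not match the pattern and pass through unchanged.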