nub2.0 nub2.0 - 2 months ago 6
Linux Question

Using linux to find and replace the a value in a column in a tab delimited file

I have this data below for example:

Name Chrom Position
rs1 1 1234
rs2 1 1789
rs3 1 1289
1 1 1269
2 1 1897


I would like to find all the lines in the first column that DO NOT contain the string "rs" and replace with
rs'chrom''position'
. Name 1 would look like
rs11269
.

It would end up looking like this:

Name Chrom Position
rs1 1 1234
rs2 1 1789
rs3 1 1289
rs11269 1 1269
rs11897 1 1897


I know with awk I can do

awk '!/rs/{print $1}' file


and it will find all the lines in the first column that do not contain the string "rs" but how do I take this further now to replace it with my own rs ID built from chrom and position? Would I use gsub or something else? This doesn't have to be in shell commands but also Perl is another option to use as well. Thank you for your help.

Answer

You can use this command:

$ awk 'BEGIN{FS=OFS="\t"}NR>1&&!($1~/rs/){$1="rs"$2$3}1' file
Name     Chrom  Position
rs1      1      1234
rs2      1      1789
rs3      1      1289
rs11269  1      1269
rs11897  1      1897

With BEGIN{FS=OFS="\t"} we set the input and output field separators to the tab character, with NR>1&&!($1~/rs/) we filter the lines that are not the first (header) and that do not contain the string "rs" in the first field, and with {$1="rs"$2$3} we change the value of the first field to the desired value. The final 1 is the true condition, so that all lines are printed.