I have a file like this (308545 lines):
head output11.bim
1 1:775852:T:C 0 775852 T C
1 1:1120590:A:C 0 1120590 C A
1 1:1145994:T:C 0 1145994 C T
1 1:1148494:A:G 0 1148494 A G
1 1:1201155:C:T 0 1201155 T C
1 1:1468016:T:C 0 1468016 C T
...
Another file (marker-info) starts with 24 comment lines and is comma-separated, as shown below (500593 lines in total):
1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,769185,SNP_A-4303947,ss66273559,rs4040617,36.2,A,G,A,G,A,GCTGTGAGAGAGAACA[A/G]TGTCCCAATTTTGCCC,Sty,+,+,n,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
1,782343,SNP_A-2236359,ss66185183,rs2905036,36.2,C,T,C,T,A,CTCGATTTGTGTTCAA[C/T]ATATTTCATTTGTACC,Sty,-,-,n,,,127,phs000018
1,1120590,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018
...
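I want to replace the second column of output11.bim with the fifth column of marker-info wherever the first and second columns of marker-info match the first and fourth columns of output11.bim. For this example, the result for output11.bim would look like this: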
1 rs2980300 0 775852 T C
1 rs4245756 0 1120590 C A
Answer 1
$ cat tst.awk
NR==FNR { map[$1,$2]=$5; next }
($1,$4) in map { $2=map[$1,$4]; print }
$ awk -f tst.awk FS=',' marker-info FS=' ' output11.bim
1 rs2980300 0 775852 T C
1 rs4245756 0 1120590 C A
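To try the command above without the full data, here is a minimal sketch that rebuilds a few of the sample rows from the question in a temporary directory. The `tmp` and `result` variable names are assumptions for illustration only, and the awk program is the same as tst.awk, inlined with comments:

```shell
# Build small copies of both files from sample rows shown in the question.
tmp=$(mktemp -d)
cat > "$tmp/output11.bim" <<'EOF'
1 1:775852:T:C 0 775852 T C
1 1:1120590:A:C 0 1120590 C A
1 1:1145994:T:C 0 1145994 C T
EOF
cat > "$tmp/marker-info" <<'EOF'
1,742429,SNP_A-1909444,ss66079302,rs3094315,36.2,G,A,C,T,A,GCACAGCAAGAGAAAC[A/G]TTTGACAGAGAATACA,Sty,+,-,y,,,127,phs000018
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
1,1120590,SNP_A-2205441,ss66174584,rs4245756,36.2,C,T,C,T,A,CCAGTGCTTTCAACCA[C/T]ACTCACTTTTCACTGT,Sty,+,+,n,,,127,phs000018
EOF

result=$(awk '
    # First file (marker-info, FS=","): store column 5 keyed by columns 1 and 2.
    NR==FNR { map[$1,$2]=$5; next }
    # Second file (output11.bim, FS=" "): when columns 1 and 4 form a known key,
    # replace column 2 with the stored value and print the rebuilt line.
    ($1,$4) in map { $2=map[$1,$4]; print }
' FS=',' "$tmp/marker-info" FS=' ' "$tmp/output11.bim")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The `FS=','` and `FS=' '` arguments are variable assignments that awk applies just before it opens the file that follows each one, which is why a single script can read both formats.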
Or, if you would rather set FS to the two separate values inside the script:
$ cat tst.awk
BEGIN { FS="," }
NR==FNR { map[$1,$2]=$5; next }
FNR==1 { FS=" "; $0=$0 }
($1,$4) in map { $2=map[$1,$4]; print }
$ awk -f tst.awk marker-info output11.bim
1 rs2980300 0 775852 T C
1 rs4245756 0 1120590 C A
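Both scripts print only the lines of output11.bim that have a match in marker-info. If the unmatched lines should be kept unchanged instead (an assumption, since the question only shows matched rows), one possible variant moves `print` into its own unconditional rule. A self-contained sketch with hypothetical `tmp` and `kept` variable names:

```shell
# Two sample rows from the question: the first has a match, the second does not.
tmp=$(mktemp -d)
cat > "$tmp/output11.bim" <<'EOF'
1 1:775852:T:C 0 775852 T C
1 1:1145994:T:C 0 1145994 C T
EOF
cat > "$tmp/marker-info" <<'EOF'
1,775852,SNP_A-1886933,ss66317030,rs2980300,36.2,T,C,A,G,A,GAATGACTGTGTCTCT[C/T]TGAGTTAGTGAAGTCA,Nsp,-,+,y,,,127,phs000018
EOF

kept=$(awk '
    # First file: store column 5 keyed by columns 1 and 2.
    NR==FNR { map[$1,$2]=$5; next }
    # Second file: replace column 2 only when the key is known ...
    ($1,$4) in map { $2=map[$1,$4] }
    # ... and print every line of the second file, changed or not.
    { print }
' FS=',' "$tmp/marker-info" FS=' ' "$tmp/output11.bim")

printf '%s\n' "$kept"
rm -rf "$tmp"
```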