我尝试根据多个匹配列合并两个文件。我想比较文件 1 中的第 1、2、4、5 列与文件 2 中的第 1、2、4、5 列,然后将文件 3 中的匹配行与文件 1 的第 3 列和文件 2 中的所有列合并。我已经尝试了几个 awk 命令。例如:1) awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1$2$4$5]=$3;next} $1$2$4$5 in a{print $0, a[$1$2$4$5]}' file2 file1 > file3
2) awk 'NR==FNR {a[$1$2$4$5] = $3; next} $1$2$4$5 in a' file2 file1 >file3
但我得到的是空输出。有人能帮我解决这个问题吗?谢谢
我的文件示例
文件1:
CHROM,POS,ID,REF,ALT
1,69270,rs201219564,A,G
1,69428,rs140739101,T,G
1,69496,rs150690004,G,A
1,69511,rs75062661,A,G
文件2:
Chr,Start,End,Ref,Alt,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,CADD_raw,CADD_raw_rankscore,CADD_phred
1,69511,69511,A,C,0.031,0.450,D,0.002,0.090,B,0.001,0.040,B,-1.52,0.370,N,2.8350.402,21.5
1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047
1,69511,69511,A,T,1.0,0.010,T,0.0,0.026,B,0.0,0.013,B,2.06,0.004,N,-1.639,0.014,0.002
1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
1,69496,69496,G,C,0.348,0.124,T,0.988,0.604,D,0.865,0.597,P,-2.54,0.550,D,2.6520.380,20.5
1,69496,69496,G,T,0.177,0.222,T,0.999,0.764,D,0.96,0.686,D,-2.69,0.574,D,1.4030.232,12.80
1,69428,69428,T,A,0.878,0.026,T,0.981,0.577,D,0.899,0.620,P,-0.16,0.095,N,0.0240.096,2.821
1,69428,69428,T,C,0.001,0.784,D,0.995,0.657,D,0.969,0.703,D,-5.34,0.846,D,4.5330.607,24.3
1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7
所需输出:
ID,Chr,Start,End,Ref,Alt,SIFT_score,SIFT_converted_rankscore,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,PROVEAN_score,PROVEAN_converted_rankscore,PROVEAN_pred,CADD_raw,CADD_raw_rankscore,CADD_phred
rs140739101,1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7
rs150690004,1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
rs75062661,1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047
答案1
您忽略的最明显的事情是您的文件以逗号分隔,但您使用默认(空格)字段分隔符。
你似乎想要的是
awk -F, '
BEGIN{OFS=FS}
NR==FNR{a[$1 FS $2 FS $4 FS $5] = $0; next}
{ind = $1 FS $2 FS $4 FS $5}
ind in a {print $3, a[ind]}
' file2 file1 > file3
(用 i 分隔字段FS
关联数组键字符串只是为了防止错误匹配;如果您只是连接字段,则无法区分"abcdef"
和"abc""def"
)。
前任。
$ awk -F, 'BEGIN{OFS=FS} NR==FNR{a[$1 FS $2 FS $4 FS $5] = $0; next} {ind = $1 FS $2 FS $4 FS $5} ind in a {print $3, a[ind]}' file2 file1
rs140739101,1,69428,69428,T,G,0.013,0.538,D,0.999,0.764,D,0.984,0.745,D,-5.05,0.827,D,4.1000.550,23.7
rs150690004,1,69496,69496,G,A,0.404,0.103,T,0.401,0.331,B,0.061,0.254,B,-1.39,0.344,N,0.9700.187,10.49
rs75062661,1,69511,69511,A,G,0.652,0.048,T,0.0,0.026,B,0.0,0.013,B,1.54,0.007,N,-0.784,0.041,0.047