我有一个文件A.csv
(字段分隔符 = ,
):
Lane,SampleID,Index,# Reads,# Perfect Index Reads,# One Mismatch Index Reads,# of >= Q30 Bases (PF),Mean Quality Score (PF)
1,4331099,AACCATAGAA-CCATCTCGCC,301713,291343,10370,10159753,33.99
1,4360506,GCCGCACTCT-CGAGGTCGGA,271546,261268,10278,9107606,33.91
2,4331099,AACCATAGAA-CCATCTCGCC,297033,286679,10354,10036901,34.06
2,4360506,GCCGCACTCT-CGAGGTCGGA,266222,255838,10384,8954558,33.96
3,4331099,AACCATAGAA-CCATCTCGCC,293499,281393,12106,9838039,33.89
3,4360506,GCCGCACTCT-CGAGGTCGGA,262214,249940,12274,8741308,33.78
4,4331099,AACCATAGAA-CCATCTCGCC,289896,277519,12377,9735725,33.93
4,4360506,GCCGCACTCT-CGAGGTCGGA,255337,243245,12092,8530109,33.82
我想根据 来连接该行SampleID
,打印该Index
列并对其他列求和 ( # Reads,# Perfect Index Reads,# One Mismatch Index Reads,# of >= Q30 Bases (PF),Mean Quality Score (PF)
) 并删除该列Lane
。并将结果放入另一个文件中,如下所示:
SampleID,Index,# Reads,# Perfect Index Reads,# One Mismatch Index Reads,# of >= Q30 Bases (PF),Mean Quality Score (PF)
4331099,AACCATAGAA-CCATCTCGCC,1182141,1136934,45207,39770418,135.87
4360506,GCCGCACTCT-CGAGGTCGGA,1055319,1010291,45028,35333581,135.47
你有一些awk
命令(或其他)来做到这一点吗?
我已经做了一些这样的测试,但还没有开始:
awk '{
if(NR!=2){a[$1]=$2";"a[$1]}
else print $0}
END{
n = asorti(a, b);
for (n in b) {
print b[n],a[b[n]]
}
}'
提前致谢
答案1
gawk -F, -v s=4 '
NR>1{ for(c=s; c<=NF; c++) sum_[c][$2 FS $3] +=$c; next }
{ sub(/^[^,]*,/ ,""); print }
END{ for(r in sum_[s]){
printf ("%s", r);
for(i=s; i<=NF; i++) printf (",%s", sum_[i][r]);
print ""
};
}' infile
保留订单作为原始文件:
gawk -F, -v s=5 '
NR>1{
if(!seen[$3 FS $4]) seen[$3 FS $4]=$1;
for(c=s; c<=NF; c++) sum_[c][$3 FS $4] +=$c; next
}
{ sub(/^([^,]*,){2}/ ,""); print "1 "$0 }
END{ for(r in sum_[s]){
printf ("%s %s", seen[r], r);
for(i=s; i<=NF; i++) printf (",%s", sum_[i][r]);
print ""
};
}' <(nl -s ',' -w 1 infile) |sort |cut -d' ' -f2-
或者使用datamash
您可能更喜欢的 GNU:
datamash -H -t, -s -g2,3 sum 4-8 <infile