我正在处理由多个 CSV 连接(通过 cat)生成的 csv:
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
1000, lig40, 1, 0.805136, -5.5200, 79
1000, lig868, 1, 0.933209, -5.6100, 42
1000, lig278, 1, 0.933689, -5.7600, 40
1000, lig619, 3, 0.946354, -7.6100, 20
1000, lig211, 1, 0.960048, -5.2800, 39
1000, lig40, 2, 0.971051, -4.9900, 40
1000, lig868, 3, 0.986384, -5.5000, 29
1000, lig12, 3, 0.988506, -6.7100, 16
1000, lig800, 16, 0.995574, -4.5300, 40
1000, lig800, 1, 0.999935, -5.7900, 22
1000, lig619, 1, 1.00876, -7.9000, 3
1000, lig619, 2, 1.02254, -7.6400, 1
1000, lig12, 1, 1.02723, -6.8600, 5
1000, lig12, 2, 1.03273, -6.8100, 4
1000, lig211, 2, 1.03722, -5.2000, 19
1000, lig211, 3, 1.03738, -5.0400, 21
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V1, lig40, 1, 0.513472, -6.4600, 150
10V1, lig211, 2, 0.695981, -6.8200, 91
10V1, lig278, 1, 0.764432, -7.0900, 70
10V1, lig868, 1, 0.787698, -7.3100, 62
10V1, lig211, 1, 0.83416, -6.8800, 54
10V1, lig868, 3, 0.888408, -6.4700, 44
10V1, lig278, 2, 0.915932, -6.6600, 35
10V1, lig12, 1, 0.922741, -9.3600, 19
10V1, lig12, 8, 0.934144, -7.4600, 24
10V1, lig40, 2, 0.949955, -5.9000, 34
10V1, lig800, 5, 0.964194, -5.9200, 30
10V1, lig868, 2, 0.966243, -6.9100, 20
10V1, lig12, 2, 0.972575, -8.3000, 10
10V1, lig619, 6, 0.979168, -8.1600, 9
10V1, lig619, 4, 0.986202, -8.7800, 5
10V1, lig800, 2, 0.989599, -6.2400, 20
10V1, lig619, 1, 0.989725, -9.2900, 3
10V1, lig12, 7, 0.991535, -7.5800, 9
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V2, lig40, 1, 0.525767, -6.4600, 146
10V2, lig211, 2, 0.744702, -6.8200, 78
10V2, lig278, 1, 0.749015, -7.0900, 74
10V2, lig868, 1, 0.772025, -7.3100, 66
10V2, lig211, 1, 0.799829, -6.8700, 63
10V2, lig12, 1, 0.899345, -9.1600, 25
10V2, lig12, 4, 0.899606, -7.5500, 32
10V2, lig868, 3, 0.903364, -6.4800, 40
10V2, lig278, 3, 0.913145, -6.6300, 36
10V2, lig800, 5, 0.94576, -5.9100, 35
要对此 CSV 进行后处理,我需要 1)删除标题行的重复部分
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
仅将标题保留在融合 csv 的开头(在第一行!)
然后我需要根据第 4 列(dG(rescored))中的数字对所有行(忽略第 1 行标题)进行排序。
为了完成第一个任务,我尝试使用以下 awk 单行命令,它查找第一行,然后删除其重复项
awk '{first=$1;gsub("ID(Prot)","");print first,$0}' mycsv.csv > csv_without_repeats.csv
但是它无法识别标题行,这意味着模式定义不正确。
然后根据第 4 列的值对数据进行排序,我使用了 sort:
LC_ALL=C sort -k4,4g
如何将其通过管道传输到我的 AWK 代码或以其他方式直接由 AWK 完成所有工作?
我试过
awk '{first=$1;gsub(/ID(Prot)?(\([-azA-Z]+\))?/,"");print first,$0}' | LC_ALL=C sort -k4,4g input.csv > sorted_and_without_repeats.csv
但是脚本可以被终止,同时正确生成排序后的 CSV(由于 awk 部分的问题,仍然有重复)。
答案1
使用 GNU awk 的一种方法如下:
解析.awk
BEGIN {
# Arrays should be sorted numerically by their index
PROCINFO["sorted_in"] = "@ind_num_asc"
# Set field-separator to comma followed by optional space
FS = ",[ \t]*"
}
# Print the header
NR==1 { print; next }
# Collect lines into the `h` hashmap
NR>1 && $1 !~ /^ID/ {
h[$4] = $0
}
# Print the sorted hashmap `h`
END {
for(k in h) print h[k]
}
像这样运行:
awk -f parse.awk infile
输出:
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V1, lig40, 1, 0.513472, -6.4600, 150
10V2, lig40, 1, 0.525767, -6.4600, 146
10V1, lig211, 2, 0.695981, -6.8200, 91
10V2, lig211, 2, 0.744702, -6.8200, 78
10V2, lig278, 1, 0.749015, -7.0900, 74
10V1, lig278, 1, 0.764432, -7.0900, 70
10V2, lig868, 1, 0.772025, -7.3100, 66
10V1, lig868, 1, 0.787698, -7.3100, 62
10V2, lig211, 1, 0.799829, -6.8700, 63
1000, lig40, 1, 0.805136, -5.5200, 79
10V1, lig211, 1, 0.83416, -6.8800, 54
10V1, lig868, 3, 0.888408, -6.4700, 44
10V2, lig12, 1, 0.899345, -9.1600, 25
10V2, lig12, 4, 0.899606, -7.5500, 32
10V2, lig868, 3, 0.903364, -6.4800, 40
10V2, lig278, 3, 0.913145, -6.6300, 36
10V1, lig278, 2, 0.915932, -6.6600, 35
10V1, lig12, 1, 0.922741, -9.3600, 19
1000, lig868, 1, 0.933209, -5.6100, 42
1000, lig278, 1, 0.933689, -5.7600, 40
10V1, lig12, 8, 0.934144, -7.4600, 24
10V2, lig800, 5, 0.94576, -5.9100, 35
1000, lig619, 3, 0.946354, -7.6100, 20
10V1, lig40, 2, 0.949955, -5.9000, 34
1000, lig211, 1, 0.960048, -5.2800, 39
10V1, lig800, 5, 0.964194, -5.9200, 30
10V1, lig868, 2, 0.966243, -6.9100, 20
1000, lig40, 2, 0.971051, -4.9900, 40
10V1, lig12, 2, 0.972575, -8.3000, 10
10V1, lig619, 6, 0.979168, -8.1600, 9
10V1, lig619, 4, 0.986202, -8.7800, 5
1000, lig868, 3, 0.986384, -5.5000, 29
1000, lig12, 3, 0.988506, -6.7100, 16
10V1, lig800, 2, 0.989599, -6.2400, 20
10V1, lig619, 1, 0.989725, -9.2900, 3
10V1, lig12, 7, 0.991535, -7.5800, 9
1000, lig800, 16, 0.995574, -4.5300, 40
1000, lig800, 1, 0.999935, -5.7900, 22
1000, lig619, 1, 1.00876, -7.9000, 3
1000, lig619, 2, 1.02254, -7.6400, 1
1000, lig12, 1, 1.02723, -6.8600, 5
1000, lig12, 2, 1.03273, -6.8100, 4
1000, lig211, 2, 1.03722, -5.2000, 19
1000, lig211, 3, 1.03738, -5.0400, 21