I am working on the analysis of multi-column data files (this is a 10-row example, but the real logs will contain 150 rows!) in the following format:
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
9000, lig662, 1, 0.421573, -7.8400, 153
10V2, lig662, 1, 0.42692, -8.0300, 149
3000, lig158, 1, 0.427342, -8.1900, 147
3000, lig158, 1, 0.427342, -8.1900, 147
10V2, lig342, 1, 0.432943, -9.4200, 137
10V1, lig807, 1, 0.434338, -8.0300, 147
10V2, lig369, 1, 0.440377, -7.3200, 156
10V1, lig342, 1, 0.441205, -9.4200, 135
10V1, lig369, 1, 0.465029, -7.3600, 148
10V1, lig158, 1, 0.504513, -7.4800, 135
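From this data, I need to focus on the indices in the first column (e.g. 9000, 10V1 or 3000) and in the second column (e.g. ligXXX). Specifically, I need to print the top three indices from each of the two columns, together with the number of times each one occurs across all rows of the CSV (thus indicating the most frequent indices in both columns):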
TOP PROT; TOP LIG
10V1 (number of cases: 4), lig158 (number of cases: 3)
10V2 (number of cases: 3), lig662 (number of cases: 2)
3000 (number of cases: 2), lig369 (number of cases: 2)
AWK can be applied directly to extract a selected column, whose values can then be sorted and counted, for example:
awk '{print $1}' file.csv | sort | uniq -c
I need to develop this idea so that it covers both columns and ranks the results by number of occurrences.
Answer 1
Using GNU awk:
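gawk -F',[[:blank:]]+' -v N=3 '
    # Count occurrences per column, skipping the header line.
    NR > 1 {
        count["prot"][$1]++
        count["lig"][$2]++
    }
    # Print the N most frequent ids for one column; n and id are locals.
    function show(thing,    n, id) {
        print "TOP " toupper(thing)
        n = N
        for (id in count[thing]) {
            printf "%s (number of cases: %d)\n", id, count[thing][id]
            if (--n == 0) break
        }
    }
    END {
        # Make for-in loops visit elements by numeric value, largest first.
        PROCINFO["sorted_in"] = "@val_num_desc"
        show("prot")
        show("lig")
    }
' file.csv | pr -2Ts$'\t' | sed 's/\t/, /'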
TOP PROT, TOP LIG
10V1 (number of cases: 4), lig158 (number of cases: 3)
10V2 (number of cases: 3), lig662 (number of cases: 2)
3000 (number of cases: 2), lig369 (number of cases: 2)
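The post-processing step is easy to overlook: the awk program prints the two lists one after the other, and `pr` then folds that single stream into two side-by-side columns before `sed` rewrites the separator. A minimal sketch of just that step, using placeholder lines rather than real data, and a visible `|` separator instead of a tab (an assumption made here so the example does not depend on the shell's `$'\t'` quoting):

```shell
# Eight placeholder lines shaped like the awk output: a header plus
# three entries for each of two lists.
# pr -2 prints them down two balanced columns, -t drops the page header,
# and -s'|' joins the columns with "|" instead of padding with spaces.
# sed then turns that separator into ", ".
folded=$(printf '%s\n' 'TOP A' a1 a2 a3 'TOP B' b1 b2 b3 |
    pr -2 -t -s'|' |
    sed 's/|/, /')
printf '%s\n' "$folded"
```

Because `pr` splits the stream into two halves of equal length, the columns only line up when both lists have the same number of lines, as they do here with N entries plus a header each.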
Answer 2
Using GNU awk for arrays of arrays and sorted_in:
$ cat tst.awk
BEGIN { FS=", *"; OFS=", " }
NR > 1 {
    cnts[1][$1]++
    cnts[2][$2]++
}
END {
    numRows = 3
    numCols = 2
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (colNr=1; colNr<=numCols; colNr++) {
        rowNr = 0
        for (key in cnts[colNr]) {
            vals[++rowNr][colNr] = sprintf("%s (number of cases: %d)", key, cnts[colNr][key])
        }
    }
    print "TOP PROT", "TOP LIG"
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", vals[rowNr][colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
TOP PROT, TOP LIG
10V1 (number of cases: 4), lig158 (number of cases: 3)
10V2 (number of cases: 3), lig662 (number of cases: 2)
3000 (number of cases: 2), lig369 (number of cases: 2)
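Arrays of arrays and `PROCINFO["sorted_in"]` are GNU awk extensions, so where gawk is not available the `sort | uniq -c` idea from the question can be extended instead. The following is only a sketch under stated assumptions: the helper name `top_n` and the scratch file are hypothetical, the separator is assumed to be a comma followed by optional spaces, and the first line is assumed to be a header.

```shell
# Sample data from the question, written to a scratch file.
csv=$(mktemp)
cat > "$csv" <<'EOF'
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
9000, lig662, 1, 0.421573, -7.8400, 153
10V2, lig662, 1, 0.42692, -8.0300, 149
3000, lig158, 1, 0.427342, -8.1900, 147
3000, lig158, 1, 0.427342, -8.1900, 147
10V2, lig342, 1, 0.432943, -9.4200, 137
10V1, lig807, 1, 0.434338, -8.0300, 147
10V2, lig369, 1, 0.440377, -7.3200, 156
10V1, lig342, 1, 0.441205, -9.4200, 135
10V1, lig369, 1, 0.465029, -7.3600, 148
10V1, lig158, 1, 0.504513, -7.4800, 135
EOF

# top_n COLUMN N FILE (hypothetical helper): print the N most frequent
# values of COLUMN, skipping the header line.
# Ties are broken by sorting the value itself in ascending order.
top_n() {
    awk -F', *' -v col="$1" 'NR > 1 { print $col }' "$3" |
        sort | uniq -c |
        sort -k1,1nr -k2,2 |
        head -n "$2" |
        awk '{ printf "%s (number of cases: %d)\n", $2, $1 }'
}

echo "TOP PROT"
top_n 1 3 "$csv"
echo "TOP LIG"
top_n 2 3 "$csv"
rm -f "$csv"
```

In the sample data three ligands (lig342, lig369 and lig662) are tied at two cases each, so which two of them make the top three depends on how ties are broken. This sketch lists lig342 and lig369, whereas the gawk outputs above show lig662 and lig369; both are valid top-three lists for this input. The two lists are printed one after the other and can be folded into two columns with `pr` as in Answer 1.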