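I have a file that reports each occurrence of something along with its size. I am trying to summarize that information by merging consecutive events, which can be identified by column 3 of one line being the same as column 2 of the line below it (as long as column 1 is the same). The output should be in the same format, but instead of multiple lines for any consecutive run, I would replace them with a single line in which column 2 is the lowest number in that column within the group, column 3 is the highest number in that column within the group, and column 4 is the mean of all the numbers in column 4 of the group (rounded to the nearest integer).
To be clear, the first column is the specific group, the second column is the start position, the third column is the end position, and the fourth column is the number of occurrences.
Each row's combination of columns is unique, and the rows are pre-sorted by column 1, then column 2, then column 3. Numbers in column 2 should not/will not repeat, and numbers in column 3 should not/will not repeat. The first column can, and usually will, repeat.
I am trying to do this in awk if possible; I have tried but failed. This is what I tried: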
awk 'BEGIN {OFS=","} NR==1 {print} NR>1 {if ($1==prev && $2==end+1) {sum+=$4; count++; end=$3} else {if (NR>2) {print prev, start, end, int(sum/count+0.5);}; prev=$1; start=$2; end=$3; sum=$4; count=1}} END {print prev, start, end, int(sum/count+0.5)}'
###### reformatted via "awk -o- '...script_body...'"
awk '
BEGIN {
    OFS = ","
}
NR == 1 {
    print
}
NR > 1 {
    if ($1 == prev && $2 == end + 1) {
        sum += $4
        count++
        end = $3
    } else {
        if (NR > 2) {
            print prev, start, end, int(sum / count + 0.5)
        }
        prev = $1
        start = $2
        end = $3
        sum = $4
        count = 1
    }
}
END {
    print prev, start, end, int(sum / count + 0.5)
}'
Sample input:
fgh1,45513382,45513383,43
fgh1,45513383,45513384,44
fgh1,45513384,45513385,44
fgh1,45513385,45513386,43
fgh1,45513386,45513387,43
fgh1,45513387,45513388,44
fgh2,63543512,63543513,44
fgh2,63543513,63543514,41
fgh2,63543514,63543515,44
fgh2,63543515,63543516,44
Sample output:
fgh1,45513382,45513388,44
fgh2,63543512,63543516,43
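Answer 1
Adding a couple of lines of input where column 1 is the same but there is a gap between the third column and the next line's second column: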
$ cat raw.dat
fgh1,45513382,45513383,43
fgh1,45513383,45513384,44
fgh1,45513384,45513385,44
fgh1,45513385,45513386,43
fgh1,45513386,45513387,43
fgh1,45513387,45513388,44
fgh2,63543512,63543513,44
fgh2,63543513,63543514,41
fgh2,63543514,63543515,44
fgh2,63543515,63543516,44 # 3rd column (current line) does not match ...
fgh2,63543524,63543525,20 # 2nd column (next line)
fgh2,63543525,63543526,60
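Note: it's not clear (to me) from OP's description and sample data whether this scenario can occur; if it cannot, then the proposed code should still work (and the || $2 != outcols[3] test could be removed).
General approach:
- keep track of the next set of output columns in an array
- when the conditions do not match, print the array, clear the array, and start populating it with the next line of output
One awk idea: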
awk '
BEGIN { FS = OFS = "," }
function print_line () {
    if (outcols[1]) {                        # if we have something in outcols[1] then ...
        for (i=1;i<=3;i++)                   # loop through 1st three columns and ...
            printf "%s%s", outcols[i], OFS   # print to stdout then ...
        printf "%.0f\n", (sum / count)       # calculate/print average; let printf/.0f do the rounding
    }
    delete outcols                           # clear array
    sum = count = 0                          # reset counters
}
($1 != outcols[1]) ||
($2 != outcols[3]) { print_line() }
{ if (! outcols[1]) {                        # if nothing in outcols[1] then initialize 1st two output columns ...
      outcols[1] = $1
      outcols[2] = $2
  }
  outcols[3] = $3
  sum += $4
  count++
}
END { print_line() }                         # flush last line to stdout
' raw.dat
This generates:
fgh1,45513382,45513388,44
fgh2,63543512,63543516,43
fgh2,63543524,63543526,40
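Answer 2
I appreciate everyone's attempts; some were very close, and they helped me rethink the problem and put together my own answer.
The following solution appears to work, even on the additional input I tried: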
awk 'BEGIN {
    FS = ","
    OFS = ","
}
{
    if (NR == 1) {
        group = $1
        start = $2
        end = $3
        sum = $4
        count = 1
    } else if ($1 == group && $2 == end) {
        end = $3
        sum += $4
        count++
    } else {
        printf("%s,%d,%d,%d\n", group, start, end, int((sum + count / 2) / count))
        group = $1
        start = $2
        end = $3
        sum = $4
        count = 1
    }
}
END {
    printf("%s,%d,%d,%d\n", group, start, end, int((sum + count / 2) / count))
}'
Original question input:
fgh1,45513382,45513383,43
fgh1,45513383,45513384,44
fgh1,45513384,45513385,44
fgh1,45513385,45513386,43
fgh1,45513386,45513387,43
fgh1,45513387,45513388,44
fgh2,63543512,63543513,44
fgh2,63543513,63543514,41
fgh2,63543514,63543515,44
fgh2,63543515,63543516,44
Original question output:
fgh1,45513382,45513388,44
fgh2,63543512,63543516,43
Input 2:
fgh1,45513382,45513383,43
fgh1,45513383,45513384,44
fgh1,45513384,45513385,44
fgh1,45513385,45513386,43
fgh1,45513386,45513387,43
fgh1,45513387,45513388,44
fgh2,63543512,63543513,44
fgh2,63543513,63543514,41
fgh2,63543514,63543515,44
fgh2,63543515,63543516,44
fgh2,63543524,63543525,20
fgh2,63543525,63543526,60
Output 2:
fgh1,45513382,45513388,44
fgh2,63543512,63543516,43
fgh2,63543524,63543526,40
Answer 3
Using any awk and storing only one line's values in memory at a time:
$ cat tst.awk
BEGIN { FS=OFS="," }
($1 != prev[1]) || ($2 != prev[3]) {
    prt()
    beg = $2
    cnt = sum = 0
}
{
    end = $3
    sum += $4
    cnt ++
    split($0,prev)
}
END { prt() }
function prt() {
    if ( cnt ) {
        print prev[1], beg, end, int( (sum / cnt) + 0.5 )
    }
}
$ awk -f tst.awk orig_input
fgh1,45513382,45513388,44
fgh2,63543512,63543516,43
$ awk -f tst.awk input2
fgh1,45513382,45513388,44
fgh2,63543512,63543516,43
fgh2,63543524,63543526,40
The above assumes you want to round .5s up.
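As a small illustration of that rounding assumption, here is a sketch comparing int(x + 0.5) with printf "%.0f". The helper name round_half_up is made up for this example. The first two inputs are the means from the sample data (261/6 and 173/4), and 42.5 is an extra exact tie that does not occur in the sample.

```shell
# Hypothetical helper (not from the answers above): round each non-negative
# input number half-up, the same way int(x + 0.5) does in the scripts.
round_half_up() {
  awk '{ print int($1 + 0.5) }'
}

# Means from the sample data, plus an exact tie at 42.5.
printf '43.5\n43.25\n42.5\n' | round_half_up

# For comparison only: printf "%.0f" leaves ties to the underlying C library,
# so its result for 42.5 may differ from the half-up result on some platforms.
awk 'BEGIN { printf "%.0f\n", 42.5 }'
```

On the sample data the two approaches should agree, because neither mean is an exact tie that the two methods would treat differently; the choice only matters when an average lands exactly on .5.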
Answer 4
Untested, but my solution would look something like...
BEGIN { FS = "," }    # input is comma-separated
{
    AGG=$2 "," $3;
    if ( AGG == PREVAGG ) {
        TOT+=$4;
        COUNT+=1;
    } else {
        if (PREVAGG) {
            print PREVAGG "," int(TOT/COUNT);
        }
        TOT=$4;
        COUNT=1;
        PREVAGG=AGG;
    }
}
END { print PREVAGG "," int(TOT/COUNT); }
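"If possible, I'm trying to sort this out in awk"
Do you mean the order of the records in the input stream? If so, awk is not particularly good at that. Try sort -t, -k1,1 -k2,2n -k3,3n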
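A minimal sketch of that kind of pre-sort, assuming the records are comma-separated and should be ordered by column 1 as text, then by columns 2 and 3 numerically. The function name sort_records is hypothetical; -t and -k are standard sort options.

```shell
# Hypothetical helper: sort CSV records by column 1 (text),
# then by column 2 and column 3 (numeric).
sort_records() {
  sort -t, -k1,1 -k2,2n -k3,3n
}

# A few shuffled lines taken from the sample input.
printf '%s\n' \
  'fgh2,63543513,63543514,41' \
  'fgh1,45513383,45513384,44' \
  'fgh2,63543512,63543513,44' \
  'fgh1,45513382,45513383,43' | sort_records
```

The sorted stream can then be piped straight into whichever awk script does the merging.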
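It looks like your code is handling a header line in the file, but that is not in your expected output. I have ignored this.
How it works
AGG=$2 "," $3
- When $2 and $3 change, we output and reset the accumulated data. Treating them as a composite entity saves some code and reduces complexity. It also makes it trivial to take other columns into account (you did not specify how values in $1 should be handled).
if ( AGG == PREVAGG ) {
- Is the label we are aggregating on in this input line the same as the one in the previous input line?
- If yes, keep accumulating the data
- If not, output the aggregated data and reset the accumulators and the last-line label, but only output if this is not the first input record
After the script reaches the last record there is no further change in $2,$3 to trigger output of the data, so the final output is done explicitly in the END block.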
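The END-block flush can be seen in isolation with a stripped-down sketch. This is not the script above: run_lengths is a hypothetical helper that only counts consecutive identical keys in column 1, and it assumes the keys are non-empty.

```shell
# Hypothetical helper: print each run of identical consecutive keys
# (column 1) together with the length of that run.
run_lengths() {
  awk '
    $1 != key { if (NR > 1) print key, n   # key changed: emit the finished run
                key = $1; n = 0 }
    { n++ }
    END { if (NR > 0) print key, n }       # nothing follows the last run, so flush it here
  '
}

printf 'a\na\nb\n' | run_lengths
```

Without the END rule the final run would never be printed, since output is otherwise triggered only by a change of key.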