如何根据时间间隔对变量进行分组?

如何根据时间间隔对变量进行分组?

我的fb.csv文件如下:

"Source","Time"  
"192.168.137.174","120025"
"10.0.138.163","120525"
"157.240.10.13","121036"
"157.240.10.13","122536"
"157.240.10.23","123041"
"157.240.10.23","123241"
"10.0.138.163","123352"
"192.168.137.174","123952"
"157.240.10.18","124152"
"157.240.10.18","124252"
"157.240.10.23","125653"
"157.240.10.23","130053"
"192.168.137.174","130102"
"10.0.138.163","130302"
"192.168.137.174","131007"
"192.168.137.174","131352"
"157.240.10.18","132552"
"157.240.10.18","132752"
"157.240.10.23","132953"
"157.240.10.23","133253"
"192.168.137.174","133502"
"10.0.138.163","134002"
"192.168.137.174","134507"
"192.168.137.174","135752"
"157.240.10.18","140052"
"157.240.10.18","140552"
"157.240.10.23","140653"
"157.240.10.23","141053"
"192.168.137.174","141402"
"10.0.138.163","141702"
"192.168.137.174","142707"

我想按每 003000(30 分钟)的时间间隔对“来源”进行分组(计数)。

期望输出示例:

"Time Interval","Count of Sources"
"120000","4"
"123000","7"
"130000","8"
"133000","5"
"140000","7"

这个问题有解决办法吗?谢谢。

答案1

我们只需使用以下方法即可完成此操作awk

awk -F, 'BEGIN{print"\"Time Interval\",\"Count of Sources\""}NR>1{gsub(/"/,"",$2);h=int($2/10000)*10000;m=int(($2-h)/3000)*3000;ctr[h+m]++}END{n=asorti(ctr,idx);for(i=1;i<=n;i++){print "\""idx[i]"\",\""ctr[idx[i]]"\""}}' fb.csv

对于给定的输入文件,fb.csv这将导致输出

"120000","4"
"123000","7"
"130000","8"
"133000","5"
"140000","7"

重要的:这需要 GNU AWK ( gawk) 才能运行,因为它使用asorti(...)函数按索引对关联数组进行排序。它不适用于。您可以使用mawk找出您的默认版本。awkawk -Wv


命令解释:

我们在这里像这样运行awk,设置分隔列的字段分隔符,并使用文件fb.csv作为输入:

awk -F, '<COMMAND>' fb.csv

经过正确格式化后,命令awk<COMMAND>上面的占位符)如下:

BEGIN {
    print "\"Time Interval\",\"Count of Sources\""
}
NR>1 {
    gsub(/"/, "", $2)
    h = int($2 / 10000) * 10000
    m = int(($2-h) / 3000) * 3000
    ctr[h+m]++
}
END {
    n = asorti(ctr, idx)
    for(i=1; i<=n; i++) {
        print "\"" idx[i] "\",\"" ctr[idx[i]] "\""
    }
}

这看起来非常复杂(我不能否认它需要一些思考才能理解),所以我将尝试将其分解一下:

在读取文件的第一行输入之前,代码BEGIN { ... }块将执行一次。然后,对于除第一行(“行号大于 1”)之外的每一行,NR>1 { ... }执行该块。最后,在读取所有输入后,END { ... }将运行该块。

  • 现在该BEGIN块非常简单,它只打印新的 CSV 标题行。

  • 让我们看一下NR>1块。请记住,awk将每行拆分为字段,这些字段由字段分隔符(我们使用,参数设置的分隔符-F)分隔。第一列/字段将存储在变量中$1,第二列/字段将存储在变量中$2,依此类推。我们只对第二个字段的值感兴趣,该字段包含时间。

    使用该gsub(<PATTERN>, <REPLACEMENT>, <VARIABLE>)函数,我们将所有出现的<PATTERN>(用斜杠括起来的正则表达式,这里它只匹配引号)替换为<REPLACEMENT>(空的,因为我们想删除它们)字符串<VARIABLE>$2即此处包含时间的第二个字段)。

    接下来,我们将时间戳解码为整小时h(乘以 10000)和整半小时m(不带整小时;乘以 3000)。我们使用关联数组ctr作为四舍五入时间戳h+m在输入中出现频率的计数器。

  • 最后END,我们打印出计数器的值已排序通过四舍五入的时间戳索引。

答案2

我确信还有更优雅的方式,但我的建议是创建一个可执行文件,我们称之为counter.bash,其脚本内容如下:

#!/bin/bash

echo '"Time Interval","Count of Sources"'
FILTRED=$(tail -n +2 "$1" | sed -e 's/^.*\,//' -e 's/"//g' | sort)
T=1

while [ $T -lt 24 ]; do

        ((T++)); R1=0; R2=0

        for i in $FILTRED; do

                hour=${i::-4}; minute=${i:2:-2}

                if [ "$T" -lt "10" ]; then TT="0${T}"; else TT="${T}"; fi

                if [ "$minute" -lt "30" ]; then
                        if [ "$hour" == "$TT" ]; then ((R1++)); fi
                else
                        if [ "$hour" == "$TT" ]; then ((R2++)); fi
                fi
        done

        if [ "$R1" -ne "0" ]; then echo "\"${TT}0000\",\"$R1\""; fi
        if [ "$R2" -ne "0" ]; then echo "\"${TT}3000\",\"$R2\""; fi
done

然后运行:

./counter.bash fb.csv

如果结果足够,则将输出重定向到新文件:

./counter.bash fb.csv > fb.counted.csv

我比较了 Byte Commander 的答案和此脚本的性能,两者都应用于同一个不太大的文件。它们根本无法比较:

$ cat fb.csv | wc -l
3304


$ time awk -F, 'BEGIN{print"\"Time Interval\",\"Count of Sources\""}NR>1{gsub(/"/,"",$2);h=int($2/10000)*10000;m=int(($2-h)/3000)*3000;ctr[h+m]++}END{n=asorti(ctr,idx);for(i=1;i<=n;i++){print "\""idx[i]"\",\""ctr[idx[i]]"\""}}' fb.csv

"Time Interval","Count of Sources"
"120000","672"
"123000","672"
"130000","560"
"133000","560"
"140000","839"

real    0m0.017s
user    0m0.012s
sys     0m0.000s


$ time ./counter.bash fb.csv

"Time Interval","Count of Sources"
"120000","672"
"123000","672"
"130000","560"
"133000","560"
"140000","839"

real    0m2.374s
user    0m2.368s
sys     0m0.000s

当文件非常大时,我的脚本会在很长一段时间后崩溃(在不同的行上):

$ cat fb.csv | wc -l
9303745


$ time awk -F, 'BEGIN{print"\"Time Interval\",\"Count of Sources\""}NR>1{gsub(/"/,"",$2);h=int($2/10000)*10000;m=int(($2-h)/3000)*3000;ctr[h+m]++}END{n=asorti(ctr,idx);for(i=1;i<=n;i++){print "\""idx[i]"\",\""ctr[idx[i]]"\""}}' fb.csv

"Time Interval","Count of Sources"
"120000","1892288"
"123000","1892290"
"130000","1576904"
"133000","1576905"
"140000","2365357"

real    0m23.193s
user    0m23.080s
sys     0m0.096s


$ time ./counter.bash fb.csv

"Time Interval","Count of Sources"
./counter.bash: line 17: [: .0.138.1: integer expression expected
^C
real    2m27.992s
user    2m27.940s
sys     0m1.636s

相关内容