我的fb.csv
文件如下:
"Source","Time"
"192.168.137.174","120025"
"10.0.138.163","120525"
"157.240.10.13","121036"
"157.240.10.13","122536"
"157.240.10.23","123041"
"157.240.10.23","123241"
"10.0.138.163","123352"
"192.168.137.174","123952"
"157.240.10.18","124152"
"157.240.10.18","124252"
"157.240.10.23","125653"
"157.240.10.23","130053"
"192.168.137.174","130102"
"10.0.138.163","130302"
"192.168.137.174","131007"
"192.168.137.174","131352"
"157.240.10.18","132552"
"157.240.10.18","132752"
"157.240.10.23","132953"
"157.240.10.23","133253"
"192.168.137.174","133502"
"10.0.138.163","134002"
"192.168.137.174","134507"
"192.168.137.174","135752"
"157.240.10.18","140052"
"157.240.10.18","140552"
"157.240.10.23","140653"
"157.240.10.23","141053"
"192.168.137.174","141402"
"10.0.138.163","141702"
"192.168.137.174","142707"
我想按每 003000(30 分钟)的时间间隔对“来源”进行分组(计数)。
期望输出示例:
"Time Interval","Count of Sources"
"120000","4"
"123000","7"
"130000","8"
"133000","5"
"140000","7"
这个问题有解决办法吗?谢谢。
答案1
我们只需使用以下方法即可完成此操作awk
:
awk -F, 'BEGIN{print"\"Time Interval\",\"Count of Sources\""}NR>1{gsub(/"/,"",$2);h=int($2/10000)*10000;m=int(($2-h)/3000)*3000;ctr[h+m]++}END{n=asorti(ctr,idx);for(i=1;i<=n;i++){print "\""idx[i]"\",\""ctr[idx[i]]"\""}}' fb.csv
对于给定的输入文件,fb.csv
这将导致输出
"120000","4"
"123000","7"
"130000","8"
"133000","5"
"140000","7"
重要的:这需要 GNU AWK ( gawk
) 才能运行,因为它使用asorti(...)
函数按索引对关联数组进行排序。它不适用于。您可以使用mawk
找出您的默认版本。awk
awk -Wv
命令解释:
我们在这里像这样运行awk
,设置分隔列的字段分隔符,
并使用文件fb.csv
作为输入:
awk -F, '<COMMAND>' fb.csv
经过正确格式化后,命令awk
(<COMMAND>
上面的占位符)如下:
BEGIN {
print "\"Time Interval\",\"Count of Sources\""
}
NR>1 {
gsub(/"/, "", $2)
h = int($2 / 10000) * 10000
m = int(($2-h) / 3000) * 3000
ctr[h+m]++
}
END {
n = asorti(ctr, idx)
for(i=1; i<=n; i++) {
print "\"" idx[i] "\",\"" ctr[idx[i]] "\""
}
}
这看起来非常复杂(我不能否认它需要一些思考才能理解),所以我将尝试将其分解一下:
在读取文件的第一行输入之前,代码BEGIN { ... }
块将执行一次。然后,对于除第一行(“行号大于 1”)之外的每一行,NR>1 { ... }
执行该块。最后,在读取所有输入后,END { ... }
将运行该块。
现在该
BEGIN
块非常简单,它只打印新的 CSV 标题行。让我们看一下
NR>1
块。请记住,awk
将每行拆分为字段,这些字段由字段分隔符(我们使用,
参数设置的分隔符-F
)分隔。第一列/字段将存储在变量中$1
,第二列/字段将存储在变量中$2
,依此类推。我们只对第二个字段的值感兴趣,该字段包含时间。使用该
gsub(<PATTERN>, <REPLACEMENT>, <VARIABLE>)
函数,我们将所有出现的<PATTERN>
(用斜杠括起来的正则表达式,这里它只匹配引号)替换为<REPLACEMENT>
(空的,因为我们想删除它们)字符串<VARIABLE>
($2
即此处包含时间的第二个字段)。接下来,我们将时间戳解码为整小时
h
(乘以 10000)和整半小时m
(不带整小时;乘以 3000)。我们使用关联数组ctr
作为四舍五入时间戳h+m
在输入中出现频率的计数器。最后
END
,我们打印出计数器的值已排序通过四舍五入的时间戳索引。
答案2
我确信还有更优雅的方式,但我的建议是创建一个可执行文件,我们称之为counter.bash
,其脚本内容如下:
#!/bin/bash
echo '"Time Interval","Count of Sources"'
FILTRED=$(tail -n +2 "$1" | sed -e 's/^.*\,//' -e 's/"//g' | sort)
T=1
while [ $T -lt 24 ]; do
((T++)); R1=0; R2=0
for i in $FILTRED; do
hour=${i::-4}; minute=${i:2:-2}
if [ "$T" -lt "10" ]; then TT="0${T}"; else TT="${T}"; fi
if [ "$minute" -lt "30" ]; then
if [ "$hour" == "$TT" ]; then ((R1++)); fi
else
if [ "$hour" == "$TT" ]; then ((R2++)); fi
fi
done
if [ "$R1" -ne "0" ]; then echo "\"${TT}0000\",\"$R1\""; fi
if [ "$R2" -ne "0" ]; then echo "\"${TT}3000\",\"$R2\""; fi
done
然后运行:
./counter.bash fb.csv
如果结果足够,则将输出重定向到新文件:
./counter.bash fb.csv > fb.counted.csv
我比较了 Byte Commander 的答案和此脚本的性能,两者都应用于同一个不太大的文件。它们根本无法比较:
$ cat fb.csv | wc -l
3304
$ time awk -F, 'BEGIN{print"\"Time Interval\",\"Count of Sources\""}NR>1{gsub(/"/,"",$2);h=int($2/10000)*10000;m=int(($2-h)/3000)*3000;ctr[h+m]++}END{n=asorti(ctr,idx);for(i=1;i<=n;i++){print "\""idx[i]"\",\""ctr[idx[i]]"\""}}' fb.csv
"Time Interval","Count of Sources"
"120000","672"
"123000","672"
"130000","560"
"133000","560"
"140000","839"
real 0m0.017s
user 0m0.012s
sys 0m0.000s
$ time ./counter.bash fb.csv
"Time Interval","Count of Sources"
"120000","672"
"123000","672"
"130000","560"
"133000","560"
"140000","839"
real 0m2.374s
user 0m2.368s
sys 0m0.000s
当文件非常大时,我的脚本会在很长一段时间后崩溃(在不同的行上):
$ cat fb.csv | wc -l
9303745
$ time awk -F, 'BEGIN{print"\"Time Interval\",\"Count of Sources\""}NR>1{gsub(/"/,"",$2);h=int($2/10000)*10000;m=int(($2-h)/3000)*3000;ctr[h+m]++}END{n=asorti(ctr,idx);for(i=1;i<=n;i++){print "\""idx[i]"\",\""ctr[idx[i]]"\""}}' fb.csv
"Time Interval","Count of Sources"
"120000","1892288"
"123000","1892290"
"130000","1576904"
"133000","1576905"
"140000","2365357"
real 0m23.193s
user 0m23.080s
sys 0m0.096s
$ time ./counter.bash fb.csv
"Time Interval","Count of Sources"
./counter.bash: line 17: [: .0.138.1: integer expression expected
^C
real 2m27.992s
user 2m27.940s
sys 0m1.636s