I have multiple gzip-compressed files, each containing millions of lines. In general, each assocId has a single IMPI, IMPU, and MSISDN:
assoc=1
IMPI=XXX
IMPU=YYY
MSISDN=ZZZ
IMSI=PPP
But in some cases, a single assocId can have multiple IMPI, IMPU, IMSI, and MSISDN entries, as shown below:
assocId=2
IMPI=ddd
IMPI=eee
IMPU=fff
IMPU=ggg
IMSI=hhh
IMSI=iii
MSISDN=jjj
MSISDN=kkk
I want to list every assocId that has more than one IMPI, IMPU, IMSI, or MSISDN.
An assocId can have one, two, or more IMPI, IMPU, IMSI, and MSISDN values associated with it.
Please advise.
Answer 1
I created a test source file:
assocId=1
IMPI=XXX
IMPU=YYY
MSISDN=ZZZ
IMSI=PPP
assocId=2
IMPI=ddd
IMPI=eee
IMPU=fff
IMPU=ggg
IMSI=hhh
IMSI=iii
MSISDN=jjj
MSISDN=kkk
assocId=3
IMPI=XXX
IMPU=YYY
MSISDN=ZZZ
IMSI=PPP
assocId=4
IMPI=ddd
IMPI=eee
IMPU=fff
IMPU=ggg
IMSI=hhh
IMSI=iii
MSISDN=jjj
MSISDN=kkk
I then wrote the following GAWK script:
#!/usr/bin/gawk -f
#
# Define the processing for a change of associd.
#
# NB: This function uses the GLOBAL variables:
# IMPI
# IMPU
# IMSI
# MSISDN
#
function new_assoc(assoc, flag) {
    flag = 0
    if (IMPI > 1) flag=1
    if (IMPU > 1) flag=1
    if (IMSI > 1) flag=1
    if (MSISDN > 1) flag=1
    if (flag > 0) printf( "Found a multiple entry: %d\n", assoc )
    IMPI = IMPU = IMSI = MSISDN = 0
}
#
# First thing, set up the field separator.
#
BEGIN {
    FS = "="
}
#
# Every time we hit an assoc line handle the previous one and then
# initialise.
#
/^assocId/ {
    new_assoc( assoc )
    assoc = $2
}
#
# Total up the four entries:
#
/^IMPI/ { IMPI++ }
/^IMPU/ { IMPU++ }
/^IMSI/ { IMSI++ }
/^MSISDN/ { MSISDN++ }
#
# Ensure we process the last assoc on EOF:
#
END {
    new_assoc( assoc )
}
When I run it:
$ ./scan_it <src
Found a multiple entry: 2
Found a multiple entry: 4
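I hope this serves as a basis for what you need to do.
Answer 2
The following awk program outputs the assocId of any record that contains a duplicated key. The code is logically much the same as the code in Martin's answer, but it looks for duplicates of any key in the record.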
BEGIN { FS = "=" }
function validate() {
    # Outputs a message if any key in "keys" is associated
    # with a number greater than 1.
    for (key in keys)
        if (keys[key] > 1) {
            printf "Check assocId=%s\n", id
            break
        }
}
/^assocId=/ {
    # New record.
    # Validate the previous record and delete the count of keys.
    validate()
    id = $2
    delete keys
}
{
    # Increment the counter for this key.
    keys[$1]++
}
END {
    # Validate the last record.
    validate()
}
As a hard-to-read one-liner:
awk -F = 'function v(){for(k in c)if(c[k]>1){printf "Check assocId=%s\n",id;break}}/^assocId=/{v();id=$2;delete c}{c[$1]++}END{v()}'
Run on the same test data that Martin used, it gives the following output:
Check assocId=2
Check assocId=4
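The question mentions gzip-compressed input, which the programs above read as plain text. A minimal sketch of one way to feed several compressed files through the same logic, assuming `gzip -dc` is available to stream the decompressed data to standard output; the function name `scan_gz` is hypothetical and not part of either answer:

```shell
# scan_gz FILE.gz [FILE.gz ...]
# Hypothetical helper: decompress every named file to one stream and
# report each assocId whose record repeats any key.
scan_gz() {
    gzip -dc "$@" | awk -F= '
        # "key" is declared as an extra parameter to keep it local.
        function validate(   key) {
            for (key in keys)
                if (keys[key] > 1) {
                    printf "Check assocId=%s\n", id
                    break
                }
        }
        # A new assocId line closes the previous record.
        /^assocId=/ { validate(); id = $2; delete keys }
        # Count every key seen in the current record.
        { keys[$1]++ }
        # The last record has no following assocId line.
        END { validate() }
    '
}
```

Because the files are concatenated into a single stream, nothing is written to disk, and a record at the end of one file is closed by the first assocId line of the next. Usage would look like `scan_gz part1.gz part2.gz`, where the file names are placeholders.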
Answer 3
Similar to the previous solutions.
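# Report each key that appeared more than once in the record just finished.
# Note: IMSI is not counted here.
function count() {
    if (impi > 1) {
        print associd, "with impi repeated", impi, "times"
    }
    if (impu > 1) {
        print associd, "with impu repeated", impu, "times"
    }
    if (msisdn > 1) {
        print associd, "with msisdn repeated", msisdn, "times"
    }
}
# A new assocId line: report on the previous record, then reset the counters.
/^assocId=/ {
    count()
    impi = 0
    impu = 0
    msisdn = 0
    associd = $0
}
/^IMPI=/ {
    impi += 1
}
/^IMPU=/ {
    impu += 1
}
/^MSISDN=/ {
    msisdn += 1
}
# The last record has no following assocId line, so report it here.
END {
    count()
}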
Running it on the test data from Answer 1 prints:
assocId=2 with impi repeated 2 times
assocId=2 with impu repeated 2 times
assocId=2 with msisdn repeated 2 times
assocId=4 with impi repeated 2 times
assocId=4 with impu repeated 2 times
assocId=4 with msisdn repeated 2 times
However, I wish there were a way to call count only once.
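One possible way around the second call, offered as a sketch rather than as a fix to the code above: report an assocId the moment any key is seen for the second time, so nothing is left to check when the record or the input ends. The function name `report_repeats` and the variables `count` and `reported` are hypothetical. The trade-off is that the total number of repeats is no longer known when the message is printed.

```shell
# report_repeats [FILE ...]
# Hypothetical helper: print each assocId once, as soon as any key
# in its record occurs a second time. No END block is needed.
report_repeats() {
    awk -F= '
        # New record: remember the id and clear the per-record state.
        /^assocId=/ { id = $2; delete count; reported = 0; next }
        # Count this key; on its second occurrence report the record once.
        ++count[$1] == 2 && !reported {
            print "Check assocId=" id
            reported = 1
        }
    ' "$@"
}
```

With the test file from Answer 1 saved as `src`, `report_repeats src` would flag assocId 2 and 4, the same records the other answers report.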