Finding records with duplicate keys

I have several gzip-compressed files, each containing millions of lines. In general, a record has one assocId with one IMPI, IMPU, IMSI, and MSISDN:

assocId=1
IMPI=XXX
IMPU=YYY
MSISDN=ZZZ
IMSI=PPP

But in some cases, a single assocId may have multiple IMPI, IMPU, IMSI, and MSISDN entries, as shown below:

assocId=2
IMPI=ddd
IMPI=eee
IMPU=fff
IMPU=ggg
IMSI=hhh
IMSI=iii
MSISDN=jjj
MSISDN=kkk

I want to list every assocId that has more than one IMPI, IMPU, IMSI, or MSISDN attached.

An assocId can have one, two, or more IMPI, IMPU, IMSI, and MSISDN values associated with it.

Please advise.

Answer 1

I created a test source file:

assocId=1
IMPI=XXX
IMPU=YYY
MSISDN=ZZZ
IMSI=PPP
assocId=2
IMPI=ddd
IMPI=eee
IMPU=fff
IMPU=ggg
IMSI=hhh
IMSI=iii
MSISDN=jjj
MSISDN=kkk
assocId=3
IMPI=XXX
IMPU=YYY
MSISDN=ZZZ
IMSI=PPP
assocId=4
IMPI=ddd
IMPI=eee
IMPU=fff
IMPU=ggg
IMSI=hhh
IMSI=iii
MSISDN=jjj
MSISDN=kkk

I then wrote the following GAWK script:

#!/usr/bin/gawk -f
#
# Define the processing for a change of associd.
#
# NB: This function uses the GLOBAL variables:
#       IMPI
#       IMPU
#       IMSI
#       MSISDN
#
function new_assoc(assoc,     flag) {
        flag = 0
        if (IMPI > 1) flag=1
        if (IMPU > 1) flag=1
        if (IMSI > 1) flag=1
        if (MSISDN > 1) flag=1
        if (flag > 0) printf( "Found a multiple entry: %s\n", assoc )
        IMPI = IMPU = IMSI = MSISDN = 0
}
#
#       First thing, set up the field separator.
#
BEGIN {
        FS = "="
}
#
#       Every time we hit an assoc line handle the previous one and then
#       initialise.
#
/^assocId/ {
        new_assoc( assoc )
        assoc = $2
}
#
#       Total up the four entries:
#
/^IMPI/   { IMPI++   }
/^IMPU/   { IMPU++   }
/^IMSI/   { IMSI++   }
/^MSISDN/ { MSISDN++ }
#
#       Ensure we process the last assoc on EOF:
#
END {
        new_assoc( assoc )
}

When I run it:

$ ./scan_it <src
Found a multiple entry: 2
Found a multiple entry: 4

I hope this serves as a basis for what you need to do.
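
Since the question mentions gzip-compressed input, here is a rough sketch of how the scan could be fed from .gz files without unpacking them to disk first. The file name sample.txt and the temporary directory are made up for the demo, and the awk program is a compact stand-in for the script above rather than the script itself (it assumes the same "key=value" line layout).

```shell
#!/bin/sh
# Sketch only: sample.txt and the temp directory are hypothetical.
tmp=$(mktemp -d)
printf '%s\n' assocId=1 IMPI=XXX IMPU=YYY MSISDN=ZZZ IMSI=PPP \
    assocId=2 IMPI=ddd IMPI=eee IMPU=fff IMSI=hhh MSISDN=jjj > "$tmp/sample.txt"
gzip "$tmp/sample.txt"      # leaves sample.txt.gz in place of sample.txt

# gzip -dc writes the decompressed text to stdout and awk reads it on stdin.
out=$(gzip -dc "$tmp"/*.gz | awk -F= '
    # Report the record just finished, then reset the per-record state.
    function report() {
        if (dup) print "Found a multiple entry: " id
        dup = 0
        split("", n)        # portable way to empty the array
    }
    /^assocId=/ { report(); id = $2; next }
    /^(IMPI|IMPU|IMSI|MSISDN)=/ { if (++n[$1] > 1) dup = 1 }
    END { report() }
')
printf '%s\n' "$out"
rm -rf "$tmp"
```

With several .gz files the decompressed streams are simply concatenated, so this only works as intended if every file starts with an assocId line.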

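Answer 2

The following awk program outputs the assocId of any record that contains a duplicated key. The code is logically much the same as the code in Martin's answer, but it looks for duplicates of any key in a record rather than only the four named ones.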

BEGIN { FS = "=" }

function validate() {
    # Outputs a message if any key in "keys" is associated
    # with a number greater than 1.

    for (key in keys)
        if (keys[key] > 1) {
            printf "Check assocId=%s\n", id
            break
        }
}

/^assocId=/ {
    # New record.
    # Validate the previous record and delete the count of keys.
    validate()
    id = $2
    delete keys
}

{
    # Increment the counter for this key.
    keys[$1]++
}

END {
    # Validate the last record.
    validate()
}

As a hard-to-read one-liner:

awk -F = 'function v(){for(k in c)if(c[k]>1){printf "Check assocId=%s\n",id;break}}/^assocId=/{v();id=$2;delete c}{c[$1]++}END{v()}'

Run on the same test data that Martin used, it gives the following output:

Check assocId=2
Check assocId=4
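
One caveat with counting every $1: if the real data has blank lines inside a record, each blank line is counted under the empty key, and two of them would flag the record. Whether the real files contain such lines is an assumption on my part. A possible guard, sketched below on made-up sample data, is to count only lines that actually split into a key and a value.

```shell
#!/bin/sh
# Hypothetical sample: record 1 contains two blank lines but no real duplicates.
data='assocId=1
IMPI=XXX

IMPU=YYY

assocId=2
IMPI=ddd
IMPI=eee'

out=$(printf '%s\n' "$data" | awk -F= '
    function validate(   key) {
        for (key in keys)
            if (keys[key] > 1) {
                printf "Check assocId=%s\n", id
                break
            }
    }
    /^assocId=/ { validate(); id = $2; split("", keys) }
    NF >= 2 { keys[$1]++ }      # skip blank or malformed lines
    END { validate() }
')
printf '%s\n' "$out"
```

On this sample only assocId=2 is reported. Without the NF test, record 1 would be reported as well because of its two blank lines.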

Answer 3

Similar to the previous solutions.

function count() {
    if (impi > 1) {
        print associd, "with impi repeated ", impi, "times"
    }
    
    if (impu > 1) {
        print associd, "with impu repeated ", impu, "times"
    }

    if (msisdn > 1) {
        print associd, "with msisdn repeated ", msisdn, "times"
    }
}

/assocId/ {
    count()
    impi = 0
    impu = 0
    msisdn = 0
    associd = $0
}

/IMPI/ {
    impi += 1
}

/IMPU/ {
    impu += 1
}

/MSISDN/ {
    msisdn += 1
}

END {
    count()
}

Output:

assocId=2 with impi repeated  2 times
assocId=2 with impu repeated  2 times
assocId=2 with msisdn repeated  2 times
assocId=4 with impi repeated  2 times
assocId=4 with impu repeated  2 times
assocId=4 with msisdn repeated  2 times

I wish there were a way to call count only once, though.
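
One possible way around the double call, as a sketch rather than a drop-in replacement: report a key the moment its second occurrence is seen, so nothing has to be flushed at the next assocId line or in END. The array name seen is my own, and the trade-off is that the total number of repeats is no longer printed, because the message goes out before the record is finished.

```shell
#!/bin/sh
out=$(printf '%s\n' assocId=1 IMPI=XXX IMPU=YYY MSISDN=ZZZ IMSI=PPP \
    assocId=2 IMPI=ddd IMPI=eee IMPU=fff IMPU=ggg \
    IMSI=hhh IMSI=iii MSISDN=jjj MSISDN=kkk |
awk -F= '
    # New record: remember its header line and forget the old counts.
    /^assocId=/ { id = $0; split("", seen); next }
    # Fires exactly once per key, on the second occurrence within the record.
    ++seen[$1] == 2 { print id, "has repeated", $1 }
')
printf '%s\n' "$out"
```

Because it counts $1 for every line, this version also picks up IMSI, which the script above does not track.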
