How to add missing lines by comparing only 2 columns in 2 different files

I have two files.

File 1 (the reference file)

xxx xxxxx 00    
xxx xxxxx 01    
xxx xxxxx 02    
xxx xxxxx 03    
xxx xxxxx 04    
xxx xxxxx 00    
xxx xxxxx 01     
xxx xxxxx 02    
xxx xxxxx 03    
xxx xxxxx 04   

File 2

12345 2021/04/02 00    
1212  2021/04/02 01    
12123 2021/04/02 02    
12123 2021/04/02 04    
1223  2021/04/03 01    
124   2021/04/03 02    
123   2021/04/03 03    

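I want to compare the last field of the two files and add the lines that are missing from file 2, taking them from the first file (my reference file).

For example, I would like the output to be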

12345 2021/04/02 00    
1212  2021/04/02 01    
12123 2021/04/02 02    
xxx   xxxxx      03    
12123 2021/04/02 04    
xxx   xxxxx      00     
1223  2021/04/03 01    
124   2021/04/03 02    
123   2021/04/03 03    
xxx   xxxxx      04    

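I tried awk -F ' ' 'NR==FNR{a[$2]++;next}a[$2] && $1>=00' test2.txt test1.txt, which adds the third-field values missing from file1, but the output also drops data that I need (the second and third fields).

Answer 1

Perhaps something like this?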

cat file2 | awk '!(1 in f) {if ((getline l < "-") == 1) split(l, f)} $3!=f[3] {print;next} {print l; delete f}' file1 | column -t

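Note that the script expects file1 as an argument to awk and file2 on standard input. I used a "useless use of cat" to make that more explicit, but naturally you can supply it as a < file2 redirection instead. In fact, you could even embed the file name in the script itself, as "file2" in place of the "-" given to getline, but this way is a bit more flexible.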
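The redirection form mentioned above can be sketched as follows. This is only a minimal illustration: the sample data is copied from the question (without trailing spaces), the scratch directory and the `merged` variable are names introduced here, and it assumes an awk that treats "-" in getline as standard input, as gawk and mawk do.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)" || exit 1

# Sample data copied from the question.
cat > file1 <<'EOF'
xxx xxxxx 00
xxx xxxxx 01
xxx xxxxx 02
xxx xxxxx 03
xxx xxxxx 04
xxx xxxxx 00
xxx xxxxx 01
xxx xxxxx 02
xxx xxxxx 03
xxx xxxxx 04
EOF
cat > file2 <<'EOF'
12345 2021/04/02 00
1212  2021/04/02 01
12123 2021/04/02 02
12123 2021/04/02 04
1223  2021/04/03 01
124   2021/04/03 02
123   2021/04/03 03
EOF

# file1 is the awk argument; file2 arrives on standard input via redirection.
merged=$(awk '!(1 in f) {if ((getline l < "-") == 1) split(l, f)} $3!=f[3] {print;next} {print l; delete f}' file1 < file2)

# Print the merged lines (pipe through `column -t` to align the columns).
printf '%s\n' "$merged"
```

The awk program itself is unchanged; only the way file2 reaches standard input differs from the cat pipeline.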

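Also note that the two files are expected to start out "in sync" with respect to their field 3 values, or, if that makes sense for your use case, file2 may start "ahead" of file1.

For readability, here is the script broken out on its own, with detailed explanatory comments: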

# Check whether our `real_fields` array is empty.
# NOTE: we use the `<index> in <array>` construct
# in order to force awk to treat the `real_fields` name as an
# array (instead of as a scalar, as it would by default)
# and create it in an empty state
!(1 in real_fields) {
    # get the next line (if any) from the "real" file
    if ((getline real_line < "-") == 1)
        # split that line in separate fields populating
        # our `real_fields` array
        split(real_line, real_fields)
        # awk split function creates an array with numeric
        # indexes for each field found as per FS separator
}
# if field3 of the current line of the "reference"
# file does not match the current line of the "real" file..
$3!=real_fields[3] {
    # print current line of "reference" file
    print
    # go reading next line of "reference" file thus
    # skipping the final awk pattern
    next
}
# final awk pattern, we get here only if the pattern
# above did not match, i.e. if field3 values from both
# files match
{
    # print current line of "real" file
    print real_line
    # delete our real_fields array, thus triggering
    # the fetching of the next line of "real" file as
    # performed by the first awk pattern
    delete real_fields
}

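Answer 2

You need to set the array traversal order; otherwise the order in which awk visits the stored lines in for (c in a) is unspecified and your lines may come out reordered. Note that PROCINFO["sorted_in"] is a gawk extension, so this script needs gawk.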

#!/usr/bin/awk -f

BEGIN {
    PROCINFO["sorted_in"] = "@ind_str_asc"
}
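NR==FNR {
    # zero-pad the counter: "@ind_str_asc" compares indices as strings,
    # so an unpadded 10 would sort before 2
    a[sprintf("%09d", i++),$3]=$0
    next
}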
{
    for (c in a) {
        split(c, s, SUBSEP)
        if (s[2] == $3) {
            print $0
            getline
        } else {
            print a[c]
        }
    }
}

./script.awk file1 file2
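
As an alternative sketch (not the answerer's script), the reference lines can be kept in a numerically indexed array and walked with a counter, which avoids depending on array traversal order at all. The names `ref_line`, `ref_key`, `pos`, and `merged` are introduced here for illustration, and the sketch assumes that file1 is non-empty, that both files list their keys in the same order, and that every file2 line has a counterpart in file1.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)" || exit 1

# Sample data copied from the question.
cat > file1 <<'EOF'
xxx xxxxx 00
xxx xxxxx 01
xxx xxxxx 02
xxx xxxxx 03
xxx xxxxx 04
xxx xxxxx 00
xxx xxxxx 01
xxx xxxxx 02
xxx xxxxx 03
xxx xxxxx 04
EOF
cat > file2 <<'EOF'
12345 2021/04/02 00
1212  2021/04/02 01
12123 2021/04/02 02
12123 2021/04/02 04
1223  2021/04/03 01
124   2021/04/03 02
123   2021/04/03 03
EOF

merged=$(awk '
BEGIN { n = 0; pos = 0 }   # explicit zeros so the first subscripts are "0", not ""
NR == FNR {                # first file: remember every reference line and its key
    ref_line[n] = $0
    ref_key[n] = $3
    n++
    next
}
{                          # second file: the real data
    # Emit reference lines until the reference key matches this line.
    while (pos < n && ref_key[pos] != $3)
        print ref_line[pos++]
    print                  # the real line stands in for the matching reference line
    pos++
}
END {                      # reference lines left over after the real data ends
    while (pos < n)
        print ref_line[pos++]
}' file1 file2)

printf '%s\n' "$merged"
```

Because the array is indexed by a plain counter, this variant uses no gawk-specific features; as before, the result can be piped through column -t to align the columns.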
