Merge 2 files and keep only one of each duplicate

Answer 1

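The spacing changes because you are printing i) the first and second fields concatenated and ii) the third field. By default, awk uses a single space as the output field separator (OFS), which is what messes up your spacing. A simple solution is to store the line itself ($0) in the array instead of the individual fields: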

a[$1$2]=$0;

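However, your script would not do what you want anyway. It only prints lines of file1 that are also present in file2, so anything that exists only in file1 is skipped. Judging by your desired output, you want to print all lines from both files, and where a line of file2 has the same first two fields as a line of file1, print only the file1 line. You can do that with awk: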

awk 'FNR==NR{a[$1$2]=$0; print} !($1$2 in a) {print}' file1 file2

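This saves every line of file1 in the array and prints it. Then, while file2 is being processed, it prints any line whose first two fields are not in a.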
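The logic described above can be tried out with a small self-contained sketch. The sample data is an assumption modeled on the output shown in this answer (the original input files are not reproduced here), and the function name merge_keep_first is hypothetical. Unlike the one-liner above, this variant keys the array on ($1,$2) rather than $1$2, so that pairs such as "ab c" and "a bc" cannot collide; that is a small change, not part of the original command.

```shell
# Hypothetical sample data, modeled on the output shown in this answer.
tmpdir=$(mktemp -d)
printf '%s\n' '11111111    abc12345    Y' '22222222    xyz23456    Y' > "$tmpdir/file1"
printf '%s\n' '11111111  abc12345' '33333333    kbc34567' > "$tmpdir/file2"

merge_keep_first() {
    # Print every line of the first file and remember its (field 1, field 2) pair.
    # Then print a line of the second file only if its pair was not seen before.
    awk 'FNR==NR { seen[$1,$2]; print; next } !(($1,$2) in seen)' "$1" "$2"
}

merged=$(merge_keep_first "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

With this sample data, the file2 line for 11111111 is dropped even though its spacing differs from file1, because awk compares fields rather than raw text.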

Note that you can also do this with sort:

$ sort -uk1,2 file1 file2 
11111111    abc12345    Y
22222222    xyz23456    Y
33333333    kbc34567

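You just need to make sure that the amount of whitespace is the same in both files (which is not the case in your example), because by default sort treats the blanks in front of a field as part of the key. Alternatively, normalize the whitespace first, for example with sed (the \t in the replacement requires GNU sed):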

$ sed  's/  */\t/g' file1 file2 | sort -uk1,2 
11111111    abc12345    Y
22222222    xyz23456    Y
33333333    kbc34567

Answer 2

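Depending on how large your files are, this may not be the most efficient approach, but I think it works for this particular case. It does not require the files to be in any particular order, but it does require that you always prefer File1 over File2: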

#!/bin/bash
# List the unique identifiers (first two fields) found in either file,
# replacing the whitespace between them with a comma.
awk '{print $1 "," $2}' File1 File2 | sort -u |
# Loop through all the unique identifiers we just found
while IFS= read -r l; do
    # Build a regular expression for each identifier to use as a search
    # term, changing the comma into "one or more whitespace characters"
    # and anchoring it so that only whole fields match
    searchterm="^[[:space:]]*$(printf '%s\n' "$l" | sed 's/,/[[:space:]]+/')([[:space:]]|\$)"
    # If this pattern exists in File1
    if grep -qE "$searchterm" File1; then
        # print the matching line from File1
        grep -E "$searchterm" File1
    else
        # otherwise, print the matching line from File2
        grep -E "$searchterm" File2
    fi
done

If you want File3, you can save this as a script and redirect its output there:

# copy the script above to merge_uniq.sh
chmod +x merge_uniq.sh
./merge_uniq.sh > File3

Answer 3

awk 'BEGIN{i=0} {if (!($1$2  in a)) {a[$1$2]=$0; index_array[i] =$1$2; i++} } END{for (j=0; j<i; j++) print a[index_array[j]]}' 1 2

11111111        abc12345   Y

22222222        xyz23456   Y
33333333       kbc34567
