How to compare two files using only two columns and print the differences (without sorting)?

Question 1

With awk:

$ awk -v FS="\t" -v OFS="\t" 'NR==FNR {trans[$2"|"$3]++; next;} FNR==1 {print} FNR>1 {if(!trans[$2"|"$3]) print}' file2 file1

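file2 is read first, and for each line the values of columns 2 and 3 are joined into a key that is stored in an array.
When file1 is then read, its header line is printed. For every following line, we check whether the array built earlier contains a key made of that line's column 2 and column 3 values. If it does not, we print the line.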
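
To see the one-liner at work, here is a minimal sketch on made-up data. The file contents, column names and temporary directory are assumptions for illustration only; the awk program is the one given above, just spread over several lines with comments.

```shell
#!/bin/sh
# Hypothetical sample data: tab-separated, one header line per file.
tmp=$(mktemp -d)
printf 'id\tname\tcity\n1\talice\tparis\n2\tbob\trome\n3\tcarol\toslo\n' > "$tmp/file1"
printf 'id\tname\tcity\n9\tbob\trome\n8\tcarol\tlima\n' > "$tmp/file2"

# file2 is passed first, so NR==FNR is true only while reading it.
result=$(awk -v FS="\t" -v OFS="\t" '
    NR==FNR { trans[$2"|"$3]++; next }       # file2: remember each col2|col3 pair
    FNR==1  { print }                        # file1: always print the header
    FNR>1   { if (!trans[$2"|"$3]) print }   # file1: print rows whose pair is new
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this data the header and the rows for alice and carol are printed. The carol row survives even though "carol" appears in file2, because the key is the pair of both columns and the city differs.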

Question 2

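How the files should be compared is not clearly explained or defined.

That doesn't stop me from trying to read your mind, though...

As I understand it, file 2 is a kind of database or reference file, and file 1 supposedly contains new data.

My understanding of "compare": if the value in column 2 or column 3 of a line in file 1 is already found in file 2 (the reference), don't print that line. Otherwise, print it.

The good news is that this really needs no sorting, just as you asked.

Below is a script that takes 2 arguments: the first is the new data file (file 1 in the example), the second is the database file (file 2 in the example).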

#!/bin/bash

new_file=$1
db_file=$2

# Make sure both parameters were given (checking the last one is enough)
if [ -z "$db_file" ]; then
    echo >&2 "[ERROR] This script expects 2 file paths as parameters."
    exit 1
fi

if [ ! -f "$new_file" ]; then
    echo >&2 "[ERROR] First parameter file doesn't exist."
    exit 2
fi

if [ ! -f "$db_file" ]; then
    echo >&2 "[ERROR] Second parameter file doesn't exist."
    exit 3
fi


declare -A data_base

# Open both files and assign to file descriptor 10 and 11
exec 10< "$new_file"
exec 11< "$db_file"

# Step 1
# Building map of base data first (for the comparison to happen in next step)
first_line=1
while read -r -u 11 db_file_col1 db_file_col2 db_file_col3 db_file_rest
do

    # Skipping the header so that it will appear in the diff as shown in the example
    if [  $first_line -ne 0 ]; then
        first_line=0
        continue
    fi


    # Creating map from Col 2 and Col 3 (keys) to the whole line (value)
    data_base[$db_file_col2]="$db_file_col1 $db_file_col2 $db_file_col3 $db_file_rest"
    data_base[$db_file_col3]="$db_file_col1 $db_file_col2 $db_file_col3 $db_file_rest"
done


# Step 2
# Actual comparison ... 
while read -r -u 10 new_file_col1 new_file_col2 new_file_col3 new_file_rest
do

    if [ -z "${data_base[$new_file_col2]}" ] && [ -z "${data_base[$new_file_col3]}" ]; then
        echo "$new_file_col1 $new_file_col2 $new_file_col3 $new_file_rest"
    fi

done

For example, if you save the script to a file named process.sh (and make it executable with "chmod 755 process.sh"), run:

./process.sh file1 file2

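This produces exactly your expected output.

Note: this script keeps the contents of file 2 in memory at least twice, because each line is stored once under its column 2 value and once under its column 3 value. Make sure you have enough memory.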
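
If memory is a concern, the same "column 2 or column 3 already known" rule can be sketched in awk while storing only the column values instead of whole lines. This is an alternative sketch, not the script above: it assumes tab-separated input with one header line per file, and the sample data is made up.

```shell
#!/bin/sh
# Hypothetical sample data: tab-separated, one header line per file.
tmp=$(mktemp -d)
printf 'id\tname\tcity\n1\talice\tparis\n2\tbob\tlima\n3\tcarol\toslo\n' > "$tmp/file1"
printf 'id\tname\tcity\n9\tbob\trome\n8\tdave\toslo\n' > "$tmp/file2"

# The reference file (file2) is passed first, the new data (file1) second.
lean=$(awk -F'\t' '
    NR==FNR {
        # Reference file: skip its header, keep only the values as keys.
        # As in the bash script, col 2 and col 3 values share one lookup table.
        if (FNR > 1) { seen[$2]; seen[$3] }
        next
    }
    # New file: print the header, and any row with no known value.
    FNR==1 || !(($2 in seen) || ($3 in seen))
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$lean"
rm -rf "$tmp"
```

Here only the header and the alice row are printed: bob is dropped because his name is in the reference, and carol because her city is.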

Answer