Merging multiple files using common IDs

Question 1

This may not be possible with a single tool. Here is a script-based suggestion that involves a call to sort and two temporary external files.

#!/bin/bash

# The number of columns is equal to the number of input files, which is
# equal to the number of command-line arguments.
NUMCOLS=$#


# Use associative container to record all "IDs" and associated fields
declare -A entries
col=0


# Read the fields from all files and store them so that the field values can be
# associated with the file they came from (= the column they belong).
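: > "tmp.ids"   # start from an empty ID list so a stale file cannot leak in
for FILE in "$@"
do
    # -r keeps backslashes literal; "value" receives the rest of the line
    while read -r id value
    do
        SORTKEY="${id}__${col}"
        entries[$SORTKEY]="$value"
        echo "$id" >> "tmp.ids"
    done < "$FILE"
    col=$((col + 1))
done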

# Sort the IDs
sort -u "tmp.ids" > "tmp.ids.sorted"


# Read the sorted IDs back in and generate output lines, where the
# column fields are taken from the associative container "entries" and
# tab-separated.
# If "entries" doesn't contain a value for a given key, output "-NA-" instead.

while read -r id
do
    LINE="$id"
    for (( col=0; col<NUMCOLS; col++ ))
    do
        SORTKEY="${id}__${col}"
        if [[ -z "${entries[$SORTKEY]}" ]]
        then
            LINE+=$'\t-NA-'
        else
            LINE+=$'\t'"${entries[$SORTKEY]}"
        fi
    done
    printf '%s\n' "$LINE"
done < "tmp.ids.sorted" > "outfile.txt"   # overwrite rather than append to old output

rm tmp.ids tmp.ids.sorted

You can call this as ./sortscript.sh <file1> <file2> ... <fileN>.

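This builds an associative container, entries (associative arrays require bash 4 or later), and stores every field read from the input files under a key generated from the "ID" field and the column number. The IDs are written to an external file, tmp.ids, so that they can be sorted, which seems to be what you want.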

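After sorting, the IDs are read back in. For each ID, all available fields belonging to that key are then read from the container entries and placed on the output line (the variable LINE). If no value is available for a particular ID/column combination, -NA- is written instead.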

The output lines are then written to the file outfile.txt.
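The key scheme can be tried in isolation. The following is only a minimal sketch with made-up data: the IDs geneA and geneB, their values, and the helper name lookup are hypothetical and do not appear in the script above.

```shell
#!/bin/bash
# Hypothetical sample data: ID/value pairs as they would be stored after
# reading two input files (column 0 and column 1).
declare -A entries
entries["geneA__0"]="kinase"      # geneA appears in file 0
entries["geneA__1"]="observed"    # ... and in file 1
entries["geneB__1"]="unknown"     # geneB appears only in file 1

# lookup ID COLUMN: print the stored value, or -NA- if none was recorded.
# (Hypothetical helper, mirroring the -z test used in the script.)
lookup() {
    local key="${1}__${2}"
    if [[ -z "${entries[$key]:-}" ]]; then
        printf '%s\n' "-NA-"
    else
        printf '%s\n' "${entries[$key]}"
    fi
}

lookup geneA 0   # prints: kinase
lookup geneB 0   # prints: -NA-
```

Because the test is for an empty string, an ID that is present in a file but has no value is also reported as -NA-.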

Answer

Question 2

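You can use the join utility twice to produce two "outer joins" across the three files. Assuming all three files are tab-separated, start with the first two files: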

$ join -a 1 -a 2 -o 0,1.2,2.2 -e '-NA-' -t $'\t' <( sort File1 ) <( sort File2 )
MYORGANISM_I_05140.t1   Atypical/PIKK/FRAP      VALUES to be taken
MYORGANISM_I_06518.t1   CAMK/MLCK       -NA-
MYORGANISM_I_00854.t1   TK-assoc/SH2/SH2-R      -NA-
MYORGANISM_I_12755.t1   TK-assoc/SH2/Unique     -NA-
MYORGANISM_I_12766.t1   -NA-    what

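This asks the join utility to join the sorted files on the first field (the default). With -a 1 -a 2 we explicitly request all lines from both files, even those without a match, and with -o 0,1.2,2.2 we request that the output contain the join field (the first column) followed by the second column of each file. The -e '-NA-' option specifies the string used to fill empty fields.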
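To see the effect of these options on a small input, here is a sketch with two hypothetical files (left.tsv and right.tsv are made-up names and contents, already sorted on the first field):

```shell
#!/bin/bash
# Hypothetical sample files, tab-separated and sorted on the first field.
workdir=$(mktemp -d)
printf 'id1\tA\nid2\tB\n' > "$workdir/left.tsv"
printf 'id2\tX\nid3\tY\n' > "$workdir/right.tsv"

# Full outer join on field 1; fields missing on either side become -NA-.
joined=$(join -a 1 -a 2 -o 0,1.2,2.2 -e '-NA-' -t $'\t' \
    "$workdir/left.tsv" "$workdir/right.tsv")
printf '%s\n' "$joined"

rm -r "$workdir"
```

With this data the result has three tab-separated lines: id1 with A and -NA-, id2 with B and X, and id3 with -NA- and Y.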

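The above gives us a new data set that we can use in a second join with the third file. For simplicity, assume the result above is available in tmpdata (after redirecting it there); then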

$ join -a 1 -a 2 -o 0,1.2,1.3,2.2 -e '-NA-' -t $'\t' tmpdata <( sort FILE3 )
MYORGANISM_I_00854.t1   TK-assoc/SH2/SH2-R      -NA-    -NA-
MYORGANISM_I_05140.t1   Atypical/PIKK/FRAP      VALUES to be taken      -NA-
MYORGANISM_I_06518.t1   CAMK/MLCK       -NA-    -NA-
MYORGANISM_I_12755.t1   TK-assoc/SH2/Unique     -NA-    -NA-
MYORGANISM_I_12766.t1   -NA-    what    -NA-
MYORGANISM_I_16941.t1   -NA-    -NA-    OK
MYORGANISM_I_93484.t1   -NA-    -NA-    LET IT BE

This more or less repeats the earlier "outer join", but adds one extra column through the -o option.
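The two steps can also be wrapped in one function so the intermediate file is handled automatically. This is only a sketch under assumptions: the function name outer_join3, the use of mktemp for the intermediate result, and the demo files f1, f2 and f3 are illustrative and not part of the commands above. It also sorts explicitly on the first tab-separated field.

```shell
#!/bin/bash
# outer_join3 A B C: full outer join of three tab-separated, two-column
# files on their first field (hypothetical helper).
outer_join3() {
    local tmp
    tmp=$(mktemp)
    # Step 1: join the first two files; the output is sorted on the join field.
    join -a 1 -a 2 -o 0,1.2,2.2 -e '-NA-' -t $'\t' \
        <(sort -t $'\t' -k1,1 "$1") <(sort -t $'\t' -k1,1 "$2") > "$tmp"
    # Step 2: join the three-column intermediate result with the third file.
    join -a 1 -a 2 -o 0,1.2,1.3,2.2 -e '-NA-' -t $'\t' \
        "$tmp" <(sort -t $'\t' -k1,1 "$3")
    rm -f "$tmp"
}

# Hypothetical demo data; f3 is deliberately unsorted.
demo=$(mktemp -d)
printf 'id1\tA\n' > "$demo/f1"
printf 'id2\tB\n' > "$demo/f2"
printf 'id3\tD\nid1\tC\n' > "$demo/f3"
result=$(outer_join3 "$demo/f1" "$demo/f2" "$demo/f3")
printf '%s\n' "$result"
rm -r "$demo"
```

Each further file would need one more join step and one more column in the -o list, so for many files the script-based approach may be easier to maintain.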

Answer