Merge multiple pipe-delimited files with unequal numbers of rows into a single file based on the first column


For just the case shown, with three files and two columns in each file:

$ join -t '|' -o0,1.2,2.2 -a 1 -a 2 test[12].txt | join -t '|' -o0,1.2,1.3,2.2 -a 1 -a 2 - test3.txt
1|1|4|7
2|2|5|8
3||6|9
4|||10

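That is, perform a relational full outer join on the first two files, then join the output of that with the third file in the same way. It is the -a 1 -a 2 that makes each join a full outer join. With GNU join, you can replace the -o option and its option-argument with -o auto.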
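The effect of -a 1 -a 2 and of -o auto can be seen on the first two files alone. The following is a minimal sketch, not part of the original answer. The file contents are an assumption, reconstructed from the output shown above, and -o auto requires GNU join.

```shell
#!/bin/sh

# Hypothetical sample data, inferred from the output shown above.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' '1|1' '2|2'       >"$dir/test1.txt"
printf '%s\n' '1|4' '2|5' '3|6' >"$dir/test2.txt"

# Inner join: only keys present in both files survive.
inner=$(join -t '|' -o 0,1.2,2.2 "$dir/test1.txt" "$dir/test2.txt")

# Full outer join: -a 1 -a 2 also emits the unpairable lines of each file.
outer=$(join -t '|' -o 0,1.2,2.2 -a 1 -a 2 "$dir/test1.txt" "$dir/test2.txt")

# GNU join only: -o auto infers the field list from the first line of each file.
auto_out=$(join -t '|' -o auto -a 1 -a 2 "$dir/test1.txt" "$dir/test2.txt")

printf 'inner:\n%s\n' "$inner"
printf 'outer:\n%s\n' "$outer"
printf 'outer with -o auto:\n%s\n' "$auto_out"
```

Key 3 exists only in the second file, so it is absent from the inner join and appears as 3||6 in both outer joins, with an empty field where the first file has no value.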

This approach generalizes to any number of files as a script:

#!/bin/sh

# sanity check
if [ "$#" -lt 2 ]; then
    echo 'require at least two files' >&2
    exit 1
fi

# temporary files
result=$(mktemp)  # the result of a join
tmpfile=$(mktemp) # temporary file holding a previous result

# remove temporary files on exit
trap 'rm -f "$result" "$tmpfile"' EXIT

# join the first two files
join -t '|' -o auto -a 1 -a 2 "$1" "$2" >"$result"
shift 2

# loop over the remaining files, adding to the result with each
for pathname do
    mv "$result" "$tmpfile"
    join -t '|' -o auto -a 1 -a 2 "$tmpfile" "$pathname" >"$result"
done

# done, output result
cat "$result"

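The script relies on GNU join for the -o auto option. It assumes that the join happens on the first |-delimited field of each file and that the files are sorted lexicographically on this field.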

It joins the first two files, then joins each remaining file in turn onto the result of the previous join.
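The sortedness assumption could be checked before joining. The sketch below is an illustration only: is_sorted is a hypothetical helper name, and it uses sort -C, which checks the order silently. The check is conservative, since sort may also reject a file whose lines share a key but are otherwise out of order.

```shell
#!/bin/sh

# is_sorted FILE (hypothetical helper): succeeds if FILE is already
# ordered on its first |-delimited field in the current locale,
# which is the same collation that join will use.
is_sorted () {
    sort -C -t '|' -k1,1 "$1"
}

# Hypothetical sample data for the demonstration.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' '1|a' '2|b' '3|c' >"$dir/good.txt"
printf '%s\n' '2|b' '1|a'       >"$dir/bad.txt"

for f in "$dir/good.txt" "$dir/bad.txt"; do
    if is_sorted "$f"; then
        echo "${f##*/}: sorted on field 1"
    else
        echo "${f##*/}: not sorted on field 1" >&2
    fi
done
```

In the script, such a check could run over "$@" right after the argument-count test and exit with an error on the first unsorted file.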

The first example from the question:

$ ./script.sh test[123].txt
1|1|4|7
2|2|5|8
3||6|9
4|||10

The second example from the question (note that the question shows the wrong number of empty fields):

$ ./script.sh test[123].txt
1|1|2|4|7
2|3|4|5|8
3|||6|9
4||||10

If the files are not sorted, you can sort them on the fly (note: this switches to bash for process substitution):

#!/bin/bash

# sanity check
if [ "$#" -lt 2 ]; then
    echo 'require at least two files' >&2
    exit 1
fi

# temporary files
result=$(mktemp)  # the result of a join
tmpfile=$(mktemp) # temporary file holding a previous result

# remove temporary files on exit
trap 'rm -f "$result" "$tmpfile"' EXIT

# join the first two files
join -t '|' -o auto -a 1 -a 2 \
    <( sort -t '|' -k1,1 "$1" ) \
    <( sort -t '|' -k1,1 "$2" ) >"$result"
shift 2

# loop over the remaining files, adding to the result with each
for pathname do
    mv "$result" "$tmpfile"

    # note: "$tmpfile" would already be sorted

    join -t '|' -o auto -a 1 -a 2 \
        "$tmpfile" \
        <( sort -t '|' -k1,1 "$pathname" ) >"$result"
done

# done, output result
cat "$result"
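If bash is not available, the loop step can avoid process substitution by piping the sorted file into join, which reads standard input when given - as a file operand (the same mechanism as the pipeline at the top). This is a sketch with hypothetical file names and data, not the author's script. The very first join of two unsorted files would still need one of them sorted into a temporary file.

```shell
#!/bin/sh

# Hypothetical sample data.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' '1|a' '2|b' >"$dir/sorted.txt"    # stands in for "$tmpfile"
printf '%s\n' '3|z' '1|x' >"$dir/unsorted.txt"  # stands in for "$pathname"

# Sort the new file and feed it to join on standard input ("-").
# -o auto is a GNU join extension.
merged=$(
    sort -t '|' -k1,1 "$dir/unsorted.txt" |
    join -t '|' -o auto -a 1 -a 2 "$dir/sorted.txt" -
)

printf '%s\n' "$merged"
```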

To let the user join on another field (with -f), use another delimiter (with -d), and choose another join type (with -j):

#!/bin/bash

# default values
delim='|'
field='1'

join_type=( -a 1 -a 2 ) # full outer join by default

# override the above defaults with options given to us by the user
# on the command line
while getopts 'd:f:j:' opt; do
    case "$opt" in
        d) delim="$OPTARG" ;;
        f) field="$OPTARG" ;;
        j)
            case "$OPTARG" in
                inner) join_type=( ) ;;
                left)  join_type=( -a 1 ) ;;
                right) join_type=( -a 2 ) ;;
                full)  join_type=( -a 1 -a 2 ) ;;
                *) printf 'unknown join type "%s", expected inner, left, right or full\n' "$OPTARG" >&2
                   exit 1
            esac ;;
        *) echo 'error in command line parsing' >&2
           exit 1
    esac
done

shift "$(( OPTIND - 1 ))"

# sanity check
if [ "$#" -lt 2 ]; then
    echo 'require at least two files' >&2
    exit 1
fi

# temporary files
result=$(mktemp)  # the result of a join
tmpfile=$(mktemp) # temporary file holding a previous result

# remove temporary files on exit
trap 'rm -f "$result" "$tmpfile"' EXIT

# join the first two files
join -t "$delim" -j "$field" -o auto "${join_type[@]}" \
    <( sort -t "$delim" -k"$field,$field" "$1" ) \
    <( sort -t "$delim" -k"$field,$field" "$2" ) >"$result"
shift 2

# loop over the remaining files, adding to the result with each
for pathname do
    mv "$result" "$tmpfile"

    # note: $tmpfile would already be sorted and
    #       the join field is the first field in that file

    join -t "$delim" -2 "$field" -o auto "${join_type[@]}" \
        "$tmpfile" \
        <( sort -t "$delim" -k "$field,$field" "$pathname" ) >"$result"
done

# done, output result
cat "$result"
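The loop uses -2 "$field" rather than -j "$field" because, with -o auto, join writes the join field first in its output. After the first join the key is therefore always field 1 of the intermediate result, whatever field it came from. The following sketch illustrates this with hypothetical data joined on field 2. It assumes GNU join.

```shell
#!/bin/sh

# Hypothetical sample data, each file sorted on its second field.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' 'x|1' 'y|3' >"$dir/a.txt"
printf '%s\n' 'p|1' 'q|2' >"$dir/b.txt"
printf '%s\n' 'k|3'       >"$dir/c.txt"

# Step 1: both inputs are raw files, so the key is field 2 in each (-j 2).
step1=$(join -t '|' -j 2 -o auto -a 1 -a 2 "$dir/a.txt" "$dir/b.txt")
printf '%s\n' "$step1" >"$dir/step1.txt"

# Step 2: the key is now field 1 of the intermediate result (the default
# for file 1), while it is still field 2 of the raw third file (-2 2).
step2=$(join -t '|' -2 2 -o auto -a 1 -a 2 "$dir/step1.txt" "$dir/c.txt")

printf 'after step 1:\n%s\n' "$step1"
printf 'after step 2:\n%s\n' "$step2"
```

The intermediate result is also already sorted on its first field, which is why only the newly added file needs sorting in the loop.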

Testing by re-running the second example:

$ ./script.sh test[123].txt
1|1|2|4|7
2|3|4|5|8
3|||6|9
4||||10

Running on the same files, but joining on the second field:

$ ./script.sh -f 2 test[123].txt
1|1|2||
10||||4
3|2|4||
4|||1|
5|||2|
6|||3|
7||||1
8||||2
9||||3

Doing an inner join:

$ ./script.sh -j inner test[123].txt
1|1|2|4|7
2|3|4|5|8
