How do I merge tab-delimited files?

Answer 1

The classic UNIX tool for this job is join:

NAME
       join - join lines of two files on a common field

SYNOPSIS
       join [OPTION]... FILE1 FILE2

DESCRIPTION
       For  each  pair of input lines with identical join fields, write a line
       to standard output.  The default join field is the first, delimited  by
       blanks.

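However, join i) needs its input to be sorted in order to work, and ii) can only handle two files at a time. So you could do something ugly and inelegant like this:

Sort each file on its second field and save the result as a new file: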

sort -k2 file1 > sorted1
sort -k2 file2 > sorted2
sort -k2 file3 > sorted3

Join file 1 and file 2 into a new file, then join the third file to that:

$ join -j2 --nocheck-order sorted1 sorted2 > newfile
$ join -o 1.2,1.3,2.1,1.1  -1 1 -2 2 --nocheck-order newfile sorted3 
10 3 9 Hac.2
1 33 23 Hhe.7
2 15 70 Hpyl.1
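
To make the two steps above reproducible, here is a minimal sketch that builds sample inputs first. The file contents are an assumption reconstructed from the output shown (a value, a tab, then the key); the original inputs are not shown here. It also pins LC_ALL=C so that sort and join agree on collation order, and spells -j2 as the portable -1 2 -2 2.

```shell
# Assumed sample data (not shown in the original): value<TAB>key per line.
export LC_ALL=C                      # sort and join must use the same collation
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '10\tHac.2\n1\tHhe.7\n2\tHpyl.1\n'  > file1
printf '3\tHac.2\n33\tHhe.7\n15\tHpyl.1\n' > file2
printf '9\tHac.2\n23\tHhe.7\n70\tHpyl.1\n' > file3

# Step 1: sort every file on its second field only.
for n in 1 2 3; do
    sort -k2,2 "file$n" > "sorted$n"
done

# Step 2: join files 1 and 2 on field 2; the key moves to column 1 of newfile.
join -1 2 -2 2 sorted1 sorted2 > newfile

# Step 3: join newfile (key in field 1) with sorted3 (key in field 2).
join -o 1.2,1.3,2.1,1.1 -1 1 -2 2 newfile sorted3 > merged
cat merged
```

Because the inputs are sorted here, --nocheck-order is not needed; if join complains about ordering, the usual cause is a locale mismatch between sort and join.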

The options used are:

   -1 FIELD
          join on this FIELD of file 1

   -2 FIELD
          join on this FIELD of file 2

   -j FIELD
          equivalent to '-1 FIELD -2 FIELD'

   --nocheck-order
          do not check that the input is correctly sorted

   -o FORMAT
          obey FORMAT while constructing output line

   FORMAT is one or more comma or blank separated specifications, each
   being 'FILENUM.FIELD' or '0'.

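So the command joins on the first field of the first file and the second field of the second file. It prints the second field of the first file (1.2), then the third field of the first file (1.3), then the first field of the second file (2.1), and finally the first field of the first file (1.1).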
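
As a small illustration of the FORMAT syntax, the sketch below uses 0 (the join field itself) instead of 1.1, and adds -t with a tab so that the output stays tab-delimited rather than space-separated. The two input files are made up for the example.

```shell
export LC_ALL=C
tab=$(printf '\t')                   # a literal tab, portable across shells
dir=$(mktemp -d)

# Hypothetical inputs: "left" has the key in field 1, "right" in field 2.
printf 'Hac.2\t10\t3\nHhe.7\t1\t33\n' > "$dir/left"
printf '9\tHac.2\n23\tHhe.7\n'        > "$dir/right"

# 1.2 and 1.3 come from left, 2.1 from right, 0 is the shared join field.
join -t "$tab" -1 1 -2 2 -o 1.2,1.3,2.1,0 "$dir/left" "$dir/right" > "$dir/out"
cat "$dir/out"
```

With -t, join splits the input on that character as well, so both files have to be tab-separated.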

Alternatively, you can combine the whole thing into one very complex command:

$ join -o 1.1,2.2,2.3,2.1 -1 2 -2 1  --nocheck-order <(sort -k2 file3) \
      <(join -j2  --nocheck-order <(sort -k2 file1) <(sort -k2 file2)) 
9 10 3 Hac.2
23 1 33 Hhe.7
70 2 15 Hpyl.1

If you don't like cryptic command lines, you can use a small script:

$ awk '{a[$NF]=$1"\t"a[$NF];} END{for(i in a){print a[i],i}}' file{1,2,3} 
23  33  1    Hhe.7
9   3   10   Hac.2
70  15  2    Hpyl.1

Answer 2

This is a job for join, which joins two files on a common field:

$ join -11 -22 -o1.2,1.3,2.1,0 <(join -j2 <(sort -k2,2 f1.txt) <(sort -k2,2 f2.txt)) <(sort -k2,2 f3.txt)
10 3 9 Hac.2
1 33 23 Hhe.7
2 15 70 Hpyl.1

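Because join takes only two input files at a time, we use process substitution (<()) to pass the output of joining the first two files, together with the third file, to a second join.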
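
Process substitution is a bash/ksh/zsh feature. In a plain POSIX sh, roughly the same chain can be written as a pipeline, because join reads standard input when one of its file operands is -. The sketch below assumes the same three inputs as above, with contents reconstructed from the output shown; the *.sorted files are helper names made up for the example.

```shell
export LC_ALL=C
dir=$(mktemp -d)

# Assumed inputs: value<TAB>key per line.
printf '10\tHac.2\n1\tHhe.7\n2\tHpyl.1\n'  > "$dir/f1.txt"
printf '3\tHac.2\n33\tHhe.7\n15\tHpyl.1\n' > "$dir/f2.txt"
printf '9\tHac.2\n23\tHhe.7\n70\tHpyl.1\n' > "$dir/f3.txt"

# Only one side of each join can come from the pipe, so pre-sort the others.
sort -k2,2 "$dir/f2.txt" > "$dir/f2.sorted"
sort -k2,2 "$dir/f3.txt" > "$dir/f3.sorted"

sort -k2,2 "$dir/f1.txt" |
    join -1 2 -2 2 - "$dir/f2.sorted" |
    join -1 1 -2 2 -o 1.2,1.3,2.1,0 - "$dir/f3.sorted" > "$dir/out"
cat "$dir/out"
```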

Answer 3

With a small Python script, you can combine an unlimited number of files:

#!/usr/bin/env python3
import sys

# Read every file named on the command line and split each line into fields.
lines = []
for path in sys.argv[1:]:
    with open(path) as handle:
        lines.extend(line.split() for line in handle if line.strip())

# Collect the unique keys (second field); set order is arbitrary,
# so the order of the output rows may vary between runs.
keys = {fields[1] for fields in lines}

# For each key, gather the first fields from all files, then append the key.
for key in keys:
    row = [fields[0] for fields in lines if fields[1] == key] + [key]
    print("\t".join(row))

Copy it into an empty file, save it as merge.py, and run it with the command:

python3 /path/to/merge.py file1 file2 file3 [file4 file5 ...]

Output on the example files:

10  3   9   Hac.2
1   33  23  Hhe.7
2   15  70  Hpyl.1

Adding more files

As mentioned, the number of files is in principle unlimited. If I add a fourth file:

40   Hhe.7
50   Hpyl.1
60   Hac.2

and run the command:

python3 /path/to/merge.py file1 file2 file3 file4

the output will be:

40  23  33  1   Hhe.7
50  70  15  2   Hpyl.1
60  9   3   10  Hac.2

Answer 4

Answer from:

Shell script – merge some tab-delimited files

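The following script joins all the tab-delimited files passed as arguments on column (field) 1. It uses the join command on sorted files, two files at a time. Note that join, as called here, prints only the keys found in both inputs; for a true outer join, the join call would also need -a 1 -a 2.

It joins every line in the files, including header lines. If you want the headers excluded, change the two sort commands to something that produces a sorted file without the header.
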
#!/bin/sh
if test $# -lt 2
then
    echo usage: gjoin file1 file2 ...
    exit 1
fi
sort -t $'\t' -k 1 "$1" > result
shift
for f in "$@"
do
    sort -t $'\t' -k 1 "$f" > temp
    join -1 1 -2 1 -t $'\t' result temp > newresult
    mv newresult result
done
cat result
rm result temp
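
To see the difference between the default (inner) join and a full outer join, here is a minimal sketch on two made-up files whose keys only partly overlap. The placeholder NA and the file names are assumptions for the example; -e only takes effect together with -o, so the output fields are listed explicitly.

```shell
export LC_ALL=C
tab=$(printf '\t')
dir=$(mktemp -d)

printf 'a\t1\nb\t2\n' > "$dir/left"      # keys a and b
printf 'b\ty\nc\tz\n' > "$dir/right"     # keys b and c

# Default: only key b, which occurs in both files, is printed.
join -t "$tab" "$dir/left" "$dir/right" > "$dir/inner"

# -a 1 -a 2 also prints unpaired lines from either file;
# -e NA fills the fields that are missing on the other side.
join -t "$tab" -a 1 -a 2 -e NA -o 0,1.2,2.2 \
    "$dir/left" "$dir/right" > "$dir/outer"
cat "$dir/outer"
```

With more than two columns per file the -o list has to name every field to keep, which is why this is shown as a two-column sketch only.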

If you are using an older shell, $'\t' is not replaced by a tab, so you need to use 'TAB' instead, with a literal tab character between the quotes.
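
One way to avoid typing a literal tab is to let printf produce it and keep it in a variable, which should work in any POSIX shell. A short sketch with made-up input files (the variable name tab is arbitrary):

```shell
export LC_ALL=C
tab=$(printf '\t')                   # command substitution keeps the tab
dir=$(mktemp -d)

printf 'b\t2\na\t1\n' > "$dir/left"
printf 'a\tx\nb\ty\n' > "$dir/right"

# Use "$tab" wherever the script above uses $'\t'.
sort -t "$tab" -k 1,1 "$dir/left"  > "$dir/left.sorted"
sort -t "$tab" -k 1,1 "$dir/right" > "$dir/right.sorted"
join -t "$tab" "$dir/left.sorted" "$dir/right.sorted" > "$dir/joined"
cat "$dir/joined"
```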

If you can use a modern shell (e.g. bash or ksh) instead of /bin/sh, the script can be optimized; for example, the lines

sort -t $'\t' -k 1 "$f" > temp
join -1 1 -2 1 -t $'\t' result temp > newresult

can be replaced with

join -1 1 -2 1 -t $'\t' result <(sort -t $'\t' -k 1 "$f") > newresult
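
For the header case mentioned earlier, one possible replacement for the sort commands is to drop the first line with tail before sorting. This is only a sketch on a made-up file with a single header line; data.tsv and its contents are assumptions.

```shell
export LC_ALL=C
tab=$(printf '\t')
dir=$(mktemp -d)

# Hypothetical input with one header line.
printf 'id\tvalue\nb\t2\na\t1\n' > "$dir/data.tsv"

# tail -n +2 starts output at line 2, so the header never reaches sort.
tail -n +2 "$dir/data.tsv" | sort -t "$tab" -k 1,1 > "$dir/data.sorted"
cat "$dir/data.sorted"
```

In the script, the same tail pipeline would take the place of both sort calls, reading "$1" and "$f" respectively.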
