Iteratively read the nth column of two large files, cut and paste them side by side to create a new nth file

Question 1

Like this, using bash, with awk to get the number of columns and common commands from the toolbox:

#!/bin/bash

for i in  $(seq 1 $(awk '{print NF;exit}' test1.txt)); do
    paste <(sed 1d test1.txt | cut -d ' ' -f"$i") \
          <(sed 1d test2.txt | cut -d ' ' -f"$i") > "out.$i"
done

Or:

#!/bin/bash

numcols=$(awk '{print NF;exit}' test1.txt)
for ((i=1; i<=numcols; i++)); do 
    paste <(sed 1d test1.txt | cut -d ' ' -f"$i") \
          <(sed 1d test2.txt | cut -d ' ' -f"$i") > "out.$i"
done

Or using ksh:

#!/bin/ksh

numcols=$(awk '{print NF;exit}' test1.txt)
for i in  {1..$numcols}; do
    paste <(sed 1d test1.txt | cut -d ' ' -f"$i") \
          <(sed 1d test2.txt | cut -d ' ' -f"$i") > "out.$i"
done

Then:

cat out.1
cat out.2

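As explained in the comments, you can't mix awk and shell the way you did.

If you are not a developer, you are better off learning basic shell commands, like the ones I use here.

Please read the documentation for these basic commands: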

tr
paste
sed
seq

And one that is not so basic (used here in a simple way):

awk

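Process substitution: >(command ...) or <(...) is replaced by a temporary filename. Writing to or reading from that file causes bytes to be piped to or from the command inside. It is often used in combination with file redirection: cmd1 2> >(cmd2).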
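To make this concrete, here is a minimal sketch of a single iteration of the loop above (column 1 only). The two sample files and their contents are made up for illustration, and the sketch assumes bash or another shell that supports <(...).

```shell
#!/bin/bash
# Hypothetical sample data in a scratch directory (not the OP's files).
workdir=$(mktemp -d)
printf 'h1 h2\n1 2\n3 4\n' > "$workdir/a.txt"
printf 'h1 h2\n5 6\n7 8\n' > "$workdir/b.txt"

# Each <(...) is replaced by a file name that paste reads from;
# the pipeline inside it supplies that "file's" content.
col1=$(paste <(sed 1d "$workdir/a.txt" | cut -d ' ' -f1) \
             <(sed 1d "$workdir/b.txt" | cut -d ' ' -f1))

# Two tab-separated lines: column 1 of a.txt next to column 1 of b.txt.
printf '%s\n' "$col1"
```

Because paste is given file names, no temporary files have to be created or cleaned up by hand.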

See:
http://mywiki.wooledge.org/ProcessSubstitution
http://mywiki.wooledge.org/BashFAQ/024

Answer

Question 2

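Assumptions:

all files have at least one line (the header)
all files have the same number of lines
all files have the same number of columns
all files can fit into memory (via awk arrays)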
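These assumptions can be checked before merging. The following is a rough sketch, using made-up sample files whose names are placeholders, that reports the line count and the number of distinct field counts for each file.

```shell
# Hypothetical sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'rr1 rr2\n1 2\n3 4\n'   > "$workdir/good.txt"
printf 'rr1 rr2\n1 2 9\n3 4\n' > "$workdir/bad.txt"

# For each file, print: name, number of lines, number of distinct field counts.
# A file meets the "same number of columns" assumption when the last value is 1.
report=$(
    for f in "$workdir/good.txt" "$workdir/bad.txt"; do
        awk -v name="${f##*/}" '
            { seen[NF] = 1 }                 # remember every field count we meet
            END {
                n = 0
                for (k in seen) n++          # count the distinct field counts
                print name, NR, n
            }' "$f"
    done
)
printf '%s\n' "$report"
```

Files that report different line counts, or a last value other than 1, would break the pairing done by the merge script.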

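General approach:

we could use GNU awk for multi-dimensional arrays, but as a side effect we would use more memory (than with a single-dimensional index)
store the data in a single-dimensional array with an index of column # (NF) + row number (FNR) + file count
in the END{...} block, loop through the array and print the data to the out{1..NF} files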
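A short sketch of how such a single-dimensional index works: awk joins comma-separated subscripts into one string using the built-in SUBSEP variable, so an entry like lines[i,FNR,fcnt] is an ordinary array element with one key. The subscript values below are arbitrary.

```shell
# a[2,5,1] is stored under the single key "2" SUBSEP "5" SUBSEP "1".
result=$(awk 'BEGIN {
    a[2,5,1] = "x"
    for (key in a) {
        n = split(key, part, SUBSEP)       # recover the three subscripts
        print n, part[1], part[2], part[3], a[key]
    }
    if ((2,5,1) in a) print "found"        # membership test with the same syntax
}')
printf '%s\n' "$result"
```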

Using awk only:

$ cat merge.awk

FNR==1 { fcnt++ }                                       # keep track of number of files
FNR>1  { for (i=1; i<=NF; i++)                          # loop through columns
             lines[i,FNR,fcnt]=$i                       # index = column # + row number + file count
       }
END    { for (i=1; i<=NF; i++) {                        # loop through columns
             for (j=2; j<=FNR; j++)                     # loop through rows
                 for (k=1; k<=fcnt; k++)                # loop through filecount
                     printf "%s%s", lines[i,j,k], (k<fcnt ? OFS : ORS) > ("out" i)
             close ("out" i)
         }
       }

Running against the OP's two files:

$ awk -f merge.awk test1.txt test2.txt

$ head out?
==> out1 <==
1 2
1 1
1 2
2 2

==> out2 <==
2 2
2 1
1 1
1 2

Three new files:

$ head t?.txt
==> t1.txt <==
rr1 rr2 rr3
1 2 3
4 5 6
7 8 9

==> t2.txt <==
rr1 rr2 rr3
a b c
d e f
g h i

==> t3.txt <==
rr1 rr2 rr3
X XX XXX
Y YY YYY
Z ZZ ZZZ

Running against these three files:

$ awk -f merge.awk t1.txt t2.txt t3.txt

$ head out?
==> out1 <==
1 a X
4 d Y
7 g Z

==> out2 <==
2 b XX
5 e YY
8 h ZZ

==> out3 <==
3 c XXX
6 f YYY
9 i ZZZ

Answer

Question 3

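The error stems from the fact that you can't use single quotes inside a single-quoted string. The awk command sees paste -d as the awk program (which includes a syntax error, since it is cut short) and the rest of the code, up to the next unquoted space, as the first filename to process, and so on. You also can't simply use shell commands inside an awk program.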
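If the aim is to get a shell value into the awk program, the usual way is to pass it with -v rather than breaking out of the quotes. A minimal sketch, with an arbitrary column number and input line:

```shell
col=2    # shell variable holding the wanted column (arbitrary example value)

# The awk program stays inside one pair of single quotes;
# -v copies the shell value into the awk variable "c".
picked=$(printf 'a b c\n' | awk -v c="$col" '{ print $c }')
printf '%s\n' "$picked"
```

Here $c is evaluated by awk as "field number c", so the shell never needs to look inside the quoted program.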

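The following pipeline uses paste to feed the two files side by side into awk. The awk command writes each pair of corresponding columns from the two files to its own output file, one file per column.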

$ paste test1.txt test2.txt | awk 'NR > 1 { for (i = 1; i <= NF/2; ++i) print $i, $(NF/2+i) >("out" i) }'

$ cat out1
1 2
1 1
1 2
2 2

$ cat out2
2 2
2 1
1 1
1 2

The awk code, pretty-printed:

NR > 1 {
    for (i = 1; i <= NF/2; ++i)
        print $i, $(NF/2+i) > ("out" i)
}

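After ignoring the header on the first line of input, this code iterates over the fields of one of the files (we assume that both files have the same number of fields, and that the fields in the two files should be paired up in the same order). These are NF/2 fields, i.e. half of the fields we are given. It then prints the i-th field together with the field whose number we get by adding NF/2 to i, writing to a file named out followed by the field number i.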
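The pairing can be seen on a single pasted line. In this small sketch the input is invented: fields 1 to 3 stand for a line of the first file and fields 4 to 6 for the matching line of the second, and the pairs go to standard output instead of to files.

```shell
# With NF = 6, field i is paired with field 3+i.
pairs=$(printf 'a b c A B C\n' |
    awk '{ for (i = 1; i <= NF/2; ++i) print $i, $(NF/2+i) }')
printf '%s\n' "$pairs"
```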

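With a small modification, you can name the output files after the headers in the first file (we ignore the headers in the second file and assume that they are in the same order):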

NR == 1 {
    for (i = 1; i <= NF/2; ++i) head[i] = $i
    next
}

{
    for (i = 1; i <= NF/2; ++i)
        print $i, $(NF/2+i) > head[i]
}

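Given the data in the question, this would create the two files rr1 and rr2 (or overwrite them if they already exist).

As was correctly pointed out in comments below (now deleted), the above may cause a "Too many open files" error with 100000 columns and an awk implementation that does not intelligently manage its pool of open file descriptors (as GNU awk does). In other awk implementations, you need to close the output file after each print, and use >> (for appending) rather than >.

This is an adapted variant of the last awk snippet above: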

NR == 1 {
    for (i = 1; i <= NF/2; ++i) head[i] = $i
    next
}

{
    for (i = 1; i <= NF/2; ++i) {
        print $i, $(NF/2+i) >> head[i]
        close(head[i])
    }
}
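One consequence of switching to >> is that a second run adds to whatever an earlier run left behind, so stale output files should be removed first. A small sketch of that behaviour, using made-up one-row sample files in a scratch directory (split_columns is just a hypothetical wrapper name for the pipeline):

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'rr1 rr2\n1 2\n' > test1.txt
printf 'rr1 rr2\n3 4\n' > test2.txt

split_columns() {
    paste test1.txt test2.txt | awk '
        NR == 1 { for (i = 1; i <= NF/2; ++i) head[i] = $i; next }
        {
            for (i = 1; i <= NF/2; ++i) {
                print $i, $(NF/2+i) >> head[i]   # append, then release the descriptor
                close(head[i])
            }
        }'
}

split_columns
split_columns
twice=$(cat rr1)     # the second run appended to the first run's output

rm -f rr1 rr2        # remove stale output before a fresh run
split_columns
once=$(cat rr1)

printf 'after two runs:\n%s\nafter cleanup and one run:\n%s\n' "$twice" "$once"
```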

Answer