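I have two files on a Linux machine that contain different columns (with , as the separator). I have a script that joins these files using the IDs reported in the first column of each file. This script keeps all IDs of the first file in the output, and only the matching IDs of the second file. I need to extend this script with an option to also keep the IDs of the second file that do not match any ID in the first file.
Example:
input1.txt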
2931,C,-9.750,-2.550,57.910,-0.3,C
2932,C,-5.470,-0.200,51.550,0.9,C
2940,C,-10.860,-3.400,54.000,0.7,C
2941,S,-11.820,-13.550,55.070,2.1,S
2944,H,-3.770,-4.180,60.300,0.7,H
input2.txt
4304,N,-9.700,-7.680,58.330,-2.3,N
2940,S,-10.440,-3.450,54.270,2.2,S
2900,C,-13.655,-13.730,59.405,-1.5,C
2931,C,-9.910,-2.420,57.610,0.2,C
Command:
join -t, -a1 -o auto <(sort input1.txt) <(sort input2.txt) > output.txt
output.txt
2931,C,-9.750,-2.550,57.910,-0.3,C,2931,C,-9.910,-2.420,57.610,0.2,C
2932,C,-5.470,-0.200,51.550,0.9,C,,,,,,,
2940,C,-10.860,-3.400,54.000,0.7,C,2940,S,-10.440,-3.450,54.270,2.2,S
2941,S,-11.820,-13.550,55.070,2.1,S,,,,,,,
2944,H,-3.770,-4.180,60.300,0.7,H,,,,,,,
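I would like to modify the command to obtain two output files. The first should look like what I get now, but it should also include the non-matching IDs of input2.txt:
output_final.txt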
2931,C,-9.750,-2.550,57.910,-0.3,C,2931,C,-9.910,-2.420,57.610,0.2,C
2932,C,-5.470,-0.200,51.550,0.9,C,,,,,,,
2940,C,-10.860,-3.400,54.000,0.7,C,2940,S,-10.440,-3.450,54.270,2.2,S
2941,S,-11.820,-13.550,55.070,2.1,S,,,,,,,
2944,H,-3.770,-4.180,60.300,0.7,H,,,,,,,
,,,,,,,2900,C,-13.655,-13.730,59.405,-1.5,C
,,,,,,,4304,N,-9.700,-7.680,58.330,-2.3,N
The other output file should contain only the non-matching rows of input2.txt:
output2.txt
2900,C,-13.655,-13.730,59.405,-1.5,C
4304,N,-9.700,-7.680,58.330,-2.3,N
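In addition, what should I do if, in input2.txt, I want to replace the value in the last column with the string "P" only for rows whose ID is equal to or greater than 4000?
That is, I only want to replace the last "N" of the first row (ID = 4304) with "P":
output.txt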
4304,N,-9.700,-7.680,58.330,-2.3,P
2940,S,-10.440,-3.450,54.270,2.2,S
2900,C,-13.655,-13.730,59.405,-1.5,C
2931,C,-9.910,-2.420,57.610,0.2,C
Answer 1
Task 1:
Assuming the IDs within each file are unique, you can use awk as follows:
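awk -F, -v OFS=, '
# First file read (file2): index each line by its ID.
NR == FNR {
    m[$1] = $0
    # Build the padding once: NF empty fields need NF-1 separators
    # (both files are assumed to have the same number of columns).
    while (++i < NF) empty = OFS empty
    next
}
# Second file read (file1): append the matching file2 line, or padding.
($1 in m) { print $0, m[$1]; delete m[$1]; next }
{ print $0, empty }
END {
    # Entries still in m occurred only in file2.
    for (id in m) print empty, m[id]
}
' file2 file1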
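Note that you do not need to sort the files. Whenever an ID common to both files is encountered, it is deleted from the array. At the end, the array holds only the lines that appeared in file2 alone. The order in which the END block prints those leftover lines is not guaranteed.
Task 2: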
awk -F, 'NR == FNR {m[$1];next} !($1 in m)' file1 file2
Put the two commands above into a shell script with output redirection:
#!/bin/bash
# first awk cmd
... > output1.txt
# Second awk cmd
... > output2.txt
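For illustration, here is a minimal sketch of such a script with both commands filled in. The function name join_ids and its four arguments are placeholders of mine, and it assumes both files have the same number of columns, unique IDs, and a non-empty second file:

```shell
#!/bin/sh
# Sketch only. Usage: join_ids FILE1 FILE2 OUT_ALL OUT_UNMATCHED
# (join_ids and the argument names are hypothetical.)
join_ids() {
    # Task 1: full outer join on column 1; FILE2 is read first.
    awk -F, -v OFS=, '
        NR == FNR { m[$1] = $0; while (++i < NF) empty = OFS empty; next }
        ($1 in m) { print $0, m[$1]; delete m[$1]; next }
        { print $0, empty }
        END { for (id in m) print empty, m[id] }
    ' "$2" "$1" > "$3"
    # Task 2: lines of FILE2 whose ID never occurs in FILE1.
    awk -F, 'NR == FNR { m[$1]; next } !($1 in m)' "$1" "$2" > "$4"
}
```

It could then be called as: join_ids input1.txt input2.txt output1.txt output2.txt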
Answer 2
You can get the first output file you want, the one containing all IDs from both files, by telling join to include all fields:
$ join -t, -a1 -a2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.1,2.2,2.3,2.4,2.5,2.6,2.7 \
<(sort input1.txt) <(sort input2.txt)
,,,,,,,2900,C,-13.655,-13.730,59.405,-1.5,C
2931,C,-9.750,-2.550,57.910,-0.3,C,2931,C,-9.910,-2.420,57.610,0.2,C
2932,C,-5.470,-0.200,51.550,0.9,C,,,,,,,
2940,C,-10.860,-3.400,54.000,0.7,C,2940,S,-10.440,-3.450,54.270,2.2,S
2941,S,-11.820,-13.550,55.070,2.1,S,,,,,,,
2944,H,-3.770,-4.180,60.300,0.7,H,,,,,,,
,,,,,,,4304,N,-9.700,-7.680,58.330,-2.3,N
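Note that the order differs from the one you showed, because this is the order in which the lines are found in the sorted files (the ,,,,,,,2900... line comes first because 2900 is the first ID in the output of sort input2.txt).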
You can then get the second output file by parsing the first one and looking for lines that start with one or more commas:
$ join -t, -a1 -a2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.1,2.2,2.3,2.4,2.5,2.6,2.7 \
<(sort input1.txt) <(sort input2.txt) | grep -oP '^,+\K.*'
2900,C,-13.655,-13.730,59.405,-1.5,C
4304,N,-9.700,-7.680,58.330,-2.3,N
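The -o option tells grep to print only the matching part of the line, and -P enables Perl-compatible regular expressions (PCRE). PCRE then gives us \K, which means "discard everything matched up to this point", so only the part after the leading run of commas is printed.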
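Since -P is a GNU grep extension and is not available in every build, here is a sketch of the same filter using sed instead. This swaps in a different tool from the one used in this answer, and strip_leading_commas is a hypothetical helper name:

```shell
# Sketch: keep only lines that start with commas, minus that leading run.
# strip_leading_commas is a hypothetical helper name.
strip_leading_commas() {
    # -n suppresses the default output; the p flag prints only the lines
    # on which the substitution happened, i.e. lines that began with a comma.
    sed -n 's/^,,*//p'
}
```

It reads standard input, so it can take the place of the grep stage in the pipeline, for example: strip_leading_commas < output1.txt > output2.txt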
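You can combine these into a single command that produces both files, using tee to write the first output to a file and also to standard output, on which you can then run grep as shown above: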
join -t, -a1 -a2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.1,2.2,2.3,2.4,2.5,2.6,2.7 \
<(sort input1.txt) <(sort input2.txt) |
tee output1.txt | grep -oP '^,+\K.*' > output2.txt
The final output is:
$ cat output1.txt
,,,,,,,2900,C,-13.655,-13.730,59.405,-1.5,C
2931,C,-9.750,-2.550,57.910,-0.3,C,2931,C,-9.910,-2.420,57.610,0.2,C
2932,C,-5.470,-0.200,51.550,0.9,C,,,,,,,
2940,C,-10.860,-3.400,54.000,0.7,C,2940,S,-10.440,-3.450,54.270,2.2,S
2941,S,-11.820,-13.550,55.070,2.1,S,,,,,,,
2944,H,-3.770,-4.180,60.300,0.7,H,,,,,,,
,,,,,,,4304,N,-9.700,-7.680,58.330,-2.3,N
$ cat output2.txt
2900,C,-13.655,-13.730,59.405,-1.5,C
4304,N,-9.700,-7.680,58.330,-2.3,N
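As for the last part of the question (setting the last column to "P" for rows whose ID is 4000 or more), one possible approach is a short awk filter. This is only a sketch: mark_high_ids is a hypothetical helper name, and it assumes the ID in column 1 is numeric:

```shell
# Sketch: set the last column to "P" where the ID (column 1) is >= the
# threshold given as the first argument. mark_high_ids is a hypothetical name.
mark_high_ids() {
    # "+ 0" forces a numeric comparison; assigning $NF rebuilds the line
    # with OFS, and the trailing 1 prints every line, changed or not.
    awk -F, -v OFS=, -v min="$1" '$1 + 0 >= min + 0 { $NF = "P" } 1'
}
```

For the example in the question this would be run as: mark_high_ids 4000 < input2.txt > output.txt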