I have a file A.tsv (field separator: \t):
BC02 Streptococcus oralis chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597
BC02 Staphylococcus aureus chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597
BC02 Streptococcus sp. chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597
I want to append a new column to the end of each line, containing only the first two words of column $2, to get:
BC02 Streptococcus oralis chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Streptococcus oralis
BC02 Staphylococcus aureus chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Staphylococcus aureus
BC02 Streptococcus sp. chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Streptococcus sp.
I looked at a few Stack threads about awk commands, but none was similar enough to serve as a starting point.
Do you know how to do this?
Answer 1
Try split()-ting the second column on spaces and printing as many words as you want, for example:
awk 'BEGIN{ FS=OFS="\t" }
{ split($2, tmp, " "); print $0, tmp[1] " " tmp[2] }' infile
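For comparison, here is a minimal Python sketch of the same split-and-append idea, with the number of words as a parameter. The function name add_first_words and the n_words parameter are hypothetical, chosen only for illustration, and the sample line is a shortened version of the data above.

```python
def add_first_words(line, n_words=2):
    """Append the first n_words words of the second tab-separated field."""
    fields = line.rstrip("\n").split("\t")
    words = fields[1].split()              # split column 2 on runs of whitespace
    fields.append(" ".join(words[:n_words]))
    return "\t".join(fields)


# Shortened sample record (assumed layout: tab-separated, column 2 is the description)
sample = "BC02\tStreptococcus oralis chromosome, complete genome\t2712\t94"
print(add_first_words(sample))
```

Raising n_words keeps more of the description; if column 2 has fewer words than requested, the slice simply returns what is there.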
Answer 2
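For more complex TSV files, for example when a field itself contains a tab, awk does not work well. In that case you should use a proper CSV parser, such as Python's csv module: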
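#!/usr/bin/env python3
import csv
import sys

with open('A.tsv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    # csv.writer re-quotes any field that itself contains a tab
    writer = csv.writer(sys.stdout, delimiter='\t', lineterminator='\n')
    for row in reader:
        # append the first two whitespace-separated words of column 2
        row.append(' '.join(row[1].split()[:2]))
        writer.writerow(row)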
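To illustrate why a tab inside a field matters, here is a small sketch comparing a naive split on tabs with csv.reader. The record is made up for the demonstration and assumes the embedded tab is protected by double quotes, which is the csv module's default quoting convention.

```python
import csv

# Hypothetical record: the second field contains a literal tab inside double quotes.
raw = 'BC02\t"Streptococcus oralis\tstrain X"\t2712\n'

# Naive approach: every tab is treated as a field separator.
naive = raw.rstrip("\n").split("\t")

# csv.reader accepts any iterable of lines and honours the quoting.
parsed = next(csv.reader([raw], delimiter="\t"))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split sees four fields and cuts the description in half, while the csv module returns three fields with the tab preserved inside the second one.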
Answer 3
Using GNU awk for gensub() and the \s/\S shorthand:
$ awk '{print gensub(/\S+\s+(\S+\s+\S+).*/,"&\t\\1",1)}' file
BC02 Streptococcus oralis chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Streptococcus oralis
BC02 Staphylococcus aureus chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Staphylococcus aureus
BC02 Streptococcus sp. chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Streptococcus sp.
Or, a bit more briefly, using GNU sed:
$ sed -E 's/\S+\s+(\S+\s+\S+).*/&\t\1/' file
BC02 Streptococcus oralis chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Streptococcus oralis
BC02 Staphylococcus aureus chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Staphylococcus aureus
BC02 Streptococcus sp. chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Streptococcus sp.
The above assumes that the first field contains no whitespace, as in the example.
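That assumption can be checked with a short Python sketch that uses the same regular expression through the re module. The helper name append_two_words and the second sample line (with a space in the first field) are made up for illustration.

```python
import re

# Same pattern as in the awk/sed commands: skip the first "word",
# then capture the next two whitespace-separated words.
PATTERN = re.compile(r"\S+\s+(\S+\s+\S+).*")


def append_two_words(line):
    # \g<0> is the whole match (like & in sed), \1 is the captured pair.
    return PATTERN.sub(r"\g<0>\t\1", line, count=1)


good = "BC02\tStreptococcus oralis chromosome, complete genome\t2712"
bad = "BC 02\tStreptococcus oralis chromosome, complete genome\t2712"

print(repr(append_two_words(good)))
print(repr(append_two_words(bad)))
```

With the first line the appended column is "Streptococcus oralis". With the second, the pattern treats "BC" as the first field, so the captured text starts at "02" and the result is wrong, which is why the caveat about whitespace in the first field matters.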
Answer 4
Using Raku (formerly known as Perl_6)
raku -ne 'print $_, "\t"; .split(/\t/).[1].words.[0..1].put;'
Sample input:
BC02 Streptococcus oralis chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597
BC02 Staphylococcus aureus chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597
BC02 Streptococcus sp. chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597
Breaking the code above into three (3) parts:
1). Split on tabs and pull out the second element (remember that numbering starts from zero in Raku):
raku -ne '.split(/\t/).[1].put;'
Sample output:
Streptococcus oralis chromosome, complete genome
Staphylococcus aureus chromosome, complete genome
Streptococcus sp. chromosome, complete genome
2). Break into whitespace-separated words and take the first two (2):
raku -ne '.split(/\t/).[1].words.[0..1].put;'
Sample output:
Streptococcus oralis
Staphylococcus aureus
Streptococcus sp.
3). Combine the above with the entire pre-existing line by first printing $_, the Raku topic variable (followed by \t):
raku -ne 'print $_, "\t"; .split(/\t/).[1].words.[0..1].put;'
Sample output:
BC02 Streptococcus oralis chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Streptococcus oralis
BC02 Staphylococcus aureus chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Staphylococcus aureus
BC02 Streptococcus sp. chromosome, complete genome 2712 94 0 99.073 2053209 CP023507.1 1597 Streptococcus sp.