How to print only the first two words of a column using awk

Answer 1

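Try split() on the second column, using a space as the separator, and print as many words as you want. Note that because OFS is a tab, tmp[1] and tmp[2] below are printed as two separate tab-delimited fields; write tmp[1] " " tmp[2] instead to keep them in a single field. For example: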

awk 'BEGIN{ FS=OFS="\t" }
{ split($2, tmp, " "); print $0, tmp[1], tmp[2] }' infile
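
For comparison, the same idea can be sketched in Python. This is only an illustration under assumptions: the helper name append_first_words and its col and n parameters are hypothetical, fields are assumed to be separated by single tabs with no quoting, and the words are joined with a space so that they stay in one output field.

```python
def append_first_words(line, col=1, n=2):
    """Return line with the first n words of field col (0-based) appended.

    Hypothetical helper; assumes plain tab-separated fields without quoting.
    """
    fields = line.rstrip("\n").split("\t")
    # str.split() with no argument splits on runs of whitespace,
    # much like awk's split($2, tmp, " ").
    words = fields[col].split()[:n]
    return "\t".join(fields + [" ".join(words)])


if __name__ == "__main__":
    sample = "BC02\tStreptococcus oralis  chromosome, complete genome\t2712"
    print(append_first_words(sample))
```

Changing n gives "as many words as you want" without touching the rest of the logic.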

Answer 2

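For more complex TSV cases, for example if there are tabs inside fields, awk does not work well. In that case you should use a proper CSV parser, such as Python's csv module: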

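#!/usr/bin/env python3
import csv
import sys

# newline='' lets the csv module handle line endings itself.
with open('A.tsv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    # Write with csv.writer so fields containing tabs are quoted again.
    writer = csv.writer(sys.stdout, delimiter='\t', lineterminator='\n')
    for row in reader:
        # Append the first two whitespace-separated words of the second column.
        row.append(' '.join(row[1].split()[:2]))
        writer.writerow(row)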
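
To illustrate why a tab inside a field matters, here is a small sketch comparing naive tab splitting with csv.reader. The record is made up for the example, and it assumes the file quotes such fields with double quotes, which is the csv module's default quote character.

```python
import csv
import io

# Hypothetical record: the second field is quoted and contains a tab.
data = 'BC02\t"Streptococcus\toralis strain"\t2712\n'

# Naive splitting (what FS="\t" does in awk) breaks the quoted field apart.
naive = data.rstrip("\n").split("\t")

# csv.reader honours the quotes and keeps the field whole.
parsed = next(csv.reader(io.StringIO(data), delimiter="\t"))

print(len(naive), naive)
print(len(parsed), parsed)
# First two words of the second column, taken from the correctly parsed row.
print(" ".join(parsed[1].split()[:2]))
```

The naive split yields four pieces while the parser yields the intended three fields, so only the parsed row gives the right second column.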

Answer 3

Using GNU awk for gensub() and the \s/\S shorthand:

$ awk '{print gensub(/\S+\s+(\S+\s+\S+).*/,"&\t\\1",1)}' file
BC02    Streptococcus oralis  chromosome, complete genome   2712    94  0   99.073  2053209 CP023507.1  1597    Streptococcus oralis
BC02    Staphylococcus aureus  chromosome, complete genome  2712    94  0   99.073  2053209 CP023507.1  1597    Staphylococcus aureus
BC02    Streptococcus sp.  chromosome, complete genome  2712    94  0   99.073  2053209 CP023507.1  1597        Streptococcus sp.

Or, a bit more briefly, using GNU sed:

$ sed -E 's/\S+\s+(\S+\s+\S+).*/&\t\1/' file
BC02    Streptococcus oralis  chromosome, complete genome   2712    94  0   99.073  2053209 CP023507.1  1597    Streptococcus oralis
BC02    Staphylococcus aureus  chromosome, complete genome  2712    94  0   99.073  2053209 CP023507.1  1597    Staphylococcus aureus
BC02    Streptococcus sp.  chromosome, complete genome  2712    94  0   99.073  2053209 CP023507.1  1597        Streptococcus sp.

The above assumes that the first field does not contain any spaces, as in the example.
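
The same substitution can be sketched with Python's re module, which may help when checking the pattern in isolation. The function name append_first_two and the sample line are assumptions made for the example; the pattern itself is the one used above.

```python
import re

# Same pattern as the gensub()/sed examples: skip the first field, then
# capture two whitespace-separated words.
PATTERN = re.compile(r"\S+\s+(\S+\s+\S+).*")


def append_first_two(line):
    # \g<0> is the whole match (like & in awk/sed); \1 is the captured group.
    return PATTERN.sub(r"\g<0>\t\1", line, count=1)


if __name__ == "__main__":
    # Assumed sample line with tab-separated fields.
    print(append_first_two("BC02\tStreptococcus sp.  chromosome, complete genome\t2712"))
```

If the first field did contain a space, the capture would begin inside that field, which is why the assumption about the first field matters.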

Answer 4

Using Raku (formerly known as Perl_6)

raku -ne 'print $_, "\t"; .split(/\t/).[1].words.[0..1].put;'

Sample input:

BC02    Streptococcus oralis  chromosome, complete genome   2712    94  0   99.073  2053209 CP023507.1  1597
BC02    Staphylococcus aureus  chromosome, complete genome  2712    94  0   99.073  2053209 CP023507.1  1597
BC02    Streptococcus sp.  chromosome, complete genome  2712    94  0   99.073  2053209 CP023507.1  1597

Breaking the above code down into three (3) parts:

1). Split on tabs and pull out the second element (remember that numbering starts from zero in Raku):

raku -ne '.split(/\t/).[1].put;'

Gives sample output:

Streptococcus oralis  chromosome, complete genome
Staphylococcus aureus  chromosome, complete genome
Streptococcus sp.  chromosome, complete genome

2). Break into whitespace-separated words and take the first two (2):

raku -ne '.split(/\t/).[1].words.[0..1].put;'

Gives sample output:

Streptococcus oralis
Staphylococcus aureus
Streptococcus sp.

3). Combine the above with the entire pre-existing line by first printing $_, the Raku topic variable (followed by a \t):

raku -ne 'print $_, "\t"; .split(/\t/).[1].words.[0..1].put;'

Gives sample output:

BC02    Streptococcus oralis  chromosome, complete genome   2712    94  0   99.073  2053209 CP023507.1  1597    Streptococcus oralis
BC02    Staphylococcus aureus  chromosome, complete genome  2712    94  0   99.073  2053209 CP023507.1  1597    Staphylococcus aureus
BC02    Streptococcus sp.  chromosome, complete genome  2712    94  0   99.073  2053209 CP023507.1  1597    Streptococcus sp.
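
For readers more familiar with Python, the three parts above can be mirrored roughly as follows. This is a sketch, not a translation of Raku semantics: the function names are hypothetical, and the input line is assumed to be tab-separated with no trailing newline.

```python
def second_field(line):
    # Part 1: split on tabs and take the element at index 1.
    return line.split("\t")[1]


def first_two_words(field):
    # Part 2: split on whitespace and keep the first two words.
    return " ".join(field.split()[:2])


def with_first_two_words(line):
    # Part 3: the original line, a tab, then the two words.
    return line + "\t" + first_two_words(second_field(line))


if __name__ == "__main__":
    sample = "BC02\tStaphylococcus aureus  chromosome, complete genome\t2712"
    print(second_field(sample))
    print(first_two_words(second_field(sample)))
    print(with_first_two_words(sample))
```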

https://raku.org/
