sed: find a pattern and replace another pattern on the same line

Question 1

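The following perl script tries to match gene, product, and sprot, in that order, on each input line (i.e. it prefers gene over product, and product over sprot). If one of them matches, it extracts the word following the match. That word is assumed to be enclosed in double quotes.

If a match is found, it replaces the word following gene_id with the extracted word.

The line is printed whether or not it was modified.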

#!/usr/bin/perl

while (<>) {
  my $word = '';

  if (m/\b(?:gene)\s+("[^"]*")/) {
    $word = $1;
  } elsif (m/\b(?:product)\s+("[^"]*")/) {
    $word = $1;
  } elsif (m/\b(?:sprot)\s+("[^"]*")/) {
    $word = $1;
  };

  if ($word) {
    s/\bgene_id\s+(?:"[^"]*")/gene_id $word/
  };

  print;
}

Alternatively, it can be written with a loop that iterates over the keywords to match:

#!/usr/bin/perl

while (<>) {
  my $word = '';

  foreach my $match (qw(gene product sprot)) {
    if (m/\b(?:$match)\s+("[^"]*")/) {
      $word = $1;
      last; # first match wins, exit this loop
    }
  };

  if ($word) {
    s/\bgene_id\s+(?:"[^"]*")/gene_id $word/
  };

  print;
}

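IMO, this version is better because it's easier to read and understand (in particular, the foreach loop emphasizes that it's about iterating over a list of words). More importantly, it avoids repeating the $word = $1 statement: if you need to change it or add extra code, you're less likely to make a mistake if you only have to do it once rather than three times. "Don't Repeat Yourself" doesn't matter much in a small program like this, but it can be very important in larger ones. In any case, avoiding or minimizing repetition is a good programming habit.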

If the order of matching doesn't matter (i.e. you don't care which one is found, as long as one is), you can simplify the script:

#!/usr/bin/perl

while (<>) {
  my ($word) = m/\b(?:gene|product|sprot)\s+("[^"]*")/;

  if ($word) {
    s/\bgene_id\s+(?:"[^"]*")/gene_id $word/
  };

  print;
}
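One thing worth checking before relying on the alternation: a regex alternation returns the leftmost match in the line, not the first alternative listed. The sketch below illustrates this with grep -oE standing in for the Perl match (an assumption made only to keep the example in plain shell; -o is a common extension rather than POSIX). The sample line and the variable names are made up for illustration.

```shell
# Hypothetical line in which product appears before gene.
line='gene_id "old"; product "NADH"; gene "ND1";'

# Print every match on its own line and keep the first (leftmost) one.
first=$(printf '%s\n' "$line" | grep -oE '(gene|product|sprot) "[^"]*"' | head -n 1)
printf '%s\n' "$first"
```

This prints product "NADH", even though gene is listed first in the alternation. With input like the sample below, where gene always precedes product and sprot, the simplified script and the ordered versions should give the same result.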

Whichever version of the script you use, save it as, e.g., replace.pl and make it executable with chmod +x replace.pl. Or try them all as replace1.pl, replace2.pl, replace3.pl. Then run it like this:

$ ./replace.pl input.txt 
chrM    Gnomon  CDS 8345    8513    .   +   1   gene_id "semaphorin-3F"; transcript_id "cds-XP_008824843.3"; Parent "rna-XM_008826621.3"; Dbxref "GeneID:103728653_Genbank:XP_008824843.3"; Name "XP_008824843.3"; end_range "8513,."; gbkey "CDS"; gene "semaphorin-3F"; partial "true"; product "semaphorin-3F"; protein_id "XP_008824843.3"; sprot "sp|Q13275|SEM3F_HUMAN";
chrM    StringTie   exon    2754    3700    .   +   .   gene_id "ND1"; transcript_id "cds-YP_007626758.1"; Parent "gene-ND1"; Dbxref "Genbank:YP_007626758.1,GeneID:15088436"; Name "YP_007626758.1"; Note "TAAstopcodoniscompletedbytheadditionof3'AresiduestothemRNA"; gbkey "CDS"; gene "ND1"; product "NADHdehydrogenasesubunit1"; protein_id "YP_007626758.1"; transl_except "(pos:3700..3700%2Caa:TERM)"; transl_table "2";

Question 2

We exploit a property of hashes: if several values are assigned to a given key, the last one assigned becomes the final value.

perl -lpe 'my($l,%h)=($_);
  $h{gene_id}=$_ for map {
     $l =~ /\b$_\s+(".*?");/
  } reverse qw(gene product sprot);
  s/\bgene_id\s+\K".*?";/$h{gene_id};/ if defined $h{gene_id};
' your_file_genes

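Since the commands are all identical and only the field name changes, we can easily make the whole operation table-driven: we just supply the field names, and the for loop takes care of the rest.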

for i in gene product sprot;do
  cat - <<\_FMT_ |\
  sed -e "s/%s/$i/"
s/(\<gene_id\s+)"[^"]*"(.*\s%s\s+("[^"]*"))/\1\3\2/;t
_FMT_
done | sed -Ef - your_file_genes
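To see what the loop actually feeds to sed -Ef -, the generated script can be printed instead of executed. This is a minimal sketch that keeps the same template and drops only the final sed invocation; the variable name script is introduced here for illustration.

```shell
# Build the sed script from the template, one command per field name.
script=$(
  for i in gene product sprot; do
    sed -e "s/%s/$i/" <<\_FMT_
s/(\<gene_id\s+)"[^"]*"(.*\s%s\s+("[^"]*"))/\1\3\2/;t
_FMT_
  done
)

# Show the three generated commands.
printf '%s\n' "$script"
```

Each generated line ends in t, which (with no label) jumps to the end of the script as soon as a substitution has succeeded. The first field that matches therefore wins, which is what gives gene priority over product and sprot. Note that the template relies on \< and \s, which GNU sed supports.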

Question 3

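To complement the perl solutions, here is how to do it with sed. I'm not sure how you expected the syntax you gave to work, but what you actually need is a regular expression that matches the string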

... gene_id "remove me" ... some other stuff gene "replacement" ... more stuff
    =======                                  ====
    gene_id   "[^"]*"        .*              gene    "[^"]*"

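gene_id and gene match themselves. A string in double quotes is the concatenation of a double quote, any number of non-double-quote characters ([^"]*), and another double quote. Finally, the stuff in between is matched by .*

Now you need to mark the parts you want to reuse with \(...\) and place them in the replacement: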

sed 's/gene_id "[^"]*"\(.* gene \("[^"]*"\)\)/gene_id \2\1/'

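The outer pair of \(...\) covers everything that should stay unchanged; it can be reused as \1 in the replacement. The inner pair is the string you want to reuse as the new value of gene_id.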
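As a quick check of how the two groups end up in the replacement, the command can be run on a short line. The line below is a made-up, shortened stand-in for the real input, and the variable names are only for illustration.

```shell
# Hypothetical, shortened input line.
line='x gene_id "old"; Name "n1"; gene "ND1"; product "NADH";'

# \1 is everything from after the old ID up to and including gene "ND1";
# \2 is just the quoted string "ND1".
result=$(printf '%s\n' "$line" |
  sed 's/gene_id "[^"]*"\(.* gene \("[^"]*"\)\)/gene_id \2\1/')

printf '%s\n' "$result"
```

This prints x gene_id "ND1"; Name "n1"; gene "ND1"; product "NADH"; so only the quoted string after gene_id has changed.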

Now, if you want to use product or sprot as alternative sources for the replacement, you can use an alternation in an extended regular expression:

sed -E 's/gene_id "[^"]*"(.*(gene|product|sprot) ("[^"]*"))/gene_id \3\1/'

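But this will not prefer gene over product over sprot; it picks whichever of them occurs last in the line. If you want that order of preference, you need separate steps, starting with the least preferred one so that it can be overwritten by a better one: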

sed 's/gene_id "[^"]*"\(.* sprot \("[^"]*"\)\)/gene_id \2\1/
     s/gene_id "[^"]*"\(.* product \("[^"]*"\)\)/gene_id \2\1/
     s/gene_id "[^"]*"\(.* gene \("[^"]*"\)\)/gene_id \2\1/'
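The difference between the single alternation and the three separate steps shows up on a line that contains all three fields. The following sketch uses a made-up, shortened line; the variable names last and best are only for illustration.

```shell
# Hypothetical line containing all three fields.
line='x gene_id "old"; gene "ND1"; product "NADH"; sprot "sp|P1";'

# Single alternation: the greedy .* makes the last field in the line win.
last=$(printf '%s\n' "$line" |
  sed -E 's/gene_id "[^"]*"(.*(gene|product|sprot) ("[^"]*"))/gene_id \3\1/')

# Three steps, least preferred first: each later step overwrites the earlier one.
best=$(printf '%s\n' "$line" |
  sed 's/gene_id "[^"]*"\(.* sprot \("[^"]*"\)\)/gene_id \2\1/
       s/gene_id "[^"]*"\(.* product \("[^"]*"\)\)/gene_id \2\1/
       s/gene_id "[^"]*"\(.* gene \("[^"]*"\)\)/gene_id \2\1/')

printf '%s\n%s\n' "$last" "$best"
```

The first output line has gene_id "sp|P1" (the last field in the line), the second has gene_id "ND1" (the preferred field).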

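Alternatively, if the order of gene, product and sprot is known to be fixed, you can first extract the preferred ID while parking the actual line in the hold space: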

sed -E 'h;s/(sprot|product|gene) ("[^"]*").*/#\2/;s/.*#//;G;s/(.*)\n(.*gene_id )"[^"]*"/\2\1/'

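The marker # can be any string known not to be part of the ID; with GNU sed you can use \n to be on the safe side. So you replace the first of the above strings with the marker followed by the ID and drop the rest of the line, then remove everything up to and including the marker, so that only the ID is left in the pattern space. Then G appends the original line (which we saved in the hold buffer with h), and finally the "string" after gene_id is replaced with the ID (the part before the newline). Somehow it's easier to write than to explain.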
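The intermediate state can be made visible by running only the first two substitutions, then the full command. This is a sketch on a made-up, shortened line; the variable names id and result are only for illustration.

```shell
# Hypothetical line with the fields in the fixed order gene, product, sprot.
line='x gene_id "old"; gene "ND1"; product "NADH"; sprot "sp|P1";'

# First two substitutions only: what is left in the pattern space before G runs.
id=$(printf '%s\n' "$line" |
  sed -E 's/(sprot|product|gene) ("[^"]*").*/#\2/;s/.*#//')

# Full command: h saves the line, G appends it after the extracted ID,
# and the last substitution moves the ID into place after gene_id.
result=$(printf '%s\n' "$line" |
  sed -E 'h;s/(sprot|product|gene) ("[^"]*").*/#\2/;s/.*#//;G;s/(.*)\n(.*gene_id )"[^"]*"/\2\1/')

printf '%s\n%s\n' "$id" "$result"
```

The first output line is "ND1" (the extracted ID), the second is the original line with gene_id "ND1". Note that gene_id itself is not picked up by the first substitution, because the pattern requires a space directly after gene.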
