Getting data from a list of matches

Answer 1

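If you slightly modify the solution provided by Gilles in this question (also referred to by jw013), you get the effect you asked for, except that the order follows the input sequence and so differs from the order listed in output.txt in the question: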

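awk -v patterns_file=list.txt '
BEGIN {
  # Load the wanted IDs; prefix ">" so they compare equal to FASTA header lines.
  # Testing "> 0" ends the loop if the list cannot be read (getline returns -1).
  while ((getline line < patterns_file) > 0)
    patterns_array[">" line] = 1
  close(patterns_file)
}
# On a matching header, print it and the sequence line that follows.
$0 in patterns_array { print; getline; print }
' sample[1-3].txt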

Output:

>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
sampletextforsample1
>GETID_11084_knownids_3/3_Confidence_0.600_Length_1451
sampletextforsample2
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
sampletextforsample3
>GETID_15916_knownids_10/11_Confidence_0.324_Length_1825
sample2textforsample3
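
To try the command without the original files, the sketch below rebuilds a small test set. The file contents are assumptions inferred from the output above (the question's real files are not shown here), and the GETID_99999 record is a made-up, unlisted entry added only to show that non-matching records are skipped.

```shell
#!/bin/sh
# Hypothetical test data: contents are inferred from the output shown above.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# IDs to extract, one per line, without the leading ">".
printf '%s\n' \
  'GETID_17049_knownids_1/2_Confidence_0.625_Length_2532' \
  'GETID_11084_knownids_3/3_Confidence_0.600_Length_1451' \
  'GETID_15916_knownids_10/11_Confidence_0.324_Length_1825' > list.txt

# sample1.txt also holds one made-up record that is not in the list.
printf '%s\n' \
  '>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532' 'sampletextforsample1' \
  '>GETID_99999_knownids_0/1_Confidence_0.100_Length_10' 'madeupunlistedrecord' > sample1.txt
printf '%s\n' \
  '>GETID_11084_knownids_3/3_Confidence_0.600_Length_1451' 'sampletextforsample2' > sample2.txt
printf '%s\n' \
  '>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532' 'sampletextforsample3' \
  '>GETID_15916_knownids_10/11_Confidence_0.324_Length_1825' 'sample2textforsample3' > sample3.txt

# Same command as above, with the output captured in a variable.
result=$(awk -v patterns_file=list.txt '
BEGIN {
  while ((getline line < patterns_file) > 0)
    patterns_array[">" line] = 1
  close(patterns_file)
}
$0 in patterns_array { print; getline; print }
' sample[1-3].txt)

printf '%s\n' "$result"
```

With these inputs the eight lines listed above are printed, in the order the files are given on the command line.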

Edit

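To make multi-line records work, use a suitable record separator (RS). Based on the input provided, a good choice seems to be: a greater-than sign at the start of the file (^>), a newline followed by a greater-than sign (\n>), or a newline at the end of the file (\n$). Note that a regular-expression RS is an extension supported by gawk and mawk; POSIX leaves the behavior unspecified when RS contains more than one character.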

Something like this should work:

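awk -v patterns_file=patterns.txt '
BEGIN {
  # Load the wanted IDs (no leading ">", because RS consumes it).
  while ((getline line < patterns_file) > 0)
    patterns_array[line] = 1
  close(patterns_file)
  # A record ends at ">" at the start of a file, at a newline followed by ">",
  # or at the final newline.
  RS = "^>|\n>|\n$"
}
# $1 is the header without ">"; $0 is the header plus all sequence lines.
$1 in patterns_array { print ">" $0 }
' sample[1-3].txt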
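
A regular-expression RS is not available in every awk. As a hedged alternative (a swapped-in technique, not the method used above), the sketch below sets RS to the single character ">", which POSIX awk supports. It assumes that ">" never occurs inside a header or a sequence line; the IDs and file contents are hypothetical.

```shell
#!/bin/sh
# Hypothetical multi-line FASTA-like inputs.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' 'id_A' 'id_C' > patterns.txt
printf '%s\n' '>id_A' 'ACGT' 'TTGA' '>id_B' 'GGCC' > sample1.txt
printf '%s\n' '>id_C' 'CCAA' 'GGTT' 'AACC' > sample2.txt

result=$(awk -v patterns_file=patterns.txt '
BEGIN {
  while ((getline line < patterns_file) > 0)
    patterns_array[line] = 1
  close(patterns_file)
  RS = ">"   # each record is now: header, newline, sequence lines
}
# NF skips the empty record that precedes the first ">" of each file.
NF && ($1 in patterns_array) {
  sub(/\n$/, "")   # drop the trailing newline of the record
  print ">" $0
}
' sample1.txt sample2.txt)

printf '%s\n' "$result"
```

Here id_A and id_C are printed with all of their sequence lines, and id_B is skipped. Because newlines count as field separators once RS is no longer a newline, $1 is still the header on its own.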

Edit 2

To output each record only once, delete it from patterns_array after printing it:

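awk -v patterns_file=patterns.txt '
BEGIN {
  # Load the wanted IDs (no leading ">", because RS consumes it).
  while ((getline line < patterns_file) > 0)
    patterns_array[line] = 1
  close(patterns_file)
  RS = "^>|\n>|\n$"
}
# Print the first record found for each ID, then forget the ID so that
# later records with the same header no longer match.
$1 in patterns_array { print ">" $0; delete patterns_array[$1] }
' sample[1-3].txt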
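
The effect of delete is easiest to see on a reduced case. The following sketch uses made-up IDs (id1 to id3) and a hard-coded array named wanted instead of the pattern file; both are assumptions for illustration. Only the first occurrence of each wanted ID is printed, because its key is removed once it has matched.

```shell
#!/bin/sh
# Hypothetical input: five IDs, two of them repeated.
result=$(printf '%s\n' id1 id2 id1 id3 id2 | awk '
BEGIN { wanted["id1"] = 1; wanted["id2"] = 1 }
# After the first match the key is gone, so later duplicates fail the test.
$1 in wanted { print; delete wanted[$1] }
')

printf '%s\n' "$result"
```

This prints id1 and id2 once each; id3 was never wanted, and the repeated id1 and id2 are dropped.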

Answer 2

Here is a Perl solution. It works with any number of files and expects the first file to be the list. It also appends the file name to the FASTA header.

#!/usr/bin/perl
use strict;
use warnings;

## The first argument is the list of IDs; the rest are FASTA files
my $list = shift;
open(my $list_fh, '<', $list) or die "Cannot open $list: $!";
my %k;
while (<$list_fh>) {
    ## Remove trailing newline
    chomp;
    if (/(\d+?)_knownids_(.+?)_.+?(\d+)$/) {
        ## Join the captured parts into a key and save the full line in a hash
        my $pp = join("-", $1, $2, $3);
        $k{PAT}{$pp} = $_;
    }
}
close($list_fh);
## Read each input file
for my $f (@ARGV) {
    open(my $fh, '<', $f) or die "Cannot open $f: $!";
    ## Reset per file so a sequence is never credited to the previous file's header
    my $name = "foo";
    while (<$fh>) {
        ## Skip empty lines
        next if /^\s*$/;
        ## Is this a FASTA header?
        if (/^\s*>/) {
            ## Build the same key from the header
            if (/(\d+?)_knownids_(.+?)_.+?(\d+)$/) {
                $name = join("-", $1, $2, $3);
            }
            ## Headers that do not parse get a key that is never in the list
            else { $name = "foo" }
        }
        ## Collect the sequence, but only for IDs present in the list
        elsif (defined($k{PAT}{$name})) {
            $k{$f}{$name} .= $_;
        }
    }
    close($fh);
}
## For each unique pattern found in the list (hash order is not sorted)
foreach my $pat (keys(%{$k{PAT}})) {
    ## For each of the files passed as arguments
    foreach my $file (@ARGV) {
        ## If the pattern was found in that file, print
        if (defined($k{$file}{$pat})) {
            print ">$k{PAT}{$pat}_$file\n";
            print $k{$file}{$pat};
        }
    }
}

If you save the script as compare.pl, run it like this:

$ ./compare.pl list.txt sample1.txt sample2.txt sample3.txt sampleN.txt

The output is:

> GETID_11084_knownids_3/3_Confidence_0.600_Length_1451_sample2.txt
sampletextforsample2
> GETID_17049_knownids_1/2_Confidence_0.625_Length_2532_sample1.txt
sampletextforsample1
> GETID_17049_knownids_1/2_Confidence_0.625_Length_2532_sample3.txt
sampletextforsample3
> GETID_15916_knownids_10/11_Confidence_0.324_Length_1825_sample3.txt
sample2textforsample3
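
For comparison with the awk answers, the file-name tagging can also be sketched in awk using the built-in FILENAME variable. This is not a port of the Perl script: it assumes single-line sequences, prints in input order rather than grouped by pattern, and uses hypothetical IDs and file contents.

```shell
#!/bin/sh
# Hypothetical inputs: id_A appears in both files, id_X is not in the list.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' 'id_A' 'id_B' > list.txt
printf '%s\n' '>id_A' 'seqA1' '>id_X' 'seqX1' > sample1.txt
printf '%s\n' '>id_B' 'seqB2' '>id_A' 'seqA2' > sample2.txt

result=$(awk -v patterns_file=list.txt '
BEGIN {
  while ((getline line < patterns_file) > 0)
    wanted[">" line] = 1
  close(patterns_file)
}
# FILENAME holds the name of the file the current line was read from.
$0 in wanted { print $0 "_" FILENAME; getline; print }
' sample1.txt sample2.txt)

printf '%s\n' "$result"
```

Each matching header gets the name of its source file appended, so the two id_A records stay distinguishable in the combined output.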
