list.txt
GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
GETID_9248_knownids_6/10_Confidence_0.439_Length_2474
GETID_11084_knownids_3/3_Confidence_0.600_Length_1451
GETID_15916_knownids_10/11_Confidence_0.324_Length_1825
sample1.txt
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
sampletextforsample1
sampletextforsample1
sampletextforsample1
>GETID_18457_knownids_1/2_Confidence_0.625_Length_2532
sample2textforsample1
sample2textforsample1
sample2textforsample1
sample2textforsample1
sample2.txt
>GETID_11084_knownids_3/3_Confidence_0.600_Length_1451
sampletextforsample2
sampletextforsample2
>GETID_67838_knownids_3/3_Confidence_0.600_Length_1451
sample2textforsample2
sample2textforsample2
sample3.txt
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
sampletextforsample3
sampletextforsample3
sampletextforsample3
>GETID_15916_knownids_10/11_Confidence_0.324_Length_1825
sample2textforsample3
sample2textforsample3
output.txt
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
sampletextforsample1
sampletextforsample1
sampletextforsample1
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
sampletextforsample3
sampletextforsample3
sampletextforsample3
>GETID_11084_knownids_3/3_Confidence_0.600_Length_1451
sampletextforsample2
sampletextforsample2
>GETID_15916_knownids_10/11_Confidence_0.324_Length_1825
sample2textforsample3
sample2textforsample3
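I want to read each line of list.txt, capturing the values in braces (GETID_{17049}_knownids_{1/2}_Confidence_1.0_Length_{2532}), and compare it against sample1.txt, sample2.txt and sample3.txt, which contain multi-line records. Whenever a record matches an entry in list.txt, its content should be printed to output.txt. The output should contain only the records that exactly match list.txt. Any help with awk/sed/perl is appreciated.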
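Answer 1
If you slightly modify the solution provided by Gilles in this question (also by jw013), you get the effect you asked for, except that the output order follows the input sequence and therefore differs from the order listed in output.txt in the question: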
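awk -v patterns_file=list.txt '
BEGIN {
    # Load the wanted headers; "> 0" stops the loop if the file is unreadable
    while ((getline < patterns_file) > 0)
        patterns_array[">" $0] = 1
    close(patterns_file)
}
# Print a matching header and the single line that follows it
$0 in patterns_array { print; getline; print }
' sample[1-3].txt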
Output:
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
sampletextforsample1
>GETID_11084_knownids_3/3_Confidence_0.600_Length_1451
sampletextforsample2
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532
sampletextforsample3
>GETID_15916_knownids_10/11_Confidence_0.324_Length_1825
sample2textforsample3
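Edit
To make multi-line records work, use an appropriate record separator (RS). Based on the provided input, setting it to match a greater-than sign at the beginning of the file (^>), a newline followed by a greater-than sign (\n>), or a newline at the end of the file (\n$) seems like a good choice. Note that a multi-character RS is treated as a regular expression by gawk and mawk; POSIX leaves that behavior unspecified.
Something like this should work: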
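awk -v patterns_file=list.txt '
BEGIN {
    # Read the list with the default RS, then switch to the record-level RS
    while ((getline < patterns_file) > 0)
        patterns_array[$0] = 1
    close(patterns_file)
    RS = "^>|\n>|\n$"
}
# $1 is the header without ">", so restore it when printing
$1 in patterns_array { print ">" $0 }
' sample[1-3].txt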
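Where a regular-expression RS is not available, the same multi-line selection can be sketched with a flag that is switched on or off at each header line. This is only a minimal sketch: the function name print_listed_records is hypothetical, and it assumes every record starts with a ">" header line and that list.txt holds the headers without the leading ">".

```shell
# print_listed_records LIST FASTA...   (hypothetical helper name)
# Prints every record whose header, minus the leading ">", appears
# verbatim in LIST. Uses only features found in POSIX awk.
print_listed_records() {
    list=$1
    shift
    awk -v patterns_file="$list" '
    BEGIN {
        # Load the wanted headers, prefixed with ">" to match FASTA headers.
        while ((getline line < patterns_file) > 0)
            wanted[">" line] = 1
        close(patterns_file)
    }
    FNR == 1 { keep = 0 }            # never carry the flag across files
    /^>/     { keep = ($0 in wanted) }   # a header switches printing on or off
    keep                             # print header and sequence lines while on
    ' "$@"
}
```

It could be called as: print_listed_records list.txt sample[1-3].txt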
Edit 2
To output each record only once, delete it from patterns_array after printing it:
awk -v patterns_file=list.txt '
BEGIN {
    while ((getline < patterns_file) > 0)
        patterns_array[$0] = 1
    close(patterns_file)
    RS = "^>|\n>|\n$"
}
# Deleting the entry after printing prevents later duplicates from matching
$1 in patterns_array { print ">" $0; delete patterns_array[$1] }
' sample[1-3].txt
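The commands in this answer print records in input order. If the order of list.txt matters, as in the output.txt shown in the question, one possible approach is to collect the matching records in memory and print them at the end in list order. The following is a sketch under that assumption; the function name print_in_list_order is hypothetical, and all matching records must fit in memory.

```shell
# print_in_list_order LIST FASTA...   (hypothetical helper name)
# Prints matching multi-line records grouped in the order of LIST.
print_in_list_order() {
    list=$1
    shift
    awk -v patterns_file="$list" '
    BEGIN {
        # Remember the wanted headers and the order in which they are listed.
        while ((getline line < patterns_file) > 0) {
            order[++n] = ">" line
            wanted[">" line] = 1
        }
        close(patterns_file)
    }
    FNR == 1 { current = "" }        # start each file outside any record
    /^>/     { current = ($0 in wanted) ? $0 : "" }
    # Append header and sequence lines of a wanted record to its bucket.
    current != "" { collected[current] = collected[current] $0 "\n" }
    END {
        for (i = 1; i <= n; i++) {
            printf "%s", collected[order[i]]
            delete collected[order[i]]   # a header listed twice prints once
        }
    }
    ' "$@"
}
```

It could be called as: print_in_list_order list.txt sample[1-3].txt > output.txt. With the sample files from the question, this should give the records in the order shown in output.txt.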
Answer 2
Here is a Perl solution. It works with any number of files and expects the first file to be the list. It also appends the file name to the FASTA header.
#!/usr/bin/perl -w
use strict;
## The first argument is the list; the remaining arguments are FASTA files
my $list = shift;
open(A, '<', $list) or die "Cannot open $list: $!";
my %k;
while (<A>) {
    ## Remove trailing newline
    chomp;
    if (/(\d+?)_knownids_(.+?)_.+?(\d+)$/) {
        ## Concatenate the captured ID, knownids and length and save in a hash
        my $pp = join("-", $1, $2, $3);
        $k{PAT}{$pp} = $_;
    }
}
close(A);
## Read each input file
for my $f (@ARGV) {
    open(F, '<', $f) or die "Cannot open $f: $!";
    ## Start with an empty key so lines before the first header are ignored
    my $name = "";
    while (<F>) {
        ## Skip empty lines
        next if /^\s*$/;
        ## Is this a FASTA header?
        if (/^\s*>/) {
            ## Build the same key as for the list entries
            if (/(\d+?)_knownids_(.+?)_.+?(\d+)$/) {
                $name = join("-", $1, $2, $3);
            }
            ## Skip the sequences we are not interested in
            else { $name = "foo" }
        }
        ## Collect the sequence
        else {
            if (defined($k{PAT}{$name})) {
                $k{$f}{$name} .= $_;
            }
        }
    }
    close(F);
}
## For each unique pattern found in list.txt
foreach my $pat (keys(%{$k{PAT}})) {
    ## For each of the files passed as arguments
    foreach my $file (@ARGV) {
        ## If the pattern was found in that file, print
        if (defined($k{$file}{$pat})) {
            print ">$k{PAT}{$pat}_$file\n";
            print "$k{$file}{$pat}";
        }
    }
}
If the script is saved as compare.pl, run it as follows:
$ ./compare.pl list.txt sample1.txt sample2.txt sample3.txt sampleN.txt
The output is (Perl does not guarantee hash key order, so the records may appear in a different order):
>GETID_11084_knownids_3/3_Confidence_0.600_Length_1451_sample2.txt
sampletextforsample2
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532_sample1.txt
sampletextforsample1
>GETID_17049_knownids_1/2_Confidence_0.625_Length_2532_sample3.txt
sampletextforsample3
>GETID_15916_knownids_10/11_Confidence_0.324_Length_1825_sample3.txt
sample2textforsample3
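For comparison, the key-based matching used by the Perl script (ID, knownids and Length, ignoring the Confidence value) can also be sketched in awk by splitting each header on underscores. This is a rough sketch, not a drop-in replacement: the function name print_by_key is hypothetical, it assumes every header has exactly the GETID_<id>_knownids_<n/m>_Confidence_<c>_Length_<len> layout, and it assumes list.txt is non-empty and is passed as the first file.

```shell
# print_by_key LIST FASTA...   (hypothetical helper name)
# Matches records on ID, knownids and Length only, and appends the
# file name to each printed header.
print_by_key() {
    awk -F'_' '
    # First file (the list): fields 2, 4 and 8 are ID, knownids and Length.
    NR == FNR { keys[$2, $4, $8] = 1; next }
    FNR == 1  { keep = 0 }           # start each file outside any record
    /^>/ {
        keep = (($2, $4, $8) in keys)
        if (keep) { print $0 "_" FILENAME; next }
    }
    keep                             # sequence lines of a wanted record
    ' "$@"
}
```

It could be called as: print_by_key list.txt sample1.txt sample2.txt sample3.txt. Unlike the Perl script, this prints records in input order.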