Merging specific lines from 3 files by a specific pattern

Answer 1

With awk you can run:

awk '/^ko[^:]/{fn=$1;next};/./{id=fn$1;if (!(seen[id]++)){print > fn}}' file[123]

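On each header line it saves the identifier ko***** as fn. On each sub-header line it saves fn$1¹ as id, the index into the array seen, and if this is the first time that id has been seen, it writes the line to the file named by fn.

¹ You could also use fn$0.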
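
To make the one-liner easier to follow, here is the same logic spread over several lines and run against two tiny made-up files. The file contents, the temporary directory, and the assumed input layout (a header line starting with the ko identifier, followed by ko:K... lines) are assumptions for illustration only; the awk program itself is the one from the answer.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (layout is an assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'ko00980 header A' 'ko:K00001 alpha' 'ko:K00002 beta' \
              'ko00982 header B' 'ko:K00001 alpha' > file1
printf '%s\n' 'ko00980 header A' 'ko:K00002 beta' 'ko:K00003 gamma' > file2

awk '
    # Header line: "ko" followed by anything but a colon. Remember its first field.
    /^ko[^:]/ { fn = $1; next }
    # Any other non-empty line: key on header plus first field,
    # and write the line to the file named fn only the first time the key appears.
    /./ {
        id = fn $1
        if (!(seen[id]++))
            print > fn
    }
' file1 file2

cat ko00980
```

With this sample input, ko00980 should end up holding the three distinct ko:K lines collected from both files, and ko00982 a single line.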

Answer 2

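There may be some magic super-mashup command, but sometimes "linear" is the easiest to understand and maintain.

So we just need to keep track of the filename based on the header line and append the data to it. Then we can run sort -u over the results to get unique lines: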

#!/bin/bash

# Clean out old results from previous runs
/bin/rm -f ko*

# Quote "$@" and the variables so names containing spaces survive
for file in "$@"
do
  filename=UNKNOWN
  echo "Processing $file"
  while read -r line
  do
    case $line in
      ko:*) printf "%s\n" "$line" >> "$filename" ;;
       ko*) filename=${line%% *} ; echo "Switching to $filename" ;;
        "") # Do nothing
            ;;
         *) echo "Ignoring unknown line: $line" ;;
    esac
  done < "$file"
done

for file in ko*
do
  echo "Making unique: $file"
  sort -u -o "$file" "$file"
done

We can run it with the three source files:

$ ./pattern_split file1 file2 file3
Processing file1
Switching to ko00980
Switching to ko00982
Switching to ko00983
Processing file2
Switching to ko00980
Switching to ko00982
Switching to ko00983
Processing file3
Switching to ko00980
Switching to ko00982
Switching to ko00983
Making unique: ko00980
Making unique: ko00982
Making unique: ko00983

We can see that it has created three files, each containing only unique lines. Looking at the first one:

$ cat ko00980
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
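
One way to double-check that a result file really contains no repeated lines is to look for duplicates with uniq -d. This is only a sketch on a made-up file; the contents and the scratch directory are assumptions, not data from the run above.

```shell
#!/bin/sh
# Hypothetical result file with one deliberately repeated line.
workdir=$(mktemp -d)
printf '%s\n' 'ko:K00001 alpha' 'ko:K00002 beta' 'ko:K00001 alpha' > "$workdir/ko00980"

# Before deduplication: uniq -d lists each line that occurs more than once.
sort "$workdir/ko00980" | uniq -d

# After sort -u the same check prints nothing.
sort -u -o "$workdir/ko00980" "$workdir/ko00980"
sort "$workdir/ko00980" | uniq -d
```

The first check reports the repeated line; after sort -u has rewritten the file in place, the second check is silent.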

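Now, this solution should be hardened against rogue data in the data files (e.g. what if a header named a file such as ko123/456? Since the filename is taken straight from the header line, that would break). But this is an outline of how to solve the problem.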
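
As one possible starting point for that hardening, the header could be validated before it is used as a filename. The sketch below assumes that valid identifiers are "ko" followed by digits only (as in ko00980); is_safe_name is a hypothetical helper, not part of the script above.

```shell
#!/bin/sh
# is_safe_name: hypothetical helper that succeeds only for "ko" + digits.
is_safe_name() {
  case $1 in
    ko*[!0-9]*) return 1 ;;  # something other than a digit after "ko"
    ko[0-9]*)   return 0 ;;  # "ko" followed by digits only
    *)          return 1 ;;  # anything else, including a bare "ko"
  esac
}

for name in ko00980 ko123/456 ko:K00001
do
  if is_safe_name "$name"
  then echo "accept: $name"
  else echo "reject: $name"
  fi
done
```

In the script, the ko*) branch could call such a check and fall back to ignoring the line when it fails, so that a header like ko123/456 never reaches the redirection.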

Answer 3

So, you want to move lines from the files into separate files based on the header?

I think something like this would do it:

#!/usr/bin/env perl
use strict;
use warnings 'all'; 

#hash of output filehandles. 
my %output_files; 

#detect dupes
my %seen; 

my $ko_num = 'NULL'; 

#<> is the 'magic' filehandle. You can either use it to iterate STDIN
#or take a list of file names on the command line (just like sed/grep etc.)
while ( my $line = <> ) { 
   #see if the line starts with 'ko':
   if ( $line =~ m/(^ko\d+)/) {  
       $ko_num = $1;
       #open a new file - for overwriting (so we only do this once)
       open ( $output_files{$ko_num}, '>', $ko_num ) or die $! unless $output_files{$ko_num}; 
       #skip printing - could write a header here instead. 
       next;
   }
   #look for a 'K' number. 
   if ( my ($K_id) = $line =~ m/ko:(K\d+)/ ) {
       #skip it if we've already seen this combination of 'ko' number 
       #and k number.    
       next if $seen{$ko_num}{$K_id}++; 
       #print the output to this particular output file. 
       print {$output_files{$ko_num}} $line; 
   }
}
#close the filehandles. 
close ( $_ ) for values %output_files;

This way, you can run "myscript.pl file1.txt file2.txt file3.txt" and it should do the right thing in a scalable way. It doesn't even care whether the input is separate files or a single stream.
