I have 3 files like these:
File 1:
ko00980 Metabolism of xenobiotics by cytochrome P450 (5)
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00982 Drug metabolism - cytochrome P450 (5)
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00485 FMO; dimethylaniline monooxygenase (N-oxide forming) [EC:1.14.13.8]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00983 Drug metabolism - other enzymes (4)
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00857 tdk; thymidine kinase [EC:2.7.1.21]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
File 2:
ko00980 Metabolism of xenobiotics by cytochrome P450 (6)
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00982 Drug metabolism - cytochrome P450 (4)
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00983 Drug metabolism - other enzymes (8)
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00106 XDH; xanthine dehydrogenase/oxidase [EC:1.17.1.4 1.17.3.2]
ko:K00760 hprT; hypoxanthine phosphoribosyltransferase [EC:2.4.2.8]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ko:K01431 UPB1; beta-ureidopropionase [EC:3.5.1.6]
ko:K01464 DPYS; dihydropyrimidinase [EC:3.5.2.2]
ko:K01519 ITPA; inosine triphosphate pyrophosphatase [EC:3.6.1.19]
ko:K13421 UMPS; uridine monophosphate synthetase [EC:2.4.2.10 4.1.1.23]
File 3:
ko00980 Metabolism of xenobiotics by cytochrome P450 (7)
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00982 Drug metabolism - cytochrome P450 (6)
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00485 FMO; dimethylaniline monooxygenase (N-oxide forming) [EC:1.14.13.8]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko00983 Drug metabolism - other enzymes (8)
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00207 DPYD; dihydropyrimidine dehydrogenase (NADP+) [EC:1.3.1.2]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00857 tdk; thymidine kinase [EC:2.7.1.21]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ko:K01431 UPB1; beta-ureidopropionase [EC:3.5.1.6]
ko:K01489 cdd; cytidine deaminase [EC:3.5.4.5]
ko:K01951 guaA; GMP synthase (glutamine-hydrolysing) [EC:6.3.5.2]
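Every file has heading lines that start with ko*****, followed by the pathway name and, in parentheses, the number of sub-heading lines, for example:
ko00980 Metabolism of xenobiotics by cytochrome P450 (5)
Sub-heading lines start with ko:K*****.
I want to merge the sub-heading lines under each heading across the 3 files and remove duplicates (uniq). I want a result like this: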
ko00980:
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko00982:
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00485 FMO; dimethylaniline monooxygenase (N-oxide forming) [EC:1.14.13.8]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00207 DPYD; dihydropyrimidine dehydrogenase (NADP+) [EC:1.3.1.2]
ko:K00857 tdk; thymidine kinase [EC:2.7.1.21]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ko:K01431 UPB1; beta-ureidopropionase [EC:3.5.1.6]
ko:K01489 cdd; cytidine deaminase [EC:3.5.4.5]
ko:K01951 guaA; GMP synthase (glutamine-hydrolysing) [EC:6.3.5.2]
ko00983:
ko:K00088 guaB; IMP dehydrogenase [EC:1.1.1.205]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00857 tdk; thymidine kinase [EC:2.7.1.21]
ko:K00876 udk; uridine kinase [EC:2.7.1.48]
ko:K00106 XDH; xanthine dehydrogenase/oxidase [EC:1.17.1.4 1.17.3.2]
ko:K00760 hprT; hypoxanthine phosphoribosyltransferase [EC:2.4.2.8]
ko:K01431 UPB1; beta-ureidopropionase [EC:3.5.1.6]
ko:K01464 DPYS; dihydropyrimidinase [EC:3.5.2.2]
ko:K01519 ITPA; inosine triphosphate pyrophosphatase [EC:3.6.1.19]
ko:K13421 UMPS; uridine monophosphate synthetase [EC:2.4.2.10 4.1.1.23]
ko:K00207 DPYD; dihydropyrimidine dehydrogenase (NADP+) [EC:1.3.1.2]
ko:K01489 cdd; cytidine deaminase [EC:3.5.4.5]
ko:K01951 guaA; GMP synthase (glutamine-hydrolysing) [EC:6.3.5.2]
Answer 1
With awk, you could run:
awk '/^ko[^:]/{fn=$1;next};/./{id=fn$1;if (!(seen[id]++)){print > fn}}' file[123]
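On each heading line, it saves the ko***** identifier as fn. On sub-heading lines, it uses fn$1 [1] as the index id into the array seen, and if this is the first time id has been seen, it writes the line to the file named fn. The result is one output file per heading (ko00980, ko00982, ko00983), with lines kept in the order they were first encountered.
[1] You could also use fn$0, which keys on the whole line rather than only the ko:K***** field.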
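The one-liner above writes each heading's lines to its own file. If a single grouped listing like the one shown in the question is wanted instead, a variation could look like the sketch below. The function name merge_ko is made up for illustration, and the sketch assumes the merged data fits comfortably in memory; it is not part of the original answer.

```shell
#!/bin/sh
# Sketch only: merge_ko is a hypothetical helper name.
# Prints "koNNNNN:" followed by the unique sub-heading lines for that
# heading, in first-seen order, for all files given as arguments.
merge_ko() {
    awk '
        /^ko[^:]/ {                  # heading line such as "ko00980 ... (5)"
            fn = $1
            if (!(fn in lines)) {    # first time this heading appears
                order[++n] = fn
                lines[fn] = ""
            }
            next
        }
        /./ {                        # any other non-empty line
            if (!seen[fn, $1]++)     # keep only the first occurrence per heading
                lines[fn] = lines[fn] $0 "\n"
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s:\n%s", order[i], lines[order[i]]
        }
    ' "$@"
}
```

It would be called as merge_ko file1 file2 file3, with the report going to standard output.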
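Answer 2
There may be some magical super mash-up command, but sometimes "linear" is the easiest to understand and maintain.
So we just need to keep track of the output filename based on the heading line and append the data to it. Then we can run sort -u on the results to get the unique lines: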
#!/bin/bash
# Clean out old results from previous runs
/bin/rm -f ko*
# Quote "$@" and "$file" so file names containing spaces survive
for file in "$@"
do
    filename=UNKNOWN
    echo "Processing $file"
    while read -r line
    do
        case $line in
            ko:*) printf "%s\n" "$line" >> "$filename" ;;
            ko*)  filename=${line%% *} ; echo "Switching to $filename" ;;
            "")   # Do nothing
                  ;;
            *)    echo "Ignoring unknown line: $line" ;;
        esac
    done < "$file"
done
# Sort each output file in place, dropping duplicate lines
for file in ko*
do
    echo "Making unique: $file"
    sort -u -o "$file" "$file"
done
We can run it with the three source files:
$ ./pattern_split file1 file2 file3
Processing file1
Switching to ko00980
Switching to ko00982
Switching to ko00983
Processing file2
Switching to ko00980
Switching to ko00982
Switching to ko00983
Processing file3
Switching to ko00980
Switching to ko00982
Switching to ko00983
Making unique: ko00980
Making unique: ko00982
Making unique: ko00983
We can see it created three files, each containing only unique lines. Looking at the first one:
$ cat ko00980
ko:K00001 E1.1.1.1; alcohol dehydrogenase [EC:1.1.1.1]
ko:K00079 CBR1; carbonyl reductase 1 [EC:1.1.1.184 1.1.1.189 1.1.1.197]
ko:K00121 frmA; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ko:K00699 UGT; glucuronosyltransferase [EC:2.4.1.17]
ko:K00799 GST; glutathione S-transferase [EC:2.5.1.18]
ko:K07408 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 [EC:1.14.14.1]
ko:K07409 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 [EC:1.14.14.1]
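Now, this solution should be hardened against rogue data in the data files (e.g. what if there's a heading ko123/456? That would break, because the heading is used directly as a file name). But this is an outline of how to solve the problem.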
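As a rough sketch of that hardening, one option is to accept a heading as a file name only if it is ko followed purely by digits. That rule is an assumption based on the sample data, not something the question states. The function names is_safe_ko_name and split_ko, and the separate output directory argument, are hypothetical additions for illustration.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Succeed only for "ko" followed by one or more digits and nothing else.
is_safe_ko_name() {
    case $1 in
        ko*[!0-9]*) return 1 ;;   # a non-digit somewhere after the prefix
        ko[0-9]*)   return 0 ;;
        *)          return 1 ;;
    esac
}

# Usage: split_ko OUTDIR FILE...
# Appends sub-heading lines to OUTDIR/<heading>, then de-duplicates each file.
split_ko() {
    outdir=$1
    shift
    for file in "$@"; do
        target=
        while read -r line || [ -n "$line" ]; do
            case $line in
                ko:*)
                    # Ignore data lines until a valid heading has been seen
                    if [ -n "$target" ]; then
                        printf '%s\n' "$line" >> "$outdir/$target"
                    fi ;;
                ko*)
                    name=${line%% *}
                    if is_safe_ko_name "$name"; then
                        target=$name
                    else
                        target=
                        echo "Skipping suspicious heading: $name" >&2
                    fi ;;
            esac
        done < "$file"
    done
    for f in "$outdir"/ko*; do
        if [ -f "$f" ]; then
            sort -u -o "$f" "$f"
        fi
    done
}
```

Writing into a dedicated directory also avoids the rm -f ko* clean-up step touching unrelated files in the current directory; whether that trade-off is worth it depends on how the script is used.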
Answer 3
So, you want to move lines from the files into separate files based on the header?
I think something like this would do it:
#!/usr/bin/env perl
use strict;
use warnings 'all';

#hash of output filehandles.
my %output_files;
#detect dupes
my %seen;
my $ko_num = 'NULL';

#<> is the 'magic' filehandle. You can either use it to iterate STDIN
#or take a list of file names on the command line (just like sed/grep etc.)
while ( my $line = <> ) {
    #see if the line starts with 'ko':
    if ( $line =~ m/(^ko\d+)/ ) {
        $ko_num = $1;
        #open a new file - for overwriting (so we only do this once)
        open ( $output_files{$ko_num}, '>', $ko_num ) or die $! unless $output_files{$ko_num};
        #skip printing - could write a header here instead.
        next;
    }
    #look for a 'K' number.
    if ( my ($K_id) = $line =~ m/ko:(K\d+)/ ) {
        #skip it if we've already seen this combination of 'ko' number
        #and k number.
        next if $seen{$ko_num}{$K_id}++;
        #print the output to this particular output file.
        print {$output_files{$ko_num}} $line;
    }
}

#close the filehandles.
close ( $_ ) for values %output_files;
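So this way, you can run "myscript.pl file1.txt file2.txt file3.txt" and it should do the right thing in a scalable way. It doesn't even care whether the input arrives as separate files or as a single stream.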