我想知道是否有 sed 或 awk 命令来删除第 1 列中“Query_”标题之间的所有行(如果每个标题之间的行数小于 5)。以下是从大文件 ~1Gb 中摘录的内容。我尝试了多种不同的方法,但都失败了。
Query_10 26 KMGKWYPTEDAPAKKRKTQSWRQNKSKLRGGIVPGQVLIILAGKHKGKRVVYLTQLSTGE 205
XP_010718494 131 KMPRYYPTEDVPRKSHGKKPFSQHKRRLRASITPGTVLILLTGRHRGKRVVFLKQLGTGL 192
NP_001291831 111 KMPRYYPTEDVPRKSHGKKPFSQHVRKLRASITPGTILIILTGRHRGKRVVFLKQLSSGL 172
Query_10 206 IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK 385
XP_010718494 193 LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT 255
NP_001291831 173 LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_012359817 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_009246541 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_003225150 155 LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT 217
Query_13 31 MEEQKEKGLSNPEVV*KYRQCSEIVNQVLSTVVSSCVPGADVASICTNGDFLIEDGLRNI 210
XP_002947167 7 IQGEQEPNLSVPEVVTKYKAAADICNRALQAVIDGCKDGSKIVDLCRTGDNFITKECGNI 66
XP_004993505 1 MELDRQSKVVDADALSKYRAAAAIANDCVQQLVANCIAGADVYTLAVEADTYIEQKLKEL 60
XP_006961234 1 MSETKEYSLNNPDTLTKYKTAAQISEKVLAAVSDLCVPGAKIVDICQQGDKLIEEELAKV 62
XP_008089018 1 MSEETDYTLNNPDTLTKYKTAAQISEKVLAAVAELVVPGEKIVTICEKGDKLIEEELAKV 60
Query_13 211 EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI 390
XP_004029906 65 YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV 131
XP_004031065 64 FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV 130
XP_003223249 65 KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF 125
XP_002947167 67 YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV 126
XP_003880798 73 ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV 135
XP_004348044 69 KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI 129
XP_003284133 69 HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI 128
NP_001241588 65 YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH 126
XP_009039553 76 YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM 141
期望的结果如下:
Query_10 206 IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK 385
XP_010718494 193 LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT 255
NP_001291831 173 LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_012359817 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_009246541 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_003225150 155 LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT 217
Query_13 211 EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI 390
XP_004029906 65 YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV 131
XP_004031065 64 FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV 130
XP_003223249 65 KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF 125
XP_002947167 67 YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV 126
XP_003880798 73 ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV 135
XP_004348044 69 KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI 129
XP_003284133 69 HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI 128
NP_001241588 65 YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH 126
XP_009039553 76 YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM 141
我尝试过的Python脚本:
lines = [line.rstrip() for line in open('infile.txt')]
for line in lines:
data = line.split()
sequence = data[2]
if data[0].startswith("Query_"):
hits = [i for i,c in enumerate(sequence) if c == <50]
continue
else:
print(list(sequence[plus50] for plus50 in hits))
答案1
和sed:
sed '
/^Query_/{ #starts loop when meet patten
:a
$!{
N
/\nQuery_/!ba #untill meet next pattern
}
/\(\n.*\)\{6,\}/{ #checks how many lines in block
$b #for end of file
s/\nQuery_/\n&/ #marks lines to print
}
}
/\n\n/P #prints marked lines
D #remove 1st line in block, go to start
'
其他脚本形式awk:
awk '
/^Query/{c=0;lines=$0;next}
++c<5{lines=lines "\n" $0;next}
c==5{print lines}
1 #short for {print}
'
答案2
和awk
:
awk '{if($1~/^Query_/){c=0;delete a;a[0]=$0}else{c++}
if(c<5){a[c]=$0}
if(c==5){for(i in a){print a[i]}}
if(c>5){print}}' file
- 在第一行中,
$1
检查第一个字段是否以 开头Query_
。如果是,则计数器变量c
设置为0
。数组a
将被删除,数组的第一个元素将设置为该行的值。否则计数器变量会递增。 - 在第二行中,逐行填充数组,直到其中又包含 5 行。
- 第三行:如果还有 5 行,则循环遍历数组并逐行打印其元素。
- 第四行:从现在开始的所有行都可以打印。
输出示例数据:
Query_10 206 IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK 385
XP_010718494 193 LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT 255
NP_001291831 173 LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_012359817 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_009246541 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
Query_13 211 EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI 390
XP_004029906 65 YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV 131
XP_004031065 64 FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV 130
XP_003223249 65 KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF 125
XP_002947167 67 YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV 126
XP_004348044 69 KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI 129
XP_003284133 69 HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI 128
NP_001241588 65 YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH 126
XP_009039553 76 YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM 141
答案3
和GNU awk
$ awk -F'\n' -v RS='Query_' -v ORS= 'NF>6{print RS $0}' ip.txt
Query_10 206 IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK 385
XP_010718494 193 LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT 255
NP_001291831 173 LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_012359817 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_009246541 173 LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT 235
XP_003225150 155 LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT 217
Query_13 211 EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI 390
XP_004029906 65 YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV 131
XP_004031065 64 FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV 130
XP_003223249 65 KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF 125
XP_002947167 67 YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV 126
XP_003880798 73 ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV 135
XP_004348044 69 KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI 129
XP_003284133 69 HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI 128
NP_001241588 65 YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH 126
XP_009039553 76 YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM 141
-v RS='Query_'
设置Query_
为输入记录分隔符-v ORS=
设置空字符串作为输出记录分隔符-F'\n'
设置换行符作为输入字段分隔符NF>6
问题是保留具有 5 个条目的块。包括标题在内共有 6 行,这意味着 6 个换行符。分割这个最小必需的字符串将给出 7 个字段 - 因此条件NF>6
print RS $0
满足条件时打印RS并输入记录