Deleting lines between header lines in a text file with awk or sed


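Is there a sed or awk command that deletes all lines between the "Query_" headers in column 1 (header line included, as in the desired result shown further down) whenever a header is followed by fewer than 5 lines? Below is an excerpt from a large file (~1 GB). I have tried several different approaches, but all of them failed.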

Query_10      26   KMGKWYPTEDAPAKKRKTQSWRQNKSKLRGGIVPGQVLIILAGKHKGKRVVYLTQLSTGE  205
XP_010718494  131  KMPRYYPTEDVPRKSHGKKPFSQHKRRLRASITPGTVLILLTGRHRGKRVVFLKQLGTGL  192
NP_001291831  111  KMPRYYPTEDVPRKSHGKKPFSQHVRKLRASITPGTILIILTGRHRGKRVVFLKQLSSGL  172
Query_10      206  IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK  385
XP_010718494  193  LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT  255
NP_001291831  173  LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_012359817  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_009246541  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_003225150  155  LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT  217
Query_13      31    MEEQKEKGLSNPEVV*KYRQCSEIVNQVLSTVVSSCVPGADVASICTNGDFLIEDGLRNI  210
XP_002947167  7     IQGEQEPNLSVPEVVTKYKAAADICNRALQAVIDGCKDGSKIVDLCRTGDNFITKECGNI  66
XP_004993505  1     MELDRQSKVVDADALSKYRAAAAIANDCVQQLVANCIAGADVYTLAVEADTYIEQKLKEL  60
XP_006961234  1     MSETKEYSLNNPDTLTKYKTAAQISEKVLAAVSDLCVPGAKIVDICQQGDKLIEEELAKV  62
XP_008089018  1     MSEETDYTLNNPDTLTKYKTAAQISEKVLAAVAELVVPGEKIVTICEKGDKLIEEELAKV  60
Query_13      211   EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI  390
XP_004029906  65    YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV  131
XP_004031065  64    FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV  130
XP_003223249  65    KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF  125
XP_002947167  67    YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV  126
XP_003880798  73    ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV  135
XP_004348044  69    KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI  129
XP_003284133  69    HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI  128
NP_001241588  65    YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH  126
XP_009039553  76    YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM  141

The desired result is:

Query_10      206  IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK  385
XP_010718494  193  LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT  255
NP_001291831  173  LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_012359817  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_009246541  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_003225150  155  LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT  217
Query_13      211   EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI  390
XP_004029906  65    YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV  131
XP_004031065  64    FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV  130
XP_003223249  65    KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF  125
XP_002947167  67    YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV  126
XP_003880798  73    ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV  135
XP_004348044  69    KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI  129
XP_003284133  69    HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI  128
NP_001241588  65    YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH  126
XP_009039553  76    YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM  141

The Python script I tried:

lines = [line.rstrip() for line in open('infile.txt')]
for line in lines: 
    data = line.split()
    sequence = data[2]
    if data[0].startswith("Query_"):
        hits = [i for i,c in enumerate(sequence) if c == <50]
        continue
    else:
        print(list(sequence[plus50] for plus50 in hits))

Answer 1

sed:

sed '
    /^Query_/{                # start a loop when a header line is met
        :a
        $!{
            N
            /\nQuery_/!ba     # keep appending lines until the next header
        }
        /\(\n.*\)\{6,\}/{     # check that the buffer holds at least 6 newlines
            $b                # at end of file: print the buffer as is
            s/\nQuery_/\n&/   # mark the block for printing with an empty line
        }
    }
    /\n\n/P                   # print the first line of a marked block
    D                         # delete the first line and restart the cycle
    '

Another variant, as an awk script:

awk '
    /^Query/{c=0;lines=$0;next}
    ++c<5{lines=lines "\n" $0;next}
    c==5{print lines}
    1                         #short for {print}
    '
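
For readers more at home in Python (the language of the attempt in the question), the counter-and-buffer idea of the awk script above can be sketched roughly as follows. This is only an illustration: the function name `filter_blocks` and the parameter `min_hits` are made up here, and the sketch assumes that header lines start with `Query_` at the beginning of the line.

```python
from typing import Iterable, Iterator, List


def filter_blocks(lines: Iterable[str], min_hits: int = 5) -> Iterator[str]:
    """Yield only the blocks whose header is followed by at least min_hits lines.

    Assumption: a block starts at a line beginning with "Query_".
    """
    pending: List[str] = []  # header plus lines not yet known to be worth keeping
    count = 0                # lines seen since the last header
    for line in lines:
        if line.startswith("Query_"):
            # New header: forget any short block that was still pending.
            pending = [line]
            count = 0
            continue
        count += 1
        if count < min_hits:
            pending.append(line)   # still undecided, keep buffering
        elif count == min_hits:
            yield from pending     # block is long enough: flush the buffer
            pending = []
            yield line
        else:
            yield line             # block already accepted, pass lines through
```

Because it is a generator, a large file can be streamed instead of loaded whole, for example with `sys.stdout.writelines(filter_blocks(sys.stdin))` after `import sys`.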

Answer 2

awk

awk '{if($1~/^Query_/){c=0;delete a;a[0]=$0}else{c++}
    if(c<=5){a[c]=$0}
    if(c==5){for(i=0;i<=5;i++){print a[i]}}
    if(c>5){print}}' file

  • First line: checks whether the first field $1 starts with Query_. If it does, the counter c is reset to 0, the array a is deleted, and the header line is stored as its first element. Otherwise the counter is incremented.
  • Second line: the array is filled line by line until it holds the header plus 5 more lines.
  • Third line: once the fifth line after the header arrives, the array is printed in index order.
  • Fourth line: every further line of the block is printed directly.

Output for the sample data:

Query_10      206  IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK  385
XP_010718494  193  LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT  255
NP_001291831  173  LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_012359817  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_009246541  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_003225150  155  LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT  217
Query_13      211   EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI  390
XP_004029906  65    YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV  131
XP_004031065  64    FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV  130
XP_003223249  65    KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF  125
XP_002947167  67    YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV  126
XP_003880798  73    ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV  135
XP_004348044  69    KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI  129
XP_003284133  69    HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI  128
NP_001241588  65    YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH  126
XP_009039553  76    YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM  141

Answer 3

GNU awk

$ awk -F'\n' -v RS='Query_' -v ORS= 'NF>6{print RS $0}' ip.txt
Query_10      206  IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK  385
XP_010718494  193  LLVTGPLVVNRVPLRRAHQKFVIATSTKVDISGVKIHLTDAYFKKKKLRKPKQEGEIFDT  255
NP_001291831  173  LLVTGPLSLNRVPLRRTHQKFVIATSTKIDISSVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_012359817  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_009246541  173  LLVTGPLVLNRVPLRRTHQKFVIATSTKIDISNVKIHLTDAYFKKKKP--RHQEGEIFDT  235
XP_003225150  155  LLVTGPLAINRVPLRRAHQKFVIATSTKVDISSVKLHLNDVYFKKKKLRKPKQEGEIFDT  217
Query_13      211   EPDTNIEKGIAIPVCLNINNICSYYSPLPDASTTLQEGDLVKVDLGAHFDGYIVSAASSI  390
XP_004029906  65    YTKKKVEKGPAFPTCISINEICGHYSPLLSDSSLLKEGDVVKIDLGTHIDGFIALGAHTV  131
XP_004031065  64    FTKKKLQKGPAFPTCISVNEICGHYSPLISDSSLLKEGDVVKIDLGAQIDGFIALAAHTV  130
XP_003223249  65    KKEKDMKKGIAFPTSISVNNCVCHFSPLKDQDYILKEGDLVKIDLGVHVDGFISNVAHSF  125
XP_002947167  67    YKGKQIEKGVAFPTCVSVNSVVGHFSPNADDTSALKAGDVVKFDMGCHIDGFIATQATTV  126
XP_003880798  73    ENGKKMEKGIAFPTCISINEICGHFSPVEENAETLTEGDVVKIDMGCHIDGYISVVAYTV  135
XP_004348044  69    KANKKVKKGIAFPTCVSLNSTVCHQSPLSDAAITLQAGDVAKVDLGVHVDGLIAVVAHTI  129
XP_003284133  69    HSKKKIEKGIAFPTCISVNNCVGHYSPLKATSRSLVDGDIVKIDLGVHINGFIAVGAHTI  128
NP_001241588  65    YKNVKIERGVAFPTCLSINNVVCHFSPLASDEAVLEEGDILKIDMACHIDGFIAVVAHTH  126
XP_009039553  76    YQKKIIDKGVAFPTCVSVNECVCHNSPLESDTTSLSEGDLVKLDVGCYVDGYIAVAAHTM  141
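  • -v RS='Query_' sets Query_ as the input record separator (a multi-character RS is not specified by POSIX, hence GNU awk)
  • -v ORS= sets the output record separator to the empty string
  • -F'\n' sets newline as the input field separator
  • NF>6: the question asks to keep blocks with at least 5 entries. Counting the header, that is 6 lines, which means 6 newline characters. Splitting this minimal string on newlines gives 7 fields, hence the condition NF>6
  • print RS $0 prints RS followed by the input record when the condition holds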
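
The same "one record per block" view can be sketched in Python, in case awk is not an option. This is a rough illustration rather than a drop-in equivalent: the names `split_blocks` and `keep_large_blocks` are invented here, and unlike the awk one-liner (which splits on every occurrence of Query_) the sketch only starts a new block where a line begins with `Query_`. Any lines before the first header are dropped.

```python
from typing import Iterable, Iterator, List


def split_blocks(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into blocks; a new block starts at each line beginning with "Query_"."""
    block: List[str] = []
    for line in lines:
        if line.startswith("Query_") and block:
            yield block
            block = []
        block.append(line)
    if block:
        yield block


def keep_large_blocks(lines: Iterable[str], min_lines: int = 6) -> Iterator[str]:
    """Yield the lines of every block with at least min_lines lines, header included."""
    for block in split_blocks(lines):
        # Skip a possible leading block that has no header (assumption: such lines are unwanted).
        if block[0].startswith("Query_") and len(block) >= min_lines:
            yield from block
```

Only one block is held in memory at a time, so this should cope with a file of about 1 GB as long as individual blocks stay small.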
