我需要在文本文件中找到一个 13 个字符长度的字符串,该字符串附近有一个重复项。
它指的是基因组的突变13。
例如:
ACGAATTGCAGCCACAGTACGAATCGCAGCC。
它以 ACGAATTGCAGCC 开头并以它结尾,但中间是长度未知的随机字符。
到目前为止我想到的是:
grep -Eo '((.){13}).{1,100}\1'
我必须在这个里面找到它
GTACCATAACTAACAACCTGAAAAGTCACAAAAACATATACAATAAAAGAACTAGATTTCGCATAGGATATATATTAATAAAGTGAACAAAAAAAAAATAACAACAACAACAACGAATGAAGAAAGGAAAAGGAATGATAAAAAAACGAGTAATAATTGAAAACAATTATAAAGTAAGAAAACCGCAACGGCCCAAGTAAGCAAAGCAAGGATAGGAAATTGATCGACACAACTCCATAAAATTTACAACTAGTACTCAGAAAAAATAACTAAGCTATATCCATATCTACTCTAAAAAAGAAAAGGAATAACGGAACACCCACAAAGAAACTCAATTAGCAAAAACCACAGATAATACAAACCAGAGAAGACCACATAAAAAAATGAACGAGTTACCCTTCAAATTAAAATAAATCTACCAGTAAGCATAAAAACAACAAAGTTACAAAACCAAAGACCAAAAGTAGAAATCAGAACAAGGGACATAAACGTTCACCAAATGAATGAAACAACACAATTTAGAAACAAAAAAGAGGAATAAAAAGCCAGAACAGGAGTACGAACATAATTAATTATGAAAGTGACCTACAAATAAGAAGGAAACACAAACAGAAAACAACTAACCACAAAAAAGACATAATAGTAAACAAAAAAAAAAAACTTACTCATACGAGGACTAATAAAAGATTCAAAACAATACAATTGACGAAAACTCAACGAGGAAAGCTAGAAAACCACCAGAGAAACTCAAAACACAAATAGAGATAAAAAAAAAAACCATAAAGAAAAATTCTTACATCGTCACAGCCAAGGAAAAAAAGAAATCGTTAAAATGGAACGCAGTCGAACACAAAAAGACAACACAGAACAAAAAAGGCAAACAGCGTAGAAACAAATACACTCGCGTAGCAAAGGGGCGGCGTCACGCTTGAAACATAAAAATAACCACTGTATATCACGACAATCAACAAAGTCTACATCAAGAAAATCAAAAAAATAC
答案1
你已经很接近了,问题是 100,太窄了!您可能需要考虑使用 Perl PCRE 而不是 Posix Extended。性能差异非常明显。
grep -Po '((.){13}).{1,1000}?\1' genom
AACAAAAAAAAAATAACAACAACAACAACGAATGAAGAAAGGAAAAGGAATGATAAAAAAACGAGTAATAATTGAAAACAATTATAAAGTAAGAAAACCGCAACGGCCCAAGTAAGCAAAGCAAGGATAGGAAATTGATCGACACAACTCCATAAAATTTACAACTAGTACTCAGAAAAAATAACTAAGCTATATCCATATCTACTCTAAAAAAGAAAAGGAATAACGGAACACCCACAAAGAAACTCAATTAGCAAAAACCACAGATAATACAAACCAGAGAAGACCACATAAAAAAATGAACGAGTTACCCTTCAAATTAAAATAAATCTACCAGTAAGCATAAAAACAACAAAGTTACAAAACCAAAGACCAAAAGTAGAAATCAGAACAAGGGACATAAACGTTCACCAAATGAATGAAACAACACAATTTAGAAACAAAAAAGAGGAATAAAAAGCCAGAACAGGAGTACGAACATAATTAATTATGAAAGTGACCTACAAATAAGAAGGAAACACAAACAGAAAACAACTAACCACAAAAAAGACATAATAGTAAACAAAAAAAAAA
我的机器上的时间比较:
Posix: (-E) 0m4.816s
Perl: (-P) 0m0.011s