Finding repeated strings that are close to each other in text

I need to find a 13-character string in a text file that has a duplicate of itself nearby.
This is about mutations in a genome.

For example:

ACGAATTGCAGCCACAGTACGAATTGCAGCC

It starts and ends with ACGAATTGCAGCC, with random characters of unknown length in between.

This is what I have come up with so far:

grep -Eo '((.){13}).{1,100}\1'

I need to find it in this:

GTACCATAACTAACAACCTGAAAAGTCACAAAAACATATACAATAAAAGAACTAGATTTCGCATAGGATATATATTAATAAAGTGAACAAAAAAAAAATAACAACAACAACAACGAATGAAGAAAGGAAAAGGAATGATAAAAAAACGAGTAATAATTGAAAACAATTATAAAGTAAGAAAACCGCAACGGCCCAAGTAAGCAAAGCAAGGATAGGAAATTGATCGACACAACTCCATAAAATTTACAACTAGTACTCAGAAAAAATAACTAAGCTATATCCATATCTACTCTAAAAAAGAAAAGGAATAACGGAACACCCACAAAGAAACTCAATTAGCAAAAACCACAGATAATACAAACCAGAGAAGACCACATAAAAAAATGAACGAGTTACCCTTCAAATTAAAATAAATCTACCAGTAAGCATAAAAACAACAAAGTTACAAAACCAAAGACCAAAAGTAGAAATCAGAACAAGGGACATAAACGTTCACCAAATGAATGAAACAACACAATTTAGAAACAAAAAAGAGGAATAAAAAGCCAGAACAGGAGTACGAACATAATTAATTATGAAAGTGACCTACAAATAAGAAGGAAACACAAACAGAAAACAACTAACCACAAAAAAGACATAATAGTAAACAAAAAAAAAAAACTTACTCATACGAGGACTAATAAAAGATTCAAAACAATACAATTGACGAAAACTCAACGAGGAAAGCTAGAAAACCACCAGAGAAACTCAAAACACAAATAGAGATAAAAAAAAAAACCATAAAGAAAAATTCTTACATCGTCACAGCCAAGGAAAAAAAGAAATCGTTAAAATGGAACGCAGTCGAACACAAAAAGACAACACAGAACAAAAAAGGCAAACAGCGTAGAAACAAATACACTCGCGTAGCAAAGGGGCGGCGTCACGCTTGAAACATAAAAATAACCACTGTATATCACGACAATCAACAAAGTCTACATCAAGAAAATCAAAAAAATAC

Answer 1

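You were close; the problem is that a window of 100 is too narrow! You may also want to use Perl-compatible regular expressions (PCRE, grep -P) instead of POSIX extended ones (grep -E). The performance difference is substantial.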

grep -Po '((.){13}).{1,1000}?\1' genom
AACAAAAAAAAAATAACAACAACAACAACGAATGAAGAAAGGAAAAGGAATGATAAAAAAACGAGTAATAATTGAAAACAATTATAAAGTAAGAAAACCGCAACGGCCCAAGTAAGCAAAGCAAGGATAGGAAATTGATCGACACAACTCCATAAAATTTACAACTAGTACTCAGAAAAAATAACTAAGCTATATCCATATCTACTCTAAAAAAGAAAAGGAATAACGGAACACCCACAAAGAAACTCAATTAGCAAAAACCACAGATAATACAAACCAGAGAAGACCACATAAAAAAATGAACGAGTTACCCTTCAAATTAAAATAAATCTACCAGTAAGCATAAAAACAACAAAGTTACAAAACCAAAGACCAAAAGTAGAAATCAGAACAAGGGACATAAACGTTCACCAAATGAATGAAACAACACAATTTAGAAACAAAAAAGAGGAATAAAAAGCCAGAACAGGAGTACGAACATAATTAATTATGAAAGTGACCTACAAATAAGAAGGAAACACAAACAGAAAACAACTAACCACAAAAAAGACATAATAGTAAACAAAAAAAAAA
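
To see what the trailing ? after {1,1000} changes, here is a small sketch on made-up data rather than the genome above. The names unit and seq and the two spacer strings are hypothetical. The inner group is dropped because \1 already refers to the outer one. It assumes a grep built with PCRE support (-P), as GNU grep usually is.

```shell
# Hypothetical test data: the same 13-character unit three times,
# separated by two short 5-character spacers.
unit='ACGAATTGCAGCC'
seq="${unit}ACAGT${unit}TTGCA${unit}"

# Greedy gap: .{1,1000} takes as much as it can, so the match runs
# from the first copy of the unit to the last one.
greedy=$(printf '%s\n' "$seq" | grep -Po '(.{13}).{1,1000}\1')

# Lazy gap: .{1,1000}? takes as little as it can, so the match stops
# at the nearest repeat.
lazy=$(printf '%s\n' "$seq" | grep -Po '(.{13}).{1,1000}?\1')

printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"
```

The greedy form should print the whole 49-character sequence, while the lazy form should stop after the second copy of the unit, which is what you want when looking for the nearest repeat.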

Timing comparison on my machine:

POSIX: (-E)  0m4.816s
Perl:  (-P)  0m0.011s
