How to grep for a long string that spans multiple lines without knowing where the line breaks are

Answer 1

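Now that I understand the question better, I am adding a new answer. I am posting it only as a working example; I do not claim it is a good one. :)

I also understand that the question seems to rule out Python because of efficiency concerns, so I realize this approach does not satisfy the whole requirement. :(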

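#!/usr/bin/env python3
import sys

def findall_iter(S, pat):
  # yield the index of every occurrence of pat in S
  index = -1
  while True:
    try:
      index = S.index(pat, index+1)
    except ValueError:
      # no more matches; a plain return ends the generator
      # (raising StopIteration here is an error on Python 3.7+)
      return
    yield index

def findall(S, pat):
  return list(findall_iter(S, pat))

# read in arguments: the pattern first, then the file name
pattern = sys.argv[1]
with open(sys.argv[2]) as f:
  S = f.read()

# get indices of all newlines
newline_indices = findall(S, '\n')

# get pseudo-indices of all pattern matches
# (indices into the text with the newlines removed)
pat_indices = findall(S.replace('\n', ''), pattern)

# iterate through each pattern match pseudo-index and
# correlate it back to a real line number from the file
line_numbers = []
for pi in pat_indices:
  # default: the match starts on a last line that has no trailing newline
  line = len(newline_indices) + 1
  for i, ni in enumerate(newline_indices):
    # i newlines precede newline i, so it sits at pseudo-index ni - i
    if ni > pi+i:
      line = i + 1
      break
  if line not in line_numbers:
    line_numbers.append(line)

print('\n'.join(map(str, line_numbers)))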

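Pros:

Everything is done in memory, as long as the file is not too large (<1GB).
Uses the str.index method to find substrings rather than (slower) regular expression matching.
Clearer than using regular expressions.

Cons:

Not suitable for large files.
Creates two temporary strings to do the job.
The final for loop is hard to understand.
It is Python (which I personally do not consider a drawback).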
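
Since the final for loop is listed as hard to follow, here is a minimal sketch of the same offset-to-line mapping done with a binary search instead. It assumes Python 3, and the names match_lines, ends and joined are illustrative only; they are not part of the script above.

```python
from bisect import bisect_right
from itertools import accumulate

def match_lines(text, pattern):
    """Return the sorted 1-based line numbers on which a match of
    pattern starts, with newlines in text ignored."""
    if not pattern:
        return []
    lines = text.split('\n')
    # ends[k] is the offset, in the newline-free text, just past line k+1
    ends = list(accumulate(len(line) for line in lines))
    joined = ''.join(lines)
    found = set()
    pos = joined.find(pattern)
    while pos != -1:
        # the number of lines that end at or before pos, plus one,
        # is the line on which the match starts
        found.add(bisect_right(ends, pos) + 1)
        pos = joined.find(pattern, pos + 1)
    return sorted(found)

if __name__ == '__main__':
    print(match_lines("abc\ndef\n", "cd"))  # → [1]
```

This still builds a second, newline-free copy of the text, so it shares the memory limitation noted in the cons.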

Answer 2

I would do this with a sed script. Put it in a file and run it with sed -nf.

:restart
/gcbcdbfceebcfhfchaaccdgfcegffgedffaeaedc$/{
    #   Found the first part, now discard it
    s/^.*$//
    #   Read a new line into the buffer
    N
    #   Discard the new line inserted by the N operation
    s/^\n//
    #   If next line isn't a match, start over
    /^baedhacebeeebcechbcbfeeccbdhcbfg/!b restart
    #   If it is a match, print the line number
    =
    }

This is what it looks like when run under bash. Note that it prints the line number of the second line of the match.

bash-4.1$ cat sample.txt
abcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcde
abcdeabcde***gcbcdbfceebcfhfchaaccdgfcegffgedffaeaedc
baedhacebeeebcechbcbfeeccbdhcbfg***ggfbhbgcedabceedfa
fbaaechaabdbffbebecebaacfcfcdcggfchddcefbcbdegbbba
bash-4.1$
bash-4.1$ cat findmatch.sed
:restart
/gcbcdbfceebcfhfchaaccdgfcegffgedffaeaedc$/{
   #  Found the first part, now discard it
   s/^.*$//
   #  Read a new line into the buffer
   N
   #  Discard the new line inserted by the N operation
   s/^\n//
   #  If next line isn't a match, start over
   /^baedhacebeeebcechbcbfeeccbdhcbfg/!b restart
   #  If it is a match, print the line number
   =
   }
bash-4.1$
bash-4.1$ sed -nf findmatch.sed sample.txt
3
bash-4.1$
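
For comparison, the check the sed script performs can be sketched in Python: remember the previous line, and report the current line number when the previous line ends with the first part and the current line starts with the second part. This is only an illustration under that reading of the script; find_split_match and its parameter names are hypothetical. The two parts are compared as literal strings here, which behaves the same as the sed regexes only because they contain no special characters.

```python
def find_split_match(lines, first_tail, second_head):
    """Return the 1-based numbers of lines that start with second_head
    and directly follow a line ending with first_tail."""
    hits = []
    prev = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if (prev is not None and prev.endswith(first_tail)
                and line.startswith(second_head)):
            hits.append(number)
        prev = line
    return hits

if __name__ == '__main__':
    sample = [
        "abcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcde",
        "abcdeabcde***gcbcdbfceebcfhfchaaccdgfcegffgedffaeaedc",
        "baedhacebeeebcechbcbfeeccbdhcbfg***ggfbhbgcedabceedfa",
        "fbaaechaabdbffbebecebaacfcfcdcggfchddcefbcbdegbbba",
    ]
    print(find_split_match(sample,
                           "gcbcdbfceebcfhfchaaccdgfcegffgedffaeaedc",
                           "baedhacebeeebcechbcbfeeccbdhcbfg"))  # → [3]
```

Like the sed script, this only finds the string when the line break falls exactly between the two given parts.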

Answer 3

I am a bit confused about what constraints you are operating under. But if you need line numbers, both grep and pcregrep can give them to you with the -n flag.

$ pcregrep -nM "gcbcdbfceebcfhfchaaccdgfcegffgedffaeaedc\nbaedhacebeeebcechbcbfeeccbdhcbfg" | cut -d: -f1
2
baedhacebeeebcechbcbfeeccbdhcbfg***ggfbhbgcedabceedfa

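pcregrep shows only the line number of the first line of the match. Obviously, if you want only the line numbers in the output, you will have to use sed to skip every other line of the output (pipe the above into sed -n 'p;N').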
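
The pcregrep pattern above hard-codes where the newline falls. When that position is not known, one possible approach (not what the command above does) is to allow an optional newline between every pair of characters. A rough sketch using Python's re module; multiline_pattern and first_lines are made-up names for illustration:

```python
import re

def multiline_pattern(needle):
    """Compile a regex that matches needle with an optional newline
    between any two adjacent characters."""
    return re.compile(r'\n?'.join(re.escape(ch) for ch in needle))

def first_lines(text, needle):
    """Return the 1-based line number on which each non-overlapping
    match of needle starts."""
    if not needle:
        return []
    rx = multiline_pattern(needle)
    # the line number is one more than the count of newlines before the match
    return [text.count('\n', 0, m.start()) + 1 for m in rx.finditer(text)]

if __name__ == '__main__':
    print(first_lines("x\nab\ncd", "bc"))  # → [2]
```

As with pcregrep, the reported number is the line on which the match begins. The sketch tolerates a single newline at each position, so a blank line in the middle of the string would not match.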
