Finding multiple positions of a string in a large text file

Answer 1

$ awk -v str='to' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=pos+length(str)-1 } }' file
1: 1
1: 14

Or, more nicely formatted:

awk -v str='to' '
    {
        off = 0  # current offset in the line from whence we are searching
        while (pos = index(substr($0, off + 1), str)) {
            # pos is the position within the substring where the string was found
            printf("%d: %d\n", NR, pos + off)
            off += pos + length(str) - 1  # resume right after the end of this match
        }
    }' file

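The awk program prints the line number followed by the position of the string on that line. If the string occurs several times on a line, several lines of output are produced.

The program uses the index() function to find the string in the line and, if it is found, prints the position on the line where it was found. It then repeats the process on the remainder of the line (using the substr() function) until no further instances of the string are found.

In the code, the off variable keeps track of the offset from the start of the line at which the next search must begin. The pos variable holds the position, within the substring at offset off, where the string was found.

The string is passed on the command line using -v str='to'.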
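The loop above moves past each match before searching again, so occurrences that overlap one another are reported only once. As a minimal sketch, assuming every overlapping occurrence should be listed, the offset can advance by a single character instead. The function name find_overlapping and the sample input are made up for this illustration.

```shell
# find_overlapping STRING: hypothetical helper. Reads text on stdin and prints
# "line: column" for every occurrence of STRING, overlapping ones included.
# Note: awk -v interprets backslash escapes in the assigned value.
find_overlapping() {
    awk -v str="$1" '
    BEGIN { if (str == "") exit 1 }   # refuse an empty search string
    {
        off = 0
        while (pos = index(substr($0, off + 1), str)) {
            printf("%d: %d\n", NR, pos + off)
            off += pos   # resume one character after the start of this match
        }
    }'
}

printf 'aaa\nbanana\n' | find_overlapping 'ana'
```

On this input the two overlapping occurrences in banana are reported as 2: 2 and 2: 4.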

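Example (the columns are byte positions as counted by the awk that produced this output; the curly quote ‘ occupies three bytes in UTF-8, so an awk that counts characters would report 2: 24 instead of 2: 26):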

$ cat file
To be, or not to be: that is the question:
Whether ‘tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, ‘tis a consummation
Devoutly to be wish’d. To die, to sleep;

$ awk -v str='the' '{ off=0; while (pos=index(substr($0,off+1), str)) { printf("%d: %d\n", NR, pos+off); off+=pos+length(str)-1 } }' file
1: 30
2: 4
2: 26
5: 21
7: 20
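The search is case-sensitive, which is why the capitalised The at the start of lines 3 and 7 is not listed. A sketch of a case-insensitive variant, assuming that folding both the line and the search string with tolower() is acceptable for the data at hand; find_ci is a hypothetical name.

```shell
# find_ci STRING: hypothetical helper. Like the awk loop above, but compares
# lower-cased copies of the line and of STRING, so "To" and "to" both match.
find_ci() {
    awk -v str="$1" '
    BEGIN { if (str == "") exit 1; str = tolower(str) }
    {
        line = tolower($0)
        off = 0
        while (pos = index(substr(line, off + 1), str)) {
            printf("%d: %d\n", NR, pos + off)
            off += pos + length(str) - 1   # resume right after this match
        }
    }'
}

printf 'To be, or not to be\n' | find_ci 'to'
```

For this one-line input it prints 1: 1 and 1: 15.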

Answer 2

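Try

grep -b 'to' file

for the byte offset from the start of the file, or

grep -nb 'to' file

for line numbers and offsets. Without -o, the offset that -b reports is that of the start of each matching line, not of the match itself.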
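To make the difference between these offsets concrete, here is a small sketch on made-up two-line input, assuming a grep that supports -b and -o (GNU grep does; these options are not part of POSIX).

```shell
# "abc\n" is 4 bytes long, so the second line starts at byte offset 4
# and the "to" inside it starts at byte offset 7.
printf 'abc\nxx to yy\n' | grep -b 'to'    # 4:xx to yy    (offset of the line)
printf 'abc\nxx to yy\n' | grep -nb 'to'   # 2:4:xx to yy  (line number, then offset)
printf 'abc\nxx to yy\n' | grep -ob 'to'   # 7:to          (offset of the match)
```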

Answer 3

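If your file has multiple lines and you only need the first occurrence of the string, you can have GNU sed read the whole file as a single record (-z), cut everything after the first match, and count the bytes that remain:

sed -z 's/to.*/to/' YourFile | wc -c

The result is the number of bytes up to and including the first to; subtract the length of the string to get the 0-based offset at which it starts. If the string does not occur at all, the count is simply the size of the file. Note that a bracket expression such as [^to] cannot be used to mean "anything that is not the string to": it matches any single character other than t or o.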
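As a point of comparison, a sketch of another way to get only the first occurrence, assuming a grep with -o and -b is available: print every match offset and keep the first line. The name first_offset is hypothetical; -F makes grep treat the argument as a fixed string rather than a regular expression.

```shell
# first_offset STRING: hypothetical helper. Reads text on stdin and prints the
# 0-based byte offset of the first occurrence of STRING, or nothing if absent.
first_offset() {
    grep -aobF -- "$1" | head -n 1 | cut -d: -f1
}

printf 'abc\nxx to yy to\n' | first_offset 'to'   # prints 7
```

Unlike the sed pipeline, this prints nothing when the string is not found, which makes the "no match" case easy to tell apart.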

Answer 4

You can do this with grep as follows:

$ grep -aob 'to' file | grep -oE '[0-9]+'
0
13

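By the way, your math is off when you say you expect to find 0,14. If the first to is counted as starting at position 0, the second one starts at position 13, and your coordinates appear to be 0-based.

If you want the output above as a comma-separated list of coordinates: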

$ grep -aob 'to' file | grep -oE '[0-9]+' | paste -s -d ','
0,13

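How does it work?

This method relies on GNU grep's ability to print the byte offset of matches (-b), and we force it to print only the matches themselves via the -o switch.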

   -b, --byte-offset
          Print the 0-based byte offset within the input file before each
          line of output.  If -o (--only-matching) is specified, print the 
          offset of the matching part itself.
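One caveat with the second grep: it keeps every run of digits, so digits inside the matched text are picked up as well. A sketch of a variant that takes only the offset field with cut, assuming the pattern may contain digits (the sample input is made up):

```shell
# grep -ob prints "OFFSET:MATCH"; extracting all digit runs also catches the
# "2" that is part of the match, whereas cut keeps only the first field.
printf 'to2 be to2\n' | grep -aob 'to2' | grep -oE '[0-9]+' | paste -s -d ',' -   # 0,2,7,2
printf 'to2 be to2\n' | grep -aob 'to2' | cut -d: -f1 | paste -s -d ',' -         # 0,7
```

The explicit - operand tells paste to read standard input, which some implementations require.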

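A more advanced example

If your sample contains words such as toto, or spans multiple lines, an improved version of the approach above can handle these as well.

Sample data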

$ cat file
to be or not to be, that's the question
that is the to to question
toto is a dog

Example

$ grep -aob '\bto\b' file | grep -oE '[0-9]+' | paste -s -d ','
0,13,52,55

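Here we put the word boundary \b on either side of the word we want to count, so that only standalone occurrences of the string to are counted, and not words such as toto that merely contain it.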
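The \b escape is an extension to the regular expression syntax rather than something every grep is guaranteed to understand. A sketch of the same idea using the -w (whole word) option instead, assuming a grep that accepts -w together with -o and -b, as GNU grep does; the input is made up:

```shell
# "to be toto\n" is 11 bytes long; the standalone "to" on the second line
# therefore starts at byte 11 + 4 = 15. The "to"s inside "toto" are skipped.
printf 'to be toto\nnot to\n' | grep -aobw 'to' | cut -d: -f1 | paste -s -d ',' -   # 0,15
```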
