Replacing a string in a huge (70GB) single-line text file

Answer 1

For a file this large, one option is Flex. Let unk.l be:

%%
\<unk\>     printf("<raw_unk>");  
%%

Then compile and run:

$ flex -o unk.c unk.l
$ cc -o unk -O2 unk.c -lfl
$ ./unk < corpus.txt > corpus.txt.new

Answer 2

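The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it and outputting the result, then moving on to the next record (line).

If there is an ASCII character that appears frequently in the file and does not appear in <unk> or <raw_unk>, you can use it as the record separator. Since most tools don't allow custom record separators, swap that character with newlines. tr processes bytes, not lines, so it doesn't care about record size. Supposing that ; works: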

<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new
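To see why the swap leaves the data intact, here is a minimal Python sketch of the same three stages applied to a small in-memory sample. The function name swap_and_replace and the sample data are made up for illustration, and unlike the pipeline it holds everything in memory, so it only demonstrates the idea and is not meant for the 70GB file.

```python
# Translation table that exchanges newline and ';', like: tr '\n;' ';\n'
SWAP = bytes.maketrans(b"\n;", b";\n")

def swap_and_replace(data, search=b"<unk>", replace=b"<raw_unk>"):
    """Hypothetical helper mirroring the tr | sed | tr pipeline on bytes."""
    swapped = data.translate(SWAP)        # every ';' now ends a short "line"
    records = swapped.split(b"\n")        # what sed would see, one record each
    fixed = [r.replace(search, replace) for r in records]
    return b"\n".join(fixed).translate(SWAP)  # swap back to the original layout

sample = b"a <unk>; b\n<unk>;"
print(swap_and_replace(sample))
```

Because the second translation is the inverse of the first, the original newlines and semicolons come back unchanged; only the search string differs in the output.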

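You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and that it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.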

<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new

Alternatively, use the last character.

<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new
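The anchoring idea can be illustrated on a small sample: once the text is split at every <, a real match is exactly a record that begins with unk>. The sketch below is only an in-memory illustration; replace_anchored and its parameters are hypothetical names, and it skips the first record for the same reason as the 2,$ address mentioned above.

```python
def replace_anchored(data, sep=b"<", tail=b"unk>", new_tail=b"raw_unk>"):
    """Hypothetical helper: split at the pattern's first byte and anchor on the rest."""
    records = data.split(sep)
    # The first record had no separator in front of it, so it can never be a match.
    out = [records[0]]
    for rec in records[1:]:
        if rec.startswith(tail):          # equivalent to sed's ^unk> on this record
            rec = new_tail + rec[len(tail):]
        out.append(rec)
    return sep.join(out)                  # put the separators back

print(replace_anchored(b"unk> a <unk> b <unk"))
```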

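Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble.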
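Before committing to a separator, it may help to check two things: what the last byte of the file is, and how long the longest record would be with a given separator. The following is a rough sketch under the assumption that the file is a regular file on disk; last_byte and longest_record are illustrative names, not part of any tool mentioned here.

```python
import os

def last_byte(path):
    """Return the final byte of the file as bytes (b'' for an empty file)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return b""
        f.seek(-1, os.SEEK_END)
        return f.read(1)

def longest_record(path, sep, chunk_size=1 << 20):
    """Length of the longest run of bytes between occurrences of the one-byte sep."""
    longest = current = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pieces = chunk.split(sep)
            # The first piece continues the record carried over from the last chunk.
            current += len(pieces[0])
            for piece in pieces[1:]:
                longest = max(longest, current)
                current = len(piece)
    return max(longest, current)
```

If longest_record reports a value that fits comfortably in RAM, the separator appears often enough for the tr and sed approach.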

Answer 3

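So you don't have enough physical memory (RAM) to hold the whole file at once, but on a 64-bit system you have enough virtual address space to map the entire file. Virtual mappings can be useful as a simple hack in cases like this.

The necessary operations are all included in Python. There are several annoying subtleties, but it does avoid having to write C code. In particular, care is needed to avoid copying the file in memory, which would defeat the point entirely. On the plus side, you get error reporting for free (Python "exceptions") :).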

#!/usr/bin/python3
# This script takes input from stdin
# (but it must be a regular file, to support mapping it),
# and writes the result to stdout.

import sys
import mmap

search = b'<unk>'
replace = b'<raw_unk>'

# sys.stdout requires str, but we want to write bytes
out_bytes = sys.stdout.buffer

mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ)
# Note: only the first occurrence of the search string is located and replaced.
i = mem.find(search)
if i < 0:
    sys.exit("Search string not found")

# mmap object subscripts to bytes (making a copy)
# memoryview object subscripts to a memoryview object
# (it implements the buffer protocol).
view = memoryview(mem)

out_bytes.write(view[:i])
out_bytes.write(replace)
out_bytes.write(view[i+len(search):])
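The script above writes out a single replacement. If every occurrence should be replaced, the same mmap and memoryview approach can be put in a loop. This is a sketch rather than a tested production tool; replace_all and map_readonly are names invented here, and mem is assumed to be a read-only mapping created as in the script.

```python
import mmap

def map_readonly(fileno):
    """Map a whole regular file read-only, as the script above does for stdin."""
    return mmap.mmap(fileno, 0, access=mmap.ACCESS_READ)

def replace_all(mem, search, replace, out):
    """Write mem to out with every occurrence of search replaced; return the count."""
    view = memoryview(mem)      # slicing a memoryview does not copy the data
    pos = 0
    count = 0
    while True:
        i = mem.find(search, pos)
        if i < 0:
            break
        out.write(view[pos:i])  # unchanged bytes before this match
        out.write(replace)
        pos = i + len(search)
        count += 1
    out.write(view[pos:])       # unchanged bytes after the last match
    return count
```

It could be called as replace_all(map_readonly(sys.stdin.fileno()), search, replace, sys.stdout.buffer) after importing sys. The writes between matches can still be very large, but they go straight from the mapping to the output without an intermediate copy of the file.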

Answer 4

I think a C version might perform much better:

#include <stdio.h>

#define PAT_LEN 5

int main(void)
{
    /* note this is not a general solution. In particular the pattern
     * must not have a repeated sequence at the start, so <unk> is fine
     * but aardvark is not, because it starts with "a" repeated, and ababc
     * is not because it starts with "ab" repeated. */
    char pattern[] = "<unk>";          /* set PAT_LEN to length of this */
    char replacement[] = "<raw_unk>";
    int c;
    int i, j;

    /* i counts how many leading bytes of pattern are currently matched */
    for (i = 0; (c = getchar()) != EOF;) {
        if (c == pattern[i]) {
            i++;
            if (i == PAT_LEN) {
                fputs(replacement, stdout);
                i = 0;
            }
        } else {
            /* mismatch: emit the partial match that was held back */
            if (i > 0) {
                for (j = 0; j < i; j++) {
                    putchar(pattern[j]);
                }
                i = 0;
            }
            /* the mismatching byte may itself start a new match */
            if (c == pattern[0]) {
                i = 1;
            } else {
                putchar(c);
            }
        }
    }
    /* flush a partial match left pending at end of file */
    for (j = 0; j < i; j++) {
        putchar(pattern[j]);
    }
    return 0;
}

EDIT: Modified according to suggestions from the comments. Also fixed a bug with the pattern <<unk>.
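The comment in the C program notes that its matching is not general. For comparison, here is a sketch of a different technique that has no restriction on the pattern: read fixed-size chunks and hold back the last len(search) - 1 bytes of each one, since those are the only bytes that could begin a match completed by the next chunk. The name stream_replace and the chunk_size default are assumptions for illustration, and this version is not claimed to be faster than the C program.

```python
def stream_replace(src, dst, search, replace, chunk_size=1 << 20):
    """Copy binary stream src to dst, replacing every occurrence of search."""
    if not search:
        raise ValueError("search must not be empty")
    keep = len(search) - 1      # longest partial match a chunk can end with
    buf = b""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pos = 0
        while True:
            i = buf.find(search, pos)
            if i < 0:
                break
            dst.write(buf[pos:i])
            dst.write(replace)
            pos = i + len(search)
        # Everything before 'safe' can no longer be part of a future match.
        safe = max(pos, len(buf) - keep)
        dst.write(buf[pos:safe])
        buf = buf[safe:]
    dst.write(buf)              # leftover tail, including any partial match at EOF
```

A possible invocation, after importing sys, is stream_replace(sys.stdin.buffer, sys.stdout.buffer, b'<unk>', b'<raw_unk>'). Memory use stays around one chunk plus the held-back tail, regardless of how long the single line is.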
