How do I create a text file (1 GB) containing random characters in UTF-8 encoding?

Answer 1

If you want the UTF-8 encoding of code points 0 to 0x7FFFFFFF (the range the UTF-8 encoding algorithm was originally designed to handle):

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\4}
    no warnings "utf8";
    print chr(unpack("L>",$_) & 0x7fffffff)'
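For readers who do not read Perl, here is a minimal Python sketch of what the original UTF-8 scheme (as described in RFC 2279) does with such large values, using up to six bytes per code point. The name `encode_utf8_extended` is made up for this illustration. Python's built-in encoder refuses anything above 0x10FFFF, so the bytes are assembled by hand.

```python
def encode_utf8_extended(cp: int) -> bytes:
    """Encode a code point in 0..0x7FFFFFFF with the original 1-6 byte UTF-8 scheme."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])  # ASCII is encoded as a single byte

    # (exclusive upper limit, lead byte marker, number of continuation bytes)
    layouts = [
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ]
    for limit, lead, n_cont in layouts:
        if cp < limit:
            break

    # The lead byte carries the top bits, each continuation byte carries 6 bits.
    out = [lead | (cp >> (6 * n_cont))]
    for i in range(n_cont - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)
```

For code points that today's Unicode still allows, the result matches Python's own encoder. Values above 0x10FFFF produce five- and six-byte sequences that most modern decoders reject.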

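Nowadays, Unicode is restricted to 0..D7FF and E000..10FFFF (though some of those code points are unassigned, and some of them never will be assigned because they are defined as non-characters).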

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\3}
    no warnings "utf8";
    $c = unpack("L>","\0$_") * 0x10f800 >> 24;
    $c += 0x800 if $c >= 0xd800;
    print chr($c)'
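The arithmetic in that one-liner is compact, so here is a rough Python sketch of the same mapping. It scales a 24-bit random number onto the 0x10F800 valid values and then skips over the surrogate gap. The function name `scalar_from_24_bits` is only an illustrative choice.

```python
import os


def scalar_from_24_bits(raw: bytes) -> int:
    """Map 3 random bytes onto 0..D7FF, E000..10FFFF."""
    if len(raw) != 3:
        raise ValueError("expected exactly 3 bytes")
    n = int.from_bytes(raw, "big")  # 0 .. 0xFFFFFF
    c = (n * 0x10F800) >> 24        # scale to 0 .. 0x10F7FF
    if c >= 0xD800:
        c += 0x800                  # jump over the surrogates D800..DFFF
    return c


# One random character, encoded as UTF-8
sample = chr(scalar_from_24_bits(os.urandom(3))).encode("utf-8")
```

The count 0x10F800 is simply 0x110000 code points minus the 0x800 surrogates, which is why adding 0x800 past 0xD7FF lands exactly on 0xE000..0x10FFFF.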

If you only want assigned characters, you can pipe that to:

uconv -x '[:unassigned:]>;'

Or change it to:

< /dev/urandom perl -CO -ne '
    BEGIN{$/=\3}
    no warnings "utf8";
    $c = unpack("L>","\0$_") * 0x10f800 >> 24;
    $c += 0x800 if $c >= 0xd800;
    $c = chr $c;
    print $c if $c =~ /\P{unassigned}/'

You may prefer:

             if $c =~ /[\p{Space}\p{Graph}]/ && $c !~ /\p{Co}/

to get only the graphical and spacing characters (excluding those in the private use areas).
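The same kind of filter can be approximated in Python with the standard `unicodedata` module. This is only a sketch: the helper names are invented here, `str.isspace()` is not guaranteed to match Perl's `\p{Space}` exactly, and which code points count as assigned depends on the Unicode version bundled with your Python.

```python
import unicodedata


def is_assigned(ch: str) -> bool:
    """True unless the character has general category Cn (unassigned)."""
    return unicodedata.category(ch) != "Cn"


def is_graphic_or_space(ch: str) -> bool:
    """Rough stand-in for /[\\p{Space}\\p{Graph}]/ minus private use (Co)."""
    cat = unicodedata.category(ch)
    if cat == "Co":
        return False  # private use
    if ch.isspace():
        return True   # spacing characters are kept
    # Graph excludes controls (Cc), surrogates (Cs) and unassigned (Cn)
    return cat not in ("Cc", "Cs", "Cn")


kept = [ch for ch in "a \u0378\ue000\x00" if is_graphic_or_space(ch)]
```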

Now, to get 1 GiB of that, you can pipe it to head -c1G (assuming GNU head), but beware that the last character may be cut in the middle.
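If a truncated final character matters, one option is to trim the tail afterwards. Below is a minimal sketch, assuming standard UTF-8 with sequences of at most four bytes (so it does not cover the extended encoding from the first command). The function name `trim_incomplete_utf8` is hypothetical.

```python
def trim_incomplete_utf8(data: bytes) -> bytes:
    """Drop a trailing UTF-8 sequence that was cut off mid-character."""
    # The lead byte of the last sequence is at most 4 bytes from the end.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, keep looking for the lead byte
        if byte < 0x80:
            need = 1
        elif byte >> 5 == 0b110:
            need = 2
        elif byte >> 4 == 0b1110:
            need = 3
        elif byte >> 3 == 0b11110:
            need = 4
        else:
            return data  # not a standard lead byte, leave the data alone
        # Keep the sequence only if all of its bytes are present.
        return data if back >= need else data[:-back]
    return data
```

For a 1 GiB file you would apply this to the last few bytes only and truncate the file accordingly, rather than loading everything into memory.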

Answer 2

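A quick way to create a text file of about 10 MB in UTF-8 encoding (ASCII only, which is a subset of UTF-8):

base64 /dev/urandom | head -c 10000000 | grep -Eao "\w" | tr -d '\n' > file10MB.txt

Note that head runs before the filtering, so the file ends up somewhat smaller than 10,000,000 bytes once the +, / and newline characters are dropped.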
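If the file needs to be exactly the requested size, the same idea can be sketched in Python by drawing directly from the characters that `\w` matches in ASCII. The function name and the output file name in the comment are placeholders, not part of the original answer.

```python
import random
import string

# The ASCII characters matched by \w: letters, digits and the underscore
WORD_CHARS = string.ascii_letters + string.digits + "_"


def random_ascii_text(n_bytes: int) -> str:
    """Return exactly n_bytes random word characters (1 byte each in UTF-8)."""
    return "".join(random.choices(WORD_CHARS, k=n_bytes))


# Hypothetical usage for a 10 MB file:
# with open("file10MB.txt", "w", encoding="utf-8") as f:
#     f.write(random_ascii_text(10_000_000))
```

Because every character is ASCII, the character count and the byte count are the same.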

Answer 3

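Grepping for ASCII characters (a subset of UTF-8) on Linux/GNU:

dd if=/dev/random bs=1 count=1G | grep -Eao "\w" | tr -d '\n'

Only the bytes that match \w survive the filter, so the output is much smaller than the 1 GiB that dd reads.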

Answer 4

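If you need non-ASCII characters, then you need a way to build valid UTF-8 sequences. The likelihood of two consecutive random bytes forming valid UTF-8 is quite low.

Instead, this Python script creates random 8-bit values, which can be converted to Unicode characters and then written out as UTF-8: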

import random

char_count = 0

with open("random-utf8.txt", "w", encoding="utf-8") as my_file:

    # This counts characters, not bytes: values above 0x7F take two
    # bytes each in UTF-8, so the file is larger than this in bytes.
    while char_count < 1000000 * 1024:
        rand_long = random.getrandbits(8)

        # Ignore control characters (C0, DEL, C1) and the space
        if rand_long <= 32 or 0x7F <= rand_long <= 0x9F:
            continue

        # Python 3: chr() replaces Python 2's unichr()
        unicode_char = chr(rand_long)
        my_file.write(unicode_char)
        char_count += 1

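You could also change it to use random 16-bit numbers, which would produce non-Latin characters as well. In that case the surrogate code points (0xD800 to 0xDFFF) have to be skipped too, since they cannot be encoded as UTF-8.

It's not fast, but it is reasonably accurate.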
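As a rough sketch of the 16-bit variant, assuming Python 3: the helper names below are invented for this example. Many of the resulting code points will be unassigned or private use, so most fonts will not be able to display them.

```python
import random


def random_bmp_char(rng: random.Random) -> str:
    """Pick one random character from the Basic Multilingual Plane."""
    while True:
        cp = rng.getrandbits(16)
        # Skip control characters and the space, as in the 8-bit version
        if cp <= 32 or 0x7F <= cp <= 0x9F:
            continue
        # Skip surrogates: they are not encodable as UTF-8
        if 0xD800 <= cp <= 0xDFFF:
            continue
        return chr(cp)


def random_bmp_text(n_chars: int, seed=None) -> str:
    """Return n_chars random BMP characters as a string."""
    rng = random.Random(seed)
    return "".join(random_bmp_char(rng) for _ in range(n_chars))
```

Writing the result with `open(..., "w", encoding="utf-8")`, as in the script above, then produces one to three bytes per character.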
