What is the fastest way to generate a 1 GB text file containing random digits?

Question 1

This:

 LC_ALL=C tr '\0-\377' \
             '[0*25][1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][x*]' \
    < /dev/urandom |
    tr -d x |
    fold -w 1 |
    paste -sd "$(printf '%99s\\n')" - |
    head -c1G

(assuming a head implementation that supports -c) appears to be reasonably fast on my system.

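tr translates the whole byte range (0 to 255, that is 0 to 0377 in octal): the first 25 byte values become 0, the next 25 become 1, and so on up to 25 values for 9. The remaining values (250 to 255) become "x", which we then discard (with tr -d x). We want a uniform distribution (assuming /dev/urandom itself has a uniform distribution), so no digit may be given a bias.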
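As a rough check of that mapping, the sketch below feeds every possible byte value through the same tr table exactly once and counts what comes out. It assumes a POSIX shell and a tr that accepts the `[c*n]` repeat syntax; the helper names `all_bytes`, `map_to_digits` and `count_mapped` are made up for this illustration.

```shell
#!/bin/sh
# Emit each of the 256 possible byte values exactly once.
# (all_bytes is a hypothetical helper, not part of the original answer.)
all_bytes() {
    i=0
    while [ "$i" -lt 256 ]; do
        # Build an octal escape such as \007 and let printf emit that byte.
        printf "\\$(printf '%03o' "$i")"
        i=$((i + 1))
    done
}

# Same translation table as the pipeline above.
map_to_digits() {
    LC_ALL=C tr '\0-\377' \
        '[0*25][1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][x*]'
}

# Count how many of the 256 inputs map to the character given as $1.
count_mapped() {
    all_bytes | map_to_digits | LC_ALL=C tr -cd "$1" | wc -c
}

for c in 0 1 2 3 4 5 6 7 8 9 x; do
    echo "$c: $(count_mapped "$c")"
done
```

Each digit should be reported 25 times and x 6 times, which is why dropping the x bytes leaves the ten digits equally likely.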

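That produces one digit for about 97% of the bytes read from /dev/urandom (250 of every 256). fold -w 1 puts one digit on each line. paste -s is called with a list of separators consisting of 99 space characters followed by one newline character, so that each output line holds 100 space-separated digits.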
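The separator-list trick is easier to see at a smaller scale. The following sketch joins ten digits into lines of five rather than one hundred; `format_columns` is a hypothetical helper name, and the only assumption is a POSIX paste.

```shell
#!/bin/sh
# Join one-item-per-line input into lines of $1 space-separated items.
# (format_columns is a hypothetical helper name used only in this sketch.)
format_columns() {
    cols=$1
    # cols-1 spaces: printf pads an empty argument to the requested width.
    spaces=$(printf "%$((cols - 1))s" '')
    # paste cycles through the delimiter list: cols-1 spaces, then "\n",
    # which paste interprets as a newline.
    paste -sd "${spaces}\\n" -
}

# Ten digits, one per line, joined into two lines of five.
printf '%s\n' 0 1 2 3 4 5 6 7 8 9 | format_columns 5
```

With 100 as the argument this builds the same delimiter list as `$(printf '%99s\\n')` in the pipeline above.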

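head -c1G keeps the first GiB (2^30 bytes) of that. Note that the last line will be truncated and unterminated. You could truncate to 2^30 - 1 bytes and add the missing newline by hand, or truncate to 10^9 bytes instead, which is exactly 5 million of those 200-byte lines (head -n 5000000 would also make it a standard/portable command).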
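The sizes involved can be checked with plain shell arithmetic. This is only a worked calculation under the line layout described above (100 digits, 99 spaces, one newline); the variable names are chosen for this illustration.

```shell
#!/bin/sh
# Each output line holds 100 digits, 99 spaces and 1 newline.
line_bytes=$((100 + 99 + 1))

gib=$((1 << 30))      # what head -c1G keeps
gb=1000000000         # the decimal alternative, 10^9

echo "bytes per line:         $line_bytes"
echo "whole lines in 10^9:    $((gb / line_bytes))"
echo "whole lines in 2^30:    $((gib / line_bytes))"
echo "leftover bytes in 2^30: $((gib % line_bytes))"
```

The non-zero remainder for 2^30 is the truncated last line; 10^9 divides evenly, so cutting by line count leaves every line complete.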

These timings (obtained with zsh on a quad-core system) give an indication of where the CPU time is spent:

LC_ALL=C tr '\0-\377'  < /dev/urandom  0.61s user 31.28s system 99% cpu 31.904 total
tr -d x  1.00s user 0.27s system 3% cpu 31.903 total
fold -w 1  14.93s user 0.48s system 48% cpu 31.902 total
paste -sd "$(printf '%99s\\n')" -  7.23s user 0.08s system 22% cpu 31.899 total
head -c1G > /dev/null  0.49s user 1.21s system 5% cpu 31.898 total

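The first tr is the bottleneck, with most of its time spent in the kernel (I suppose for the random number generation). The timing is roughly in line with the rate at which I can read bytes from /dev/urandom (about 19 MiB/s; here we produce 2 bytes for each 0.97 byte of /dev/urandom, at a rate of 32 MiB/s). fold seems to spend an unreasonable amount of CPU time (15 s) just to insert a newline character after every byte, but that does not affect the overall time as it runs on a different CPU in my case (adding the -b option makes it very slightly more efficient; dd cbs=1 conv=unblock seems like a better alternative).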
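The relation between the output rate and the rate demanded from /dev/urandom can be worked out from the timings above. The sketch below is only that arithmetic, done in awk; `throughput` is a hypothetical helper and the inputs are the 1 GiB and 31.904 s figures from the first timing table.

```shell
#!/bin/sh
# Figures taken from the timings above: 1 GiB of output in 31.904 s.
# (throughput is a hypothetical helper for this illustration.)
throughput() {
    awk -v mib="$1" -v secs="$2" 'BEGIN {
        out = mib / secs                  # MiB/s of formatted output
        # Each kept byte becomes 2 output bytes (digit + separator),
        # and only 250 of every 256 input bytes are kept.
        in_rate = out / 2 / (250 / 256)
        printf "output: %.1f MiB/s\n", out
        printf "needed from /dev/urandom: %.1f MiB/s\n", in_rate
    }'
}

throughput 1024 31.904
```

The required input rate comes out somewhat below the roughly 19 MiB/s the author measured for /dev/urandom alone, which is consistent with the first tr being the limiting stage.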

You can do away with head -c1G and shave off a few seconds by setting a limit on the file size in a subshell instead (limit filesize 1024m with zsh, or ulimit -f "$((1024*1024))" with most other shells, including zsh).

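That could be improved if we extracted 2 digits from each byte, but we would need a different approach for that. The above is very efficient because tr just looks up each byte in a 256-byte array. It cannot do that for 2 bytes at a time, and using something like hexdump -e '1/1 "%02u"', which computes the text representation of a byte with a more complex algorithm, would be more expensive than the random number generation itself. Still, if, as in my case, you have CPU cores with time to spare, it may still manage to shave off a few seconds: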

With:

< /dev/urandom LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -n250000000 -ve '500/1 "%02u" "\n"' |
  fold -w1 |
  paste -sd "$(printf '%99s\\n')" - > /dev/null

I get (note, however, that here it is 1,000,000,000 bytes as opposed to 1,073,741,824):

LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' < /dev/urandom  0.32s user 18.83s system 70% cpu 27.001 total
tr -d x  2.17s user 0.09s system 8% cpu 27.000 total
hexdump -n250000000 -ve '500/1 "%02u" "\n"'  26.79s user 0.17s system 99% cpu 27.000 total
fold -w1  14.42s user 0.67s system 55% cpu 27.000 total
paste -sd "$(printf '%99s\\n')" - > /dev/null  8.00s user 0.23s system 30% cpu 26.998 total

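Overall this is more expensive in CPU time, but the work is better distributed between my 4 CPU cores, so it ends up taking less wall-clock time. The bottleneck is now hexdump.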
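To make the two-digits-per-byte mapping used above concrete, here is a small sketch on three fixed bytes instead of /dev/urandom. It substitutes od plus awk for hexdump -ve '"%02u"' purely for portability (that substitution is an assumption of this example, not the original tool), and the function names are hypothetical.

```shell
#!/bin/sh
# Map bytes 0-99 and 100-199 onto 0-99, turn 200-255 into "x" and drop them.
two_digit_bytes() {
    LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' | LC_ALL=C tr -d x
}

# Stand-in for hexdump -ve '"%02u"': print every byte as two decimal digits.
# (od + awk is an assumption made here for portability, not the original tool.)
as_two_digits() {
    od -An -v -tu1 |
        awk '{ for (i = 1; i <= NF; i++) printf "%02d", $i }
             END { print "" }'
}

# Bytes 7, 107 and 207 (octal 007, 153, 317): the first two both map to 07,
# the third falls in the discarded 200-255 range.
printf '\007\153\317' | two_digit_bytes | as_two_digits
```

Here 200 of the 256 byte values are kept and each yields two digits, about 1.56 digits per input byte, compared with about 0.98 for the one-digit table.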

If we use dd instead of the line-based fold, we can actually reduce the amount of work hexdump needs to do and improve the balance of work between CPUs:

< /dev/urandom LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -ve '"%02u"' |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

(here assuming GNU dd for its iflag=fullblock and status=none), which gives:

LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' < /dev/urandom  0.32s user 15.58s system 99% cpu 15.915 total
tr -d x  1.62s user 0.16s system 11% cpu 15.914 total
hexdump -ve '"%02u"'  10.90s user 0.32s system 70% cpu 15.911 total
dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock  5.44s user 0.19s system 35% cpu 15.909 total
paste -sd "$(printf '%99s\\n')" - > /dev/null  5.50s user 0.30s system 36% cpu 15.905 total

We are back to the random number generation being the bottleneck.
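The dd step in that pipeline replaces fold -w 1. A minimal sketch of just that step, on a fixed string rather than the digit stream (`one_per_line` is a hypothetical name; only the standard cbs and conv=unblock operands are used):

```shell
#!/bin/sh
# Split a stream into one character per line, the job fold -w 1 did earlier.
# cbs=1 makes every byte its own record; conv=unblock appends a newline to
# each record (after stripping trailing spaces, which digits never contain).
one_per_line() {
    dd cbs=1 conv=unblock 2>/dev/null
}

printf '314159' | one_per_line
```

In the full pipeline, bs=50000 count=10000 additionally caps the input at 500 million digits, and iflag=fullblock makes sure short reads from the pipe do not reduce that total.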

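Now, as pointed out by @OleTange, if you have the openssl utility, you can use it to get a faster pseudo-random byte generator (especially on processors that have AES instructions).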

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom

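on my system produces far more bytes per second than /dev/urandom. (I cannot comment on how it compares as a cryptographically secure source of randomness, if that matters for your use case.)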

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2> /dev/null | 
  LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -ve '"%02u"' |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

Now gives:

openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom < /dev/zero 2>   1.13s user 0.16s system 12% cpu 10.174 total
LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]'  0.56s user 0.20s system 7% cpu 10.173 total
tr -d x  2.50s user 0.10s system 25% cpu 10.172 total
hexdump -ve '"%02u"'  9.96s user 0.19s system 99% cpu 10.172 total
dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock  4.38s user 0.20s system 45% cpu 10.171 total
paste -sd "$(printf '%99s\\n')" - > /dev/null

We are back to hexdump being the bottleneck.

As I still have CPUs to spare, I can run 3 of those hexdump commands in parallel.

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2> /dev/null | 
  LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  (hexdump -ve '"%02u"' <&3 & hexdump -ve '"%02u"' <&3 & hexdump -ve '"%02u"') 3<&0 |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

(The <&3 is needed for shells other than zsh, which redirect a command's stdin to /dev/null when it is run in the background.)

That is now down to 6.2 seconds, with my CPUs almost fully utilised.
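The parallel step works because several processes can read from the same pipe, each consuming different bytes. The sketch below illustrates only that sharing, with wc -c standing in for hexdump (an assumption of this example) and a hypothetical function name. How the bytes are divided between the readers is not predictable, but nothing is lost or read twice.

```shell
#!/bin/sh
# Three readers share one pipe through file descriptor 3. Each byte is
# consumed by exactly one of them, so the per-reader counts add up to the
# total, although the split between readers varies from run to run.
shared_readers() {
    (wc -c <&3 & wc -c <&3 & wc -c) 3<&0
}

head -c 300000 /dev/zero | shared_readers |
    awk '{ sum += $1 } END { print sum }'
```

For the digit pipeline the unpredictable interleaving is harmless, since every input byte is converted to two digits independently of its neighbours.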

Question 2

This is a partially tongue-in-cheek answer, because of the title of the question.

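When you look for "the fastest way to ...", the answer is almost always some specialized tool. This "answer" shows one such tool, just so you can experiment.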

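This is not a serious answer, because you should not look into specialized tools for jobs you only do once, or very rarely. You see, you will end up spending more time looking for tools and learning about them than actually getting things done. Shells and utilities like bash and awk are not the fastest, but you can usually write a one-liner that does the job in only a few seconds of your time. Better scripting languages like perl can also be used, although the learning curve for perl is steep, and I hesitate to recommend it for such purposes, because I have been traumatized by awful Perl projects. python, on the other hand, is slightly handicapped by its rather slow I/O; that only becomes an issue when you filter or generate gigabytes of data, though.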

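In any case, the following C89 example program (which uses POSIX.1 for a higher-accuracy clock only if available) should achieve a generation rate of about 100 MB/s (tested in Linux on a laptop with an Intel i5-4200U processor, piping the output to /dev/null), using a pretty good pseudo-random number generator. (The output should pass all the BigCrush tests except the MatrixRank test, as the code uses Xorshift64* and the exclusion method to avoid biasing the digits.)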

decimal-digits.c:

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

/* This program is licensed under the CC0 license,
       https://creativecommons.org/publicdomain/zero/1.0/
   In other words, this is dedicated to the public domain.
   There are no warranties either, so if something breaks,
   you only have yourself to blame.
*/

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace((unsigned char)*end)) /* cast avoids undefined behaviour for negative char values */
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    for (line = 0UL; line < lines; line++) {
        fputc('0' + digit(), stdout);
        for (col = 1UL; col < cols; col++) {
            fputc(' ', stdout);
            fputc('0' + digit(), stdout);
        }
        fputc('\n', stdout);

        /* Check for write errors. */
        if (ferror(stdout))
            return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

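We can make it a lot faster if we switch to a line buffer and fwrite() it once per line, instead of outputting one digit at a time. Note that we still keep the stream fully buffered, to avoid partial (non-power-of-two) writes if the output is a block device.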

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace((unsigned char)*end)) /* cast avoids undefined behaviour for negative char values */
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    char         *oneline;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    /* Allocate memory for a full line. */
    oneline = malloc((size_t)(2 * cols + 1));
    if (!oneline) {
        fprintf(stderr, "Not enough memory for %lu column buffer.\n", cols);
        return EXIT_FAILURE;
    }

    /* Set spaces and terminating newline. */
    for (col = 0; col < cols; col++)
        oneline[2*col + 1] = ' ';
    oneline[2*cols-1] = '\n';

    /* Not needed, but in case a code modification treats it as a string. */
    oneline[2*cols] = '\0';

    for (line = 0UL; line < lines; line++) {
        for (col = 0UL; col < cols; col++)
            oneline[2*col] = '0' + digit(); /* digit() returns 0..9; convert to an ASCII digit */

        if (fwrite(oneline, 2*cols, 1, stdout) != 1)
            return EXIT_FAILURE; 
    }

    /* Check for write errors. */
    if (ferror(stdout))
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

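Note: both examples were edited on 2016-11-18 to ensure a uniform distribution of digits (zero is excluded; published comparisons of the various pseudo-random number generators give more detail).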
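One simple way to sanity-check the uniformity claim on a generated file is to count how often each digit occurs. The sketch below is such a counter for the space-separated format these programs emit; `digit_histogram` is a hypothetical helper, not part of the original programs, and an even histogram is only a coarse check, not a randomness test.

```shell
#!/bin/sh
# Count how often each decimal digit occurs in space-separated input.
# (digit_histogram is a hypothetical helper, not part of the original programs.)
digit_histogram() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (d = 0; d <= 9; d++) printf "%d %d\n", d, count[d] + 0 }'
}

# Tiny fixed sample; a real check would read the generated file instead,
# for example: digit_histogram < digits.txt
printf '0 1 1 2\n3 3 3 9\n' | digit_histogram
```

On a large generated file, each of the ten counts should be close to one tenth of the total number of digits.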

Compile using e.g.

gcc -Wall -O2 decimal-digits.c -o decimal-digits

and optionally install system-wide to /usr/bin using

sudo install -o root -g root -m 0755 decimal-digits /usr/bin

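It takes the number of digits per line and the number of lines. Because 1000000000 / 100 / 2 = 5000000 (five million; total bytes divided by columns divided by 2), you can use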

./decimal-digits 100 5000000 > digits.txt

to generate the gigabyte-sized digits.txt as desired by the OP.
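Whichever generator is used, the shape of the result can be verified afterwards. The following sketch checks that every line has the expected number of single digits separated by single spaces; `check_digit_lines` is a hypothetical helper, shown here on a three-column sample rather than the real file.

```shell
#!/bin/sh
# Check that every line of stdin has exactly $1 single digits separated by
# single spaces, and report the line count.
# (check_digit_lines is a hypothetical helper for this illustration.)
check_digit_lines() {
    awk -v cols="$1" '
        NF != cols || length($0) != 2 * cols - 1 || /[^0-9 ]/ { bad++ }
        END { printf "%d lines, %d malformed\n", NR, bad + 0 }'
}

printf '1 2 3\n4 5 6\n' | check_digit_lines 3
```

For the file generated above, `check_digit_lines 100 < digits.txt` should report 5000000 lines and no malformed ones, and `wc -c < digits.txt` should print 1000000000.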

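Note that the program itself is written more with readability than efficiency in mind. My intent here is not to showcase the efficiency of the code (for that I would use POSIX.1 and low-level I/O anyway, rather than the generic C interfaces), but to let you easily see the balance between the effort spent developing a dedicated tool and its performance, compared to one-liners or short shell or awk scriptlets.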

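Using the GNU C library, calling the fputc() function for every character output incurs a very small overhead (that of an indirect function call, or of conditionals; the FILE interface is actually pretty complex and versatile, you see). On this particular Intel Core i5-4200U laptop, with the output redirected to /dev/null, the first (fputc) version takes about 11 seconds, whereas the line-at-a-time version takes just 1.3 seconds.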

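I happen to write such programs and generators often, only because I like to play with huge datasets. I'm weird that way. For example, I once wrote a program to print all finite positive IEEE-754 floating-point values into a text file, with sufficient precision to yield the exact same value when parsed. The file was a few gigabytes in size (perhaps 4G or so); there are not as many finite positive floats as one might think. I used it to compare implementations that read and parse such data.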

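For normal use cases, like the one the OP has, shell scripts, scriptlets and one-liners are the better approach: less time is spent accomplishing the overall task. (The exception is if they need a different file every day or so, or if many people need a different file; in that rare case, a dedicated tool like the one above might warrant the effort spent.)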

Answer

由于问题的标题，这在一定程度上是一个半开玩笑的答案。

当你寻找“最快的方法是……”，答案几乎总是一些专门的工具。这个“答案”展示了一个这样的工具，以便您可以进行实验。

这不是一个严肃的答案，因为您不应该为只做一次或很少做的工作寻找专门的工具。你看，你最终会花更多的时间寻找工具并学习它们，而不是实际做事。 Shell 和实用程序（例如bash和）awk不是最快的，但您通常可以编写单行只需几秒钟即可完成这项工作。也可以使用像这样更好的脚本语言perl，尽管学习曲线perl很陡，而且我犹豫是否推荐它用于此类目的，因为我已经被糟糕的 Perl 项目所创伤。python另一方面，由于 I/O 速度较慢，因此稍有缺陷；不过，只有当您过滤或生成千兆字节的数据时，这才成为问题。

无论如何，以下 C89 示例程序（仅在可用时才使用 POSIX.1 来获得更高精度的时钟）应达到约 100 MB/s 的生成速率（在配备 Intel i5-4200U 处理器的笔记本电脑上的 Linux 中测试，通过管道输出到/dev/null），使用一个相当好的伪随机数生成器。（输出应该通过所有 BigCrunch 测试，除了 MatrixRank 测试，因为代码使用异或移位64*以及避免数字偏差的排除方法。）

十进制数字.c：

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

/* This program is licensed under the CC0 license,
       https://creativecommons.org/publicdomain/zero/1.0/
   In other words, this is dedicated to the public domain.
   There are no warranties either, so if something breaks,
   you only have yourself to blame.
*/

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    for (line = 0UL; line < lines; line++) {
        fputc('0' + digit(), stdout);
        for (col = 1UL; col < cols; col++) {
            fputc(' ', stdout);
            fputc('0' + digit(), stdout);
        }
        fputc('\n', stdout);

        /* Check for write errors. */
        if (ferror(stdout))
            return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

fwrite()如果我们切换到行缓冲区，并且它一次而不是一次输出每个数字，我们可以使它更快。请注意，如果输出是块设备，我们仍然保持流完全缓冲，以避免部分（非二次幂）写入。

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    char         *oneline;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    /* Allocate memory for a full line. */
    oneline = malloc((size_t)(2 * cols + 1));
    if (!oneline) {
        fprintf(stderr, "Not enough memory for %lu column buffer.\n", cols);
        return EXIT_FAILURE;
    }

    /* Set spaces and terminating newline. */
    for (col = 0; col < cols; col++)
        oneline[2*col + 1] = ' ';
    oneline[2*cols-1] = '\n';

    /* Not needed, but in case a code modification treats it as a string. */
    oneline[2*cols] = '\0';

    for (line = 0UL; line < lines; line++) {
        for (col = 0UL; col < cols; col++)
            oneline[2*col] = (char)('0' + digit());  /* digit() returns 0-9; store the ASCII character */

        if (fwrite(oneline, 2*cols, 1, stdout) != 1)
            return EXIT_FAILURE; 
    }

    /* Check for write errors. */
    if (ferror(stdout))
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

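Note: both examples were edited on 2016-11-18 to ensure a uniform distribution of digits (zero is excluded; see e.g. here for a comparison and details on various pseudorandom number generators).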

Compile using e.g.

gcc -Wall -O2 decimal-digits.c -o decimal-digits

Optionally, install it system-wide to /usr/bin using

sudo install -o root -g root -m 0755 decimal-digits /usr/bin

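The program takes the number of digits per line and the number of lines. Because 1000000000 / 100 / 2 = 5000000 (five million; total bytes divided by columns divided by 2), you can run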

./decimal-digits 100 5000000 > digits.txt

This generates digits.txt, a gigabyte in size, as the OP asked for.

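Note that the program itself was written with readability in mind rather than efficiency. My intent here is not to showcase the efficiency of the code (for that I would use POSIX.1 and low-level I/O anyway, rather than the generic C interfaces), but to let you easily see the balance between the effort spent developing a dedicated tool and its performance, compared to one-liners and short shell or awk scriptlets.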

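With the GNU C library, calling fputc() for every output character incurs a very small overhead per call (an indirect function call, or conditionals; the FILE interface is actually pretty complex and versatile, you see). On this particular Intel Core i5-4200U laptop, with the output redirected to /dev/null, the first (fputc) version takes about 11 seconds, whereas the line-at-a-time version takes just 1.3 seconds.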

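I happen to write programs and generators like this fairly often, simply because I like to play with huge datasets. I'm weird that way. For example, I once wrote a program that prints all finite positive IEEE-754 floating-point values to a text file, with enough precision that each one yields exactly the same value when parsed back. The file was a few gigabytes in size (perhaps 4G or so); there are not as many finite positive floats as one might think. I used it to compare implementations that read and parse such data.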

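For normal use cases like the OP's, shell scripts, scriptlets, and one-liners are the better approach: less time is spent accomplishing the overall task. (The exception is when a different file is needed every day or so, or many people each need a different file; in that rare case, a dedicated tool like the one above might warrant the effort.)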

Question 3

If you have shuf available (recent GNU coreutils does), you can do this:

time shuf -r -n $((512*1024*1024)) -i 0-9 | paste -sd "$(printf '%99s\\n')" -

On my VM, this is now a bit slower than Stéphane's answer, by about a 3:4 factor.

Question 4

Here is a solution that I hope is simple to understand:

od -An -x /dev/urandom | tr -dc 0-9 | fold -w100 | awk NF=NF FS= | head -c1G

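od creates a uniform stream of hexadecimal digits from /dev/urandom.
tr gets rid of the letters, keeping only the digits 0-9.
fold ensures there are 100 digits per line.
awk inserts spaces between the digits on each line.
head truncates the output to 1 GB.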

