Why does GNU diff use so much memory?

Question

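I believe I have found the reason for this behavior.
It seems that diff always reads the entire files into memory.
Honestly, I am quite surprised by this. I did not expect that to be necessary for a line-based tool, but apparently it is.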

This is based on the following bug report: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=21665

Unfortunately I have found that diff reads the entire input files into
memory, leading to "/usr/bin/diff: memory exhausted" messages [...]

The reply was:

> Would you be open to patches that enable diffing large files by using
> mmap?

I doubt whether that would help that much, as it still needs to construct 
information about each line, and that information consumes memory too.  Doing 
this in secondary storage would be a bear.  In practice when I've run into this 
problem, I've either gotten a bigger machine or made my input lines shorter. 
Preferably the former.

And finally:

As Paul responded [...], using mmap seems
unlikely to help much, but if you write the patch and demonstrate that
it does make a difference, we'll be very interested, and I will
happily reopen the issue.

For now, I'm marking this as notabug and closing it.

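Given this, it seems that GNU diff will remain limited in how large a file it can handle, unless someone finds a way around the difficulties pointed out in the bug report or implements a diff tool that works differently.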
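The point from the bug report, that per-line information costs memory even when the file contents themselves are not held, can be illustrated with a rough sketch. The helpers `index_lines` and `index_bytes` below are hypothetical names invented for this example; they are not how GNU diff stores its per-line records. They only show that an index of one offset and one hash per line still grows linearly with the number of lines.

```python
import array
import zlib


def index_lines(path):
    """Build a per-line index without keeping the line contents.

    Hypothetical helper for illustration only; GNU diff's internal
    per-line records are different and are not reproduced here.
    """
    offsets = array.array("Q")  # byte offset at which each line starts
    hashes = array.array("L")   # CRC32 of each line's bytes
    pos = 0
    with open(path, "rb") as f:
        for line in f:          # iterates line by line, not the whole file
            offsets.append(pos)
            hashes.append(zlib.crc32(line))
            pos += len(line)
    return offsets, hashes


def index_bytes(offsets, hashes):
    """Approximate memory used by the index itself, in bytes."""
    return len(offsets) * offsets.itemsize + len(hashes) * hashes.itemsize
```

At roughly 12 to 16 bytes per line, depending on the platform, an input with 100 million short lines would need upward of a gigabyte for this index alone, before any comparison work starts. That is consistent with the remark in the report that shorter input lines, meaning more lines per byte, do not come for free.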

If someone comes up with a better or more in-depth answer, for example through a code review, I will gladly accept it.

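PS: So far I have had only moderate success with my own Python-based diff library built on a line-by-line reader. Its goal is to find differences, not to produce patchable diff files. It can read several GiB, but at some point it seems to get "out of sync", after which it reports differences for lines that are actually identical. And, of course, it is slow. If I manage to build a workable solution at some point, I will publish the source code.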
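For readers wondering what such a line-by-line approach might look like, here is a minimal sketch. It is not the tool described above. It assumes a bounded lookahead window per input, and the name `stream_diff` and the `window` parameter are made up for this example. Memory use is limited to about `window` lines per side, but that limit is also a plausible source of the "out of sync" behavior: if an inserted or deleted run is longer than the window, or a common line such as a blank line matches in the wrong place, the two streams can be realigned incorrectly.

```python
from collections import deque
from itertools import islice


def stream_diff(a_lines, b_lines, window=1000):
    """Yield ('-', line) for lines only in a and ('+', line) for lines only in b.

    Sketch under stated assumptions: at most `window` lines are buffered
    per side, so realignment is only attempted within that window.
    """
    a_iter, b_iter = iter(a_lines), iter(b_lines)
    a_buf, b_buf = deque(), deque()

    def fill(buf, it):
        # Top the buffer up to `window` lines without reading further ahead.
        buf.extend(islice(it, window - len(buf)))

    while True:
        fill(a_buf, a_iter)
        fill(b_buf, b_iter)
        if not a_buf and not b_buf:
            return
        if a_buf and b_buf and a_buf[0] == b_buf[0]:
            a_buf.popleft()
            b_buf.popleft()
            continue

        # Mismatch: find the closest pair of equal lines inside the window.
        first_in_b = {}
        for j, line in enumerate(b_buf):
            first_in_b.setdefault(line, j)
        best = None
        for i, line in enumerate(a_buf):
            j = first_in_b.get(line)
            if j is not None and (best is None or i + j < best[0] + best[1]):
                best = (i, j)

        if best is None:
            # No common line in the window: give up on one line per side.
            skip_a, skip_b = min(1, len(a_buf)), min(1, len(b_buf))
        else:
            skip_a, skip_b = best
        for _ in range(skip_a):
            yield ("-", a_buf.popleft())
        for _ in range(skip_b):
            yield ("+", b_buf.popleft())
```

Because it accepts any iterables of lines, two open file objects can be passed in directly, for example `stream_diff(open("old.txt"), open("new.txt"))`, where the file names are placeholders. A larger `window` makes wrong realignments less likely at the cost of more memory and more work per mismatch.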

Answer 1