How to find the cause of an OOM error on Linux

Answer 1

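How do you define the "cause" of an OOM situation? Is it the process using the most memory? Maybe you have a DB that always needs 3 GB of memory to run and therefore takes up the most memory on the machine. Is that the "cause" of the problem? Probably not.

The root cause is "an unexpected condition which may or may not be the sysadmin's fault."

Sometimes you can tell; for example, if you have process accounting set up (+1 to @JamesHannah) and you see 3000 httpd or sshd processes (and that is unusual), you can probably blame that daemon.

With that in mind, I present the following comment from the kernel source: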

/*
 * oom_badness - calculate a numeric value for how bad this task has been
 * @p: task struct of which task we should calculate
 * @uptime: current uptime in seconds
 *
 * The formula used is relatively simple and documented inline in the
 * function. The main rationale is that we want to select a good task
 * to kill when we run out of memory.
 *
 * Good in this context means that:
 * 1) we lose the minimum amount of work done
 * 2) we recover a large amount of memory
 * 3) we don't kill anything innocent of eating tons of memory
 * 4) we want to kill the minimum amount of processes (one)
 * 5) we try to kill the process the user expects us to kill, this
 *    algorithm has been meticulously tuned to meet the principle
 *    of least surprise ... (be careful when you change it)
 */

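"So the ideal candidate for liquidation is a recently started, non-privileged process which together with its children uses lots of memory, has been nice'd, and does no raw I/O. Something like a nohup'd parallel kernel build (which is not a bad choice, since all results are saved to disk and very little work is lost when a 'make' is terminated)."

Comment block and quote shamelessly stolen from http://linux-mm.org/OOM_Killer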
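
To see how the kernel currently ranks processes under this heuristic, one option is to read the per-process score it exposes as /proc/<pid>/oom_score. The following is a minimal sketch, not part of the original answer: the function name rank_by_oom_score and its parameters are hypothetical, and it assumes a Linux /proc with readable oom_score and comm files.

```python
import os


def rank_by_oom_score(proc_root="/proc", limit=10):
    """Return up to `limit` (score, pid, name) tuples, highest oom_score first."""
    ranked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the file was unreadable
        ranked.append((score, int(entry), name))
    # Highest score first; ties are broken by the lower pid.
    ranked.sort(key=lambda t: (-t[0], t[1]))
    return ranked[:limit]
```

Calling rank_by_oom_score() on a live system and printing the tuples shows which processes the OOM killer would currently prefer. It describes the present state only, so it cannot explain a kill that has already happened.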

Answer 2

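You can see which processes (with their PIDs) the OOM killer considered, and which one was actually killed, by running: dmesg. But I don't know how to get that into a log file.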
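
As a rough illustration of pulling the killed processes out of that output, the sketch below scans kernel log text for "Killed process" lines. It is not from the original answer: the function names are hypothetical, and the regular expression assumes the message shape "Killed process <pid> (<name>)", whose exact wording varies between kernel versions.

```python
import re
import subprocess

# Assumed message shape; adjust the pattern if your kernel words it differently.
KILLED = re.compile(r"Killed process (\d+) \((.*?)\)")


def killed_processes(log_text):
    """Return a (pid, name) pair for every 'Killed process' line in the text."""
    return [(int(m.group(1)), m.group(2)) for m in KILLED.finditer(log_text)]


def read_dmesg():
    """Return the kernel ring buffer as text (may require elevated privileges)."""
    return subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
```

killed_processes(read_dmesg()) lists the victims still present in the ring buffer. Because the function only needs text, it can also be pointed at a saved copy of the dmesg output.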

Answer 3

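This is only possible if some kind of forensic software (e.g. sysstat, psacct, or similar) was installed before the event. Otherwise, you have nothing to go on.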
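
To make the idea of collecting evidence before the event concrete, here is a minimal sketch of a periodic snapshot that records how many processes share each name and how much resident memory they hold. It is not a substitute for sysstat or psacct and is not from the original answer; the function names, the log line format, and the log path are all hypothetical, and it assumes the Name and VmRSS fields of /proc/<pid>/status.

```python
import os
import time
from collections import Counter


def snapshot(proc_root="/proc"):
    """Return (process count per name, total VmRSS in kB per name)."""
    counts, rss_kb = Counter(), Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # process exited while we were scanning
        name = fields.get("Name", "?").strip()
        counts[name] += 1
        # VmRSS looks like "  1234 kB"; kernel threads have no VmRSS line.
        rss_kb[name] += int(fields.get("VmRSS", "0 kB").split()[0])
    return counts, rss_kb


def log_snapshot(path, proc_root="/proc", top=5):
    """Append one timestamped line with the `top` names by resident memory."""
    counts, rss_kb = snapshot(proc_root)
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    parts = [
        f"{name}:n={counts[name]}:rss_kb={kb}"
        for name, kb in rss_kb.most_common(top)
    ]
    with open(path, "a") as f:
        f.write(stamp + " " + " ".join(parts) + "\n")
```

Run from cron every minute or so, log_snapshot leaves a trail in which something like a sudden jump to thousands of httpd processes would stand out after the fact.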

Answer 4

We recently ran into this on a RHEL guest running on VMware. If you are in the same situation, see the following VMware knowledge base article: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1002704
