What does "Page fault failed for pfn[0] = 0x0" in kern.log mean?

Recently the following entries have started appearing in my kern.log and syslog:

Jan 29 10:28:19 server kernel: [82515.307047] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.315021] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.322996] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.330971] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.338944] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.346923] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.354905] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.362875] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.370855] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.378837] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.386824] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.394788] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.402766] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.410765] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.418722] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.426707] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.434693] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.442670] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.450634] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.458628] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.466590] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.474561] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.482551] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.490528] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.498500] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.506492] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.514463] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.522435] Page fault failed for pfn[0] = 0x0

I don't know what they mean, but they seem to go on for a long time, make the logs grow very large, and usually leave the system unresponsive.

Could this be related to a memory failure? I haven't changed anything memory-related in a while, and until now the system had been running fine for months.

Answer 1

The message comes from this code in the AMDGPU driver:

for (i = 0; i < ttm->num_pages; i++) {
    /* FIXME: The pages cannot be touched outside the notifier_lock */
    pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
    if (unlikely(!pages[i])) {
        pr_err("Page fault failed for pfn[%lu] = 0x%llx\n",
               i, range->pfns[i]);
        r = -ENOMEM;

        goto out_free_pfns;
    }
}

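Here unlikely() is only a branch-prediction hint to the compiler. The actual test is !pages[i], which is true when hmm_device_entry_to_page() returns NULL for the i-th entry of the pages array. According to its kernel documentation, that function takes the "range used to decode the device entry value" and the "device entry value to get the corresponding struct page from", and returns a struct page pointer only if the entry is valid. In your log the entry is pfn[0] = 0x0, which has no valid flag set, so no page comes back, the driver prints the message, and the function bails out with -ENOMEM (out of memory). Basically, the GPU driver hit a memory error while looking up the pages it needed, and it reports that as being out of memory.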
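To make the control flow concrete, here is a minimal user-space sketch of the same pattern. It is not the kernel's code: the names sketch_page, sketch_entry_to_page, sketch_collect_pages, SKETCH_PFN_VALID and SKETCH_PFN_SHIFT are hypothetical, and the flag layout is an assumption chosen only for illustration. It shows why an entry of 0x0 can never decode to a page and how the loop gives up with -ENOMEM on the first such entry.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define SKETCH_PFN_VALID (1ULL << 0) /* assumed "entry is valid" flag bit */
#define SKETCH_PFN_SHIFT 8           /* assumed number of low bits used for flags */
#define SKETCH_NUM_FRAMES 16

struct sketch_page {
    uint64_t frame;
};

/* Fake page structures, indexed by frame number. */
static struct sketch_page sketch_frames[SKETCH_NUM_FRAMES];

/*
 * Decode one entry. Plays the role hmm_device_entry_to_page() has in the
 * driver: return a page pointer for a valid entry, NULL otherwise.
 * An entry of 0x0 has no flag bits set, so it always yields NULL.
 */
struct sketch_page *sketch_entry_to_page(uint64_t entry)
{
    uint64_t frame;

    if (!(entry & SKETCH_PFN_VALID))
        return NULL;

    frame = entry >> SKETCH_PFN_SHIFT;
    if (frame >= SKETCH_NUM_FRAMES)
        return NULL;

    sketch_frames[frame].frame = frame;
    return &sketch_frames[frame];
}

/*
 * Same shape as the driver loop: stop at the first entry that does not
 * decode to a page, log it, and report -ENOMEM to the caller.
 */
int sketch_collect_pages(const uint64_t *pfns, size_t n,
                         struct sketch_page **pages)
{
    size_t i;

    for (i = 0; i < n; i++) {
        pages[i] = sketch_entry_to_page(pfns[i]);
        if (!pages[i]) {
            fprintf(stderr, "Page fault failed for pfn[%zu] = 0x%llx\n",
                    i, (unsigned long long)pfns[i]);
            return -ENOMEM;
        }
    }
    return 0;
}
```

Calling sketch_collect_pages() with an array whose first element is 0 fails on index 0, which matches the shape of the logged line: the index in brackets is the position in the array, and the value after the equals sign is the raw entry that could not be decoded.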
