Why does my machine shut down and reboot when I train a neural network on 4 GPUs?

My new Linux system shuts down whenever I use all 4 GPUs for neural network training.

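nvidia-smi reports that temperature and power are within normal ranges. I have been trying to track this problem down for several weeks. I have looked through the logs, but as a Linux newcomer I can't tell exactly which entries are relevant and which are not. Here is some of the output: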

            [   11.404701] scsi 10:0:0:0: Direct-Access     Samsung  Portable SSD T3  0    PQ: 0 ANSI: 6
            [   11.407116] input: HDA Intel PCH Front Mic as /devices/pci0000:00/0000:00:1b.0/sound/card0/input6
            [   11.407165] input: HDA Intel PCH Rear Mic as /devices/pci0000:00/0000:00:1b.0/sound/card0/input7
            [   11.407199] input: HDA Intel PCH Line as /devices/pci0000:00/0000:00:1b.0/sound/card0/input8
            [   11.407226] input: HDA Intel PCH Line Out Front as /devices/pci0000:00/0000:00:1b.0/sound/card0/input9
            [   11.407254] input: HDA Intel PCH Line Out Surround as /devices/pci0000:00/0000:00:1b.0/sound/card0/input10
            [   11.407287] input: HDA Intel PCH Line Out CLFE as /devices/pci0000:00/0000:00:1b.0/sound/card0/input11
            [   11.407321] input: HDA Intel PCH Front Headphone as /devices/pci0000:00/0000:00:1b.0/sound/card0/input12
            [   11.453914] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 655360 ms ovfl timer
            [   11.453914] RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
            [   11.453915] RAPL PMU: hw unit of domain package 2^-14 Joules
            [   11.453915] RAPL PMU: hw unit of domain dram 2^-16 Joules
            [   11.787404] AVX2 version of gcm_enc/dec engaged.
            [   11.787405] AES CTR mode by8 optimization enabled
            [   11.806169] kvm: disabled by bios
            [   11.847001] EDAC MC: Ver: 3.0.0
            [   11.848827] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
            [   11.848833] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
            [   11.848837] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
            [   11.848838] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
            [   11.848841] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
            [   11.848842] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
            [   11.848845] EDAC sbridge: Seeking for: PCI ID 8086:6f60
            [   11.848848] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
            [   11.848849] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
            [   11.848852] EDAC sbridge: Seeking for: PCI ID 8086:6f71
            [   11.848853] EDAC sbridge: Seeking for: PCI ID 8086:6f71
            [   11.848855] EDAC sbridge: Seeking for: PCI ID 8086:6faa
            [   11.848857] EDAC sbridge: Seeking for: PCI ID 8086:6faa
            [   11.848859] EDAC sbridge: Seeking for: PCI ID 8086:6fab
            [   11.848861] EDAC sbridge: Seeking for: PCI ID 8086:6fab
            [   11.848863] EDAC sbridge: Seeking for: PCI ID 8086:6fac
            [   11.848865] EDAC sbridge: Seeking for: PCI ID 8086:6fac
            [   11.848867] EDAC sbridge: Seeking for: PCI ID 8086:6fad
            [   11.848869] EDAC sbridge: Seeking for: PCI ID 8086:6fad
            [   11.848871] EDAC sbridge: Seeking for: PCI ID 8086:6faf
            [   11.848873] EDAC sbridge: Seeking for: PCI ID 8086:6faf
            [   11.848875] EDAC sbridge: Seeking for: PCI ID 8086:6f68
            [   11.848877] EDAC sbridge: Seeking for: PCI ID 8086:6f68
            [   11.848878] EDAC sbridge: Seeking for: PCI ID 8086:6f79
            [   11.848881] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
            [   11.848884] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
            [   11.848887] EDAC sbridge: Seeking for: PCI ID 8086:6f6c
            [   11.848890] EDAC sbridge: Seeking for: PCI ID 8086:6f6d
            [   11.848893] EDAC sbridge: ECC is disabled. Aborting
            [   11.849390] EDAC sbridge: Couldn't find mci handler
            [   11.849915] EDAC sbridge: Failed to register device with error -19.
            [   11.870580] sd 10:0:0:0: Attached scsi generic sg4 type 0
            [   11.870957] sd 10:0:0:0: [sde] 976773168 512-byte logical blocks: (500 GB/466 GiB)
            [   11.871036] sd 10:0:0:0: [sde] Write Protect is off
            [   11.871037] sd 10:0:0:0: [sde] Mode Sense: 43 00 00 00
            [   11.871196] sd 10:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
            [   11.873856]  sde: sde1
            [   11.874574] sd 10:0:0:0: [sde] Attached SCSI disk
            [   11.875753] intel_rapl: Found RAPL domain package
            [   12.535729] Adding 7906300k swap on /dev/sda2.  Priority:-1 extents:1 across:7906300k SSFS
            [   12.541868] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
            [   12.541872] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
            [   12.541878] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
            [   12.541879] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
            [   12.541881] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
            [   12.541882] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
            [   12.541884] EDAC sbridge: Seeking for: PCI ID 8086:6f60
            [   12.541886] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
            [   12.541887] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
            [   12.541889] EDAC sbridge: Seeking for: PCI ID 8086:6f71
            [   12.541890] EDAC sbridge: Seeking for: PCI ID 8086:6f71
            [   12.541891] EDAC sbridge: Seeking for: PCI ID 8086:6faa
            [   12.541892] EDAC sbridge: Seeking for: PCI ID 8086:6faa
            [   12.541894] EDAC sbridge: Seeking for: PCI ID 8086:6fab
            [   12.541895] EDAC sbridge: Seeking for: PCI ID 8086:6fab
            [   12.541897] EDAC sbridge: Seeking for: PCI ID 8086:6fac
            [   12.541898] EDAC sbridge: Seeking for: PCI ID 8086:6fac
            [   12.541900] EDAC sbridge: Seeking for: PCI ID 8086:6fad
            [   12.541901] EDAC sbridge: Seeking for: PCI ID 8086:6fad
            [   12.541902] EDAC sbridge: Seeking for: PCI ID 8086:6faf
            [   12.541903] EDAC sbridge: Seeking for: PCI ID 8086:6faf
            [   12.541905] EDAC sbridge: Seeking for: PCI ID 8086:6f68
            [   12.541906] EDAC sbridge: Seeking for: PCI ID 8086:6f68
            [   12.541908] EDAC sbridge: Seeking for: PCI ID 8086:6f79
            [   12.541910] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
            [   12.541912] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
            [   12.541915] EDAC sbridge: Seeking for: PCI ID 8086:6f6c
            [   12.541917] EDAC sbridge: Seeking for: PCI ID 8086:6f6d
            [   12.541920] EDAC sbridge: ECC is disabled. Aborting
            [   12.542266] EDAC sbridge: Couldn't find mci handler
            [   12.542609] EDAC sbridge: Failed to register device with error -19.
            [   12.583474] audit: type=1400 audit(1512326286.816:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/ippusbxd" pid=1065 comm="apparmor_parser"
            [   12.583632] audit: type=1400 audit(1512326286.816:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/cups-browsed" pid=1063 comm="apparmor_parser"
            [   12.583857] audit: type=1400 audit(1512326286.816:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=1062 comm="apparmor_parser"
            [   12.583858] audit: type=1400 audit(1512326286.816:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=1062 comm="apparmor_parser"
            [   12.583964] audit: type=1400 audit(1512326286.816:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/tcpdump" pid=1067 comm="apparmor_parser"
            [   12.584280] audit: type=1400 audit(1512326286.816:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=1057 comm="apparmor_parser"
            [   12.584282] audit: type=1400 audit(1512326286.816:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=1057 comm="apparmor_parser"
            [   12.584282] audit: type=1400 audit(1512326286.816:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=1057 comm="apparmor_parser"
            [   12.584283] audit: type=1400 audit(1512326286.816:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=1057 comm="apparmor_parser"
            [   12.607980] NVRM: Your system is not currently configured to drive a VGA console
                           on the primary VGA device. The NVIDIA Linux graphics driver
                           requires the use of a text-mode VGA console. Use of other console
                           drivers including, but not limited to, vesafb, may result in
                           corruption and stability problems, and is not supported.
            [   12.684737] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
            [   12.901781] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
            [   12.903895] IPv6: ADDRCONF(NETDEV_UP): enp14s0: link is not ready
            [   12.948573] IPv6: ADDRCONF(NETDEV_UP): enp14s0: link is not ready
            [   13.718428] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:08.0/0000:06:00.1/sound/card3/input17
            [   13.718516] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:08.0/0000:06:00.1/sound/card3/input18
            [   13.718584] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:08.0/0000:06:00.1/sound/card3/input19
            [   13.718651] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:08.0/0000:06:00.1/sound/card3/input20
            [   13.718678] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:08.0/0000:0a:00.1/sound/card1/input25
            [   13.718865] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:08.0/0000:0a:00.1/sound/card1/input26
            [   13.721913] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:08.0/0000:0a:00.1/sound/card1/input27
            [   13.722090] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input13
            [   13.722271] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:08.0/0000:0a:00.1/sound/card1/input28
            [   13.723585] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input21
            [   13.723698] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input14
            [   13.723996] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input15
            [   13.724157] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input22
            [   13.724308] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input23
            [   13.724459] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input16
            [   13.724633] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input24
            [   14.576237] igb 0000:0e:00.0 enp14s0: igb: enp14s0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
            [   14.576913] IPv6: ADDRCONF(NETDEV_CHANGE): enp14s0: link becomes ready
            [   15.892367] nvidia-modeset: Allocated GPU:0 (GPU-bffcceff-4388-ca5f-54f1-661f60ed3f91) @ PCI:0000:05:00.0
            [   16.733705] nvidia-modeset: Allocated GPU:1 (GPU-7385a2d1-521a-231a-3552-488574e039f1) @ PCI:0000:0a:00.0
            [   17.367479] nvidia-modeset: Allocated GPU:2 (GPU-709514aa-7aaf-e268-56cf-b986961602e4) @ PCI:0000:09:00.0
            [   18.003041] nvidia-modeset: Allocated GPU:3 (GPU-61598621-29d4-8d9b-abfc-87f004019f3e) @ PCI:0000:06:00.0

I apologize for posting so much output. I suspect the problem is related to my GPUs. My power supply is rated at 1600 W, so it should be sufficient.
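The nvidia-smi readings mentioned above were checked by hand, so they may not reflect the last moments before a shutdown. A minimal sketch of how the per-GPU temperature and power could be logged continuously to disk is below. The function names, the log file name `gpu_power.log`, and the one-second interval are my own assumptions for illustration; the `--query-gpu` fields are standard nvidia-smi ones.

```python
import csv
import io
import os
import subprocess
import time

# Fields requested from nvidia-smi, in this order.
QUERY = "index,temperature.gpu,power.draw,power.limit"


def _to_float(value):
    """Return a float, or None when nvidia-smi prints something like [N/A]."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        index, temp, draw, limit = (f.strip() for f in fields)
        rows.append({
            "index": int(index),
            "temp_c": _to_float(temp),
            "power_w": _to_float(draw),
            "limit_w": _to_float(limit),
        })
    return rows


def sample():
    """Run nvidia-smi once and return its raw CSV output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def log_forever(path="gpu_power.log", interval_s=1.0):
    """Append one timestamped line per GPU and force it to disk each round."""
    with open(path, "a") as log:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for row in parse_gpu_csv(sample()):
                log.write(f"{stamp} gpu={row['index']} temp_c={row['temp_c']} "
                          f"power_w={row['power_w']} limit_w={row['limit_w']}\n")
            log.flush()
            os.fsync(log.fileno())  # so the last samples survive a sudden power-off
            time.sleep(interval_s)


# log_forever()  # uncomment to start logging before launching the training job
```

After a shutdown, the tail of the log file would show the last values recorded for each GPU. Polling at this rate probably cannot catch very short power spikes, so normal-looking readings here would not by themselves rule out the power supply.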
