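My new Linux system shuts down when I use all 4 GPUs for neural network training.
nvidia-smi reports that temperature and power draw are within normal ranges. I have been trying to track this problem down for several weeks. I have looked through the logs, but as a Linux newcomer I cannot tell which entries are relevant and which are not. Here is some of the output: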
[ 11.404701] scsi 10:0:0:0: Direct-Access Samsung Portable SSD T3 0 PQ: 0 ANSI: 6
[ 11.407116] input: HDA Intel PCH Front Mic as /devices/pci0000:00/0000:00:1b.0/sound/card0/input6
[ 11.407165] input: HDA Intel PCH Rear Mic as /devices/pci0000:00/0000:00:1b.0/sound/card0/input7
[ 11.407199] input: HDA Intel PCH Line as /devices/pci0000:00/0000:00:1b.0/sound/card0/input8
[ 11.407226] input: HDA Intel PCH Line Out Front as /devices/pci0000:00/0000:00:1b.0/sound/card0/input9
[ 11.407254] input: HDA Intel PCH Line Out Surround as /devices/pci0000:00/0000:00:1b.0/sound/card0/input10
[ 11.407287] input: HDA Intel PCH Line Out CLFE as /devices/pci0000:00/0000:00:1b.0/sound/card0/input11
[ 11.407321] input: HDA Intel PCH Front Headphone as /devices/pci0000:00/0000:00:1b.0/sound/card0/input12
[ 11.453914] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 655360 ms ovfl timer
[ 11.453914] RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
[ 11.453915] RAPL PMU: hw unit of domain package 2^-14 Joules
[ 11.453915] RAPL PMU: hw unit of domain dram 2^-16 Joules
[ 11.787404] AVX2 version of gcm_enc/dec engaged.
[ 11.787405] AES CTR mode by8 optimization enabled
[ 11.806169] kvm: disabled by bios
[ 11.847001] EDAC MC: Ver: 3.0.0
[ 11.848827] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[ 11.848833] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[ 11.848837] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[ 11.848838] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[ 11.848841] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[ 11.848842] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[ 11.848845] EDAC sbridge: Seeking for: PCI ID 8086:6f60
[ 11.848848] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[ 11.848849] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[ 11.848852] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[ 11.848853] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[ 11.848855] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[ 11.848857] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[ 11.848859] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[ 11.848861] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[ 11.848863] EDAC sbridge: Seeking for: PCI ID 8086:6fac
[ 11.848865] EDAC sbridge: Seeking for: PCI ID 8086:6fac
[ 11.848867] EDAC sbridge: Seeking for: PCI ID 8086:6fad
[ 11.848869] EDAC sbridge: Seeking for: PCI ID 8086:6fad
[ 11.848871] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[ 11.848873] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[ 11.848875] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[ 11.848877] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[ 11.848878] EDAC sbridge: Seeking for: PCI ID 8086:6f79
[ 11.848881] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
[ 11.848884] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
[ 11.848887] EDAC sbridge: Seeking for: PCI ID 8086:6f6c
[ 11.848890] EDAC sbridge: Seeking for: PCI ID 8086:6f6d
[ 11.848893] EDAC sbridge: ECC is disabled. Aborting
[ 11.849390] EDAC sbridge: Couldn't find mci handler
[ 11.849915] EDAC sbridge: Failed to register device with error -19.
[ 11.870580] sd 10:0:0:0: Attached scsi generic sg4 type 0
[ 11.870957] sd 10:0:0:0: [sde] 976773168 512-byte logical blocks: (500 GB/466 GiB)
[ 11.871036] sd 10:0:0:0: [sde] Write Protect is off
[ 11.871037] sd 10:0:0:0: [sde] Mode Sense: 43 00 00 00
[ 11.871196] sd 10:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 11.873856] sde: sde1
[ 11.874574] sd 10:0:0:0: [sde] Attached SCSI disk
[ 11.875753] intel_rapl: Found RAPL domain package
[ 12.535729] Adding 7906300k swap on /dev/sda2. Priority:-1 extents:1 across:7906300k SSFS
[ 12.541868] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[ 12.541872] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[ 12.541878] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[ 12.541879] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[ 12.541881] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[ 12.541882] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[ 12.541884] EDAC sbridge: Seeking for: PCI ID 8086:6f60
[ 12.541886] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[ 12.541887] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[ 12.541889] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[ 12.541890] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[ 12.541891] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[ 12.541892] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[ 12.541894] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[ 12.541895] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[ 12.541897] EDAC sbridge: Seeking for: PCI ID 8086:6fac
[ 12.541898] EDAC sbridge: Seeking for: PCI ID 8086:6fac
[ 12.541900] EDAC sbridge: Seeking for: PCI ID 8086:6fad
[ 12.541901] EDAC sbridge: Seeking for: PCI ID 8086:6fad
[ 12.541902] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[ 12.541903] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[ 12.541905] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[ 12.541906] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[ 12.541908] EDAC sbridge: Seeking for: PCI ID 8086:6f79
[ 12.541910] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
[ 12.541912] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
[ 12.541915] EDAC sbridge: Seeking for: PCI ID 8086:6f6c
[ 12.541917] EDAC sbridge: Seeking for: PCI ID 8086:6f6d
[ 12.541920] EDAC sbridge: ECC is disabled. Aborting
[ 12.542266] EDAC sbridge: Couldn't find mci handler
[ 12.542609] EDAC sbridge: Failed to register device with error -19.
[ 12.583474] audit: type=1400 audit(1512326286.816:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/ippusbxd" pid=1065 comm="apparmor_parser"
[ 12.583632] audit: type=1400 audit(1512326286.816:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/cups-browsed" pid=1063 comm="apparmor_parser"
[ 12.583857] audit: type=1400 audit(1512326286.816:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=1062 comm="apparmor_parser"
[ 12.583858] audit: type=1400 audit(1512326286.816:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=1062 comm="apparmor_parser"
[ 12.583964] audit: type=1400 audit(1512326286.816:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/tcpdump" pid=1067 comm="apparmor_parser"
[ 12.584280] audit: type=1400 audit(1512326286.816:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=1057 comm="apparmor_parser"
[ 12.584282] audit: type=1400 audit(1512326286.816:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=1057 comm="apparmor_parser"
[ 12.584282] audit: type=1400 audit(1512326286.816:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=1057 comm="apparmor_parser"
[ 12.584283] audit: type=1400 audit(1512326286.816:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=1057 comm="apparmor_parser"
[ 12.607980] NVRM: Your system is not currently configured to drive a VGA console
on the primary VGA device. The NVIDIA Linux graphics driver
requires the use of a text-mode VGA console. Use of other console
drivers including, but not limited to, vesafb, may result in
corruption and stability problems, and is not supported.
[ 12.684737] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
[ 12.901781] IPv6: ADDRCONF(NETDEV_UP): eno1: link is not ready
[ 12.903895] IPv6: ADDRCONF(NETDEV_UP): enp14s0: link is not ready
[ 12.948573] IPv6: ADDRCONF(NETDEV_UP): enp14s0: link is not ready
[ 13.718428] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:08.0/0000:06:00.1/sound/card3/input17
[ 13.718516] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:08.0/0000:06:00.1/sound/card3/input18
[ 13.718584] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:08.0/0000:06:00.1/sound/card3/input19
[ 13.718651] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:08.0/0000:06:00.1/sound/card3/input20
[ 13.718678] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:08.0/0000:0a:00.1/sound/card1/input25
[ 13.718865] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:08.0/0000:0a:00.1/sound/card1/input26
[ 13.721913] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:08.0/0000:0a:00.1/sound/card1/input27
[ 13.722090] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input13
[ 13.722271] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:08.0/0000:0a:00.1/sound/card1/input28
[ 13.723585] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input21
[ 13.723698] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input14
[ 13.723996] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input15
[ 13.724157] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input22
[ 13.724308] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input23
[ 13.724459] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input16
[ 13.724633] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input24
[ 14.576237] igb 0000:0e:00.0 enp14s0: igb: enp14s0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX
[ 14.576913] IPv6: ADDRCONF(NETDEV_CHANGE): enp14s0: link becomes ready
[ 15.892367] nvidia-modeset: Allocated GPU:0 (GPU-bffcceff-4388-ca5f-54f1-661f60ed3f91) @ PCI:0000:05:00.0
[ 16.733705] nvidia-modeset: Allocated GPU:1 (GPU-7385a2d1-521a-231a-3552-488574e039f1) @ PCI:0000:0a:00.0
[ 17.367479] nvidia-modeset: Allocated GPU:2 (GPU-709514aa-7aaf-e268-56cf-b986961602e4) @ PCI:0000:09:00.0
[ 18.003041] nvidia-modeset: Allocated GPU:3 (GPU-61598621-29d4-8d9b-abfc-87f004019f3e) @ PCI:0000:06:00.0
Sorry for posting so much information. I suspect the problem is related to my GPUs. My power supply is rated at 1600W, so it should be sufficient.
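
To narrow a log like this down to the lines that might matter, a minimal Python sketch along these lines could help. It only keeps timestamped kernel messages that mention a keyword. The keyword list and the name `filter_kernel_log` are assumptions made for illustration, not an authoritative set of things to look for, and the sample lines are copied from the output above.

```python
import re

# Hypothetical keyword list (an assumption); adjust it to whatever you are hunting for.
KEYWORDS = ("nvrm", "nvidia", "edac", "error", "fail", "abort")

# Matches lines such as "[ 11.806169] kvm: disabled by bios".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def filter_kernel_log(lines, keywords=KEYWORDS):
    """Return (timestamp, message) pairs whose message mentions any keyword."""
    hits = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation lines without a timestamp are skipped
        stamp = float(match.group(1))
        message = match.group(2)
        lowered = message.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((stamp, message))
    return hits


# A few lines taken from the log above.
sample = [
    "[ 11.806169] kvm: disabled by bios",
    "[ 11.848893] EDAC sbridge: ECC is disabled. Aborting",
    "[ 11.849915] EDAC sbridge: Failed to register device with error -19.",
    "[ 12.607980] NVRM: Your system is not currently configured to drive a VGA console",
    "[ 14.576237] igb 0000:0e:00.0 enp14s0: igb: enp14s0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX",
]

for stamp, message in filter_kernel_log(sample):
    print(f"{stamp:.6f} {message}")
```

The same function can be fed the lines of a saved log file instead of `sample`. A keyword match only marks a line as worth reading; it does not mean that line is the cause of the shutdown.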