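My company has about 100 Ubuntu 18.04 server machines deployed across the United States as part of our product line. They ran without any problems for the past 1-2 years, until last week. In the past 5 days, 6 machines have hit a serious error that ends with the filesystem going read-only.
I finally got physical access to one of the devices. I found the following in dmesg: EXT4-fs (dm-0): Couldn't remount RDWR because of unprocessed orphan inode list. Please umount/remount instead
Running fsck.ext4 -n /dev/sda2 (the root partition) reports multiple orphaned inodes. I'm sure fsck can repair it, but I'm more interested in what caused the problem.
I also found some kernel errors in syslog: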
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937302] BUG: unable to handle kernel paging request at ffff93cdf5ef2eaa
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937348] IP: kmem_cache_alloc+0x7a/0x1c0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937360] PGD 87d99067 P4D 87d99067 PUD 0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937383] Oops: 0000 [#3] SMP PTI
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937395] Modules linked in: ccm intel_rapl intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp kvm_intel arc4 kvm irqbypass snd_hda_codec_hdmi punit_atom_debug joydev iwlmvm snd_hda_codec_realtek intel_cstate snd_hda_codec_generic mac80211 snd_hda_intel iwlwifi snd_hda_codec snd_hda_core snd_hwdep hid_multitouch input_leds cfg80211 snd_pcm ftdi_sio lpc_ich serio_raw snd_timer usbserial btusb cdc_acm btrtl snd mei_txe shpchp mei soundcore hci_uart btbcm btqca btintel rfkill_gpio bluetooth ecdh_generic pwm_lpss_platform pwm_lpss mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937571] raid0 multipath linear hid_generic usbhid i915 crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel cryptd syscopyarea sysfillrect igb sysimgblt psmouse fb_sys_fops dca i2c_algo_bit drm ptp pps_core ahci libahci video i2c_hid hid
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937646] CPU: 0 PID: 1212 Comm: uwsgi Tainted: G D 4.15.0-151-generic #157-Ubuntu
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937657] Hardware name: Winmate Inc. IB3S/IB32S, BIOS V210 05/06/2019
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937676] RIP: 0010:kmem_cache_alloc+0x7a/0x1c0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937689] RSP: 0018:ffffb7b6c1207d58 EFLAGS: 00010286
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937703] RAX: ffff93cdf5ef2eaa RBX: 0000000000000000 RCX: 0000000000000000
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937715] RDX: 0000000000009791 RSI: 00000000014080c0 RDI: 0000440bc0024800
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937727] RBP: ffffb7b6c1207d88 R08: ffffd7b6bfc24800 R09: ffff93aaf1400c00
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937738] R10: 0000000000000010 R11: 0000000000026d00 R12: ffff93cdf5ef2eaa
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937750] R13: 00000000014080c0 R14: ffff93aafb017800 R15: ffff93aaf1405e00
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937765] FS: 00007fe86c207740(0000) GS:ffff93aaffc00000(0000) knlGS:0000000000000000
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937778] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937789] CR2: ffff93cdf5ef2eaa CR3: 00000001314ce000 CR4: 00000000001006f0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937800] Call Trace:
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937824] ? __delayacct_tsk_init+0x1e/0x40
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937844] __delayacct_tsk_init+0x1e/0x40
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937868] copy_process.part.35+0x6d3/0x1c00
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937887] ? __handle_mm_fault+0xa21/0xff0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937911] _do_fork+0xdf/0x400
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937931] ? __do_page_fault+0x2a1/0x4b0
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937951] ? get_unused_fd_flags+0x30/0x40
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937971] SyS_clone+0x19/0x20
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937990] do_syscall_64+0x73/0x130
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938009] entry_SYSCALL_64_after_hwframe+0x41/0xa6
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938025] RIP: 0033:0x7fe86a002b7c
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938036] RSP: 002b:00007fff26bfcc60 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938052] RAX: ffffffffffffffda RBX: 00007fff26bfcc60 RCX: 00007fe86a002b7c
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938063] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938075] RBP: 00007fff26bfccd0 R08: 00007fe86c207740 R09: 00007fe86a5cab40
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938086] R10: 00007fe86c207a10 R11: 0000000000000246 R12: 0000000000000000
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938098] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000001abacf8
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938113] Code: 50 08 65 4c 03 05 0f d5 1b 4d 49 83 78 10 00 4d 8b 20 0f 84 09 01 00 00 4d 85 e4 0f 84 00 01 00 00 49 63 47 20 49 8b 3f 4c 01 e0 <48> 8b 18 49 33 9f 40 01 00 00 48 89 c1 48 0f c9 4c 89 e0 48 31
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938259] RIP: kmem_cache_alloc+0x7a/0x1c0 RSP: ffffb7b6c1207d58
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938269] CR2: ffff93cdf5ef2eaa
Jul 27 12:35:09 xxxxxxx kernel: [ 5505.938284] ---[ end trace 5841e09627f12869 ]---
Jul 26 19:46:35 xxxxxxx kernel: [167923.077278] BUG: unable to handle kernel paging request at ffff994c94603766
Jul 26 19:46:35 xxxxxxx kernel: [167923.077295] IP: down_write+0x1f/0x40
Jul 26 19:46:35 xxxxxxx kernel: [167923.077298] PGD a0599067 P4D a0599067 PUD 0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077304] Oops: 0002 [#2] SMP PTI
Jul 26 19:46:35 xxxxxxx kernel: [167923.077308] Modules linked in: ccm arc4 snd_hda_codec_hdmi iwlmvm snd_hda_codec_realtek snd_hda_codec_generic mac80211 intel_rapl intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp kvm_intel joydev kvm irqbypass punit_atom_debug intel_cstate iwlwifi snd_hda_intel snd_hda_codec ftdi_sio serio_raw hid_multitouch snd_hda_core lpc_ich cfg80211 input_leds mei_txe snd_hwdep snd_pcm usbserial btusb btrtl mei snd_timer snd cdc_acm soundcore shpchp hci_uart btbcm btqca btintel bluetooth rfkill_gpio pwm_lpss_platform pwm_lpss ecdh_generic mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
Jul 26 19:46:35 xxxxxxx kernel: [167923.077360] raid0 multipath linear hid_generic usbhid i915 igb drm_kms_helper dca ahci i2c_algo_bit crct10dif_pclmul syscopyarea crc32_pclmul sysfillrect sysimgblt ghash_clmulni_intel ptp cryptd fb_sys_fops psmouse pps_core libahci drm i2c_hid video hid
Jul 26 19:46:35 xxxxxxx kernel: [167923.077381] CPU: 2 PID: 22792 Comm: uwsgi Tainted: G B D W 4.15.0-151-generic #157-Ubuntu
Jul 26 19:46:35 xxxxxxx kernel: [167923.077384] Hardware name: Winmate Inc. IB3S/IB32S, BIOS V210 05/06/2019
Jul 26 19:46:35 xxxxxxx kernel: [167923.077389] RIP: 0010:down_write+0x1f/0x40
Jul 26 19:46:35 xxxxxxx kernel: [167923.077392] RSP: 0018:ffffb4e7018cfd10 EFLAGS: 00010246
Jul 26 19:46:35 xxxxxxx kernel: [167923.077396] RAX: ffff994c94603766 RBX: ffff994c94603766 RCX: 0000000000027f57
Jul 26 19:46:35 xxxxxxx kernel: [167923.077398] RDX: ffffffff00000001 RSI: 0000000001000200 RDI: ffff994c94603766
Jul 26 19:46:35 xxxxxxx kernel: [167923.077401] RBP: ffffb4e7018cfd18 R08: ffffd4e6ffd292c0 R09: ffff996d60d7e4e0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077404] R10: 00007f220ffec000 R11: ffff996d70adde00 R12: ffff994c9460375e
Jul 26 19:46:35 xxxxxxx kernel: [167923.077407] R13: ffff996d54325ec0 R14: ffff994c9460375e R15: ffff996df104f000
Jul 26 19:46:35 xxxxxxx kernel: [167923.077410] FS: 00007f221338d740(0000) GS:ffff996dffd00000(0000) knlGS:0000000000000000
Jul 26 19:46:35 xxxxxxx kernel: [167923.077413] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 19:46:35 xxxxxxx kernel: [167923.077416] CR2: ffff994c94603766 CR3: 00000000943ba000 CR4: 00000000001006e0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077419] Call Trace:
Jul 26 19:46:35 xxxxxxx kernel: [167923.077428] anon_vma_clone+0x8f/0x1c0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077433] anon_vma_fork+0x32/0x130
Jul 26 19:46:35 xxxxxxx kernel: [167923.077440] copy_process.part.35+0xfe1/0x1c00
Jul 26 19:46:35 xxxxxxx kernel: [167923.077446] _do_fork+0xdf/0x400
Jul 26 19:46:35 xxxxxxx kernel: [167923.077454] ? __do_page_fault+0x2a1/0x4b0
Jul 26 19:46:35 xxxxxxx kernel: [167923.077460] ? get_unused_fd_flags+0x30/0x40
Jul 26 19:46:35 xxxxxxx kernel: [167923.077465] SyS_clone+0x19/0x20
Jul 26 19:46:35 xxxxxxx kernel: [167923.077471] do_syscall_64+0x73/0x130
Jul 26 19:46:35 xxxxxxx kernel: [167923.077475] entry_SYSCALL_64_after_hwframe+0x41/0xa6
Jul 26 19:46:35 xxxxxxx kernel: [167923.077479] RIP: 0033:0x7f2211188b7c
Jul 26 19:46:35 xxxxxxx kernel: [167923.077482] RSP: 002b:00007fff81411ac0 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Jul 26 19:46:35 xxxxxxx kernel: [167923.077486] RAX: ffffffffffffffda RBX: 00007fff81411ac0 RCX: 00007f2211188b7c
Jul 26 19:46:35 xxxxxxx kernel: [167923.077488] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Jul 26 19:46:35 xxxxxxx kernel: [167923.077491] RBP: 00007fff81411b30 R08: 00007f221338d740 R09: 00007f2211750b40
Jul 26 19:46:35 xxxxxxx kernel: [167923.077494] R10: 00007f221338da10 R11: 0000000000000246 R12: 0000000000000000
Jul 26 19:46:35 xxxxxxx kernel: [167923.077497] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000001735cf8
Jul 26 19:46:35 xxxxxxx kernel: [167923.077500] Code: 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 9e d7 ff ff 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 73 b5 fe ff 65 48 8b 04 25 00 5c
Jul 26 19:46:35 xxxxxxx kernel: [167923.077534] RIP: down_write+0x1f/0x40 RSP: ffffb4e7018cfd10
Jul 26 19:46:35 xxxxxxx kernel: [167923.077537] CR2: ffff994c94603766
Jul 26 19:46:35 xxxxxxx kernel: [167923.077541] ---[ end trace 4d3c04fc4bbb2b33 ]---
I can post more if needed.
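With around 100 machines, reading each syslog by hand does not scale, so a small script that summarizes the oops reports may help compare units. The sketch below is only an illustration: the function name `summarize_oopses` and the regular expressions are my own, written against the log format shown above, and have not been tested on other kernels or log layouts.

```python
import re

# Patterns written against the syslog excerpts above (an assumption: other
# kernels or logging setups may format these lines differently).
BUG_RE = re.compile(r"BUG: unable to handle kernel .*? at (\S+)")
IP_RE = re.compile(r"\] IP: (\S+)")
CPU_RE = re.compile(r"Comm: (\S+) .*? (\d+\.\d+\.\d+-\d+-\S+) #")


def summarize_oopses(log_text):
    """Return one dict per 'BUG: unable to handle kernel ...' report."""
    reports = []
    current = None
    for line in log_text.splitlines():
        match = BUG_RE.search(line)
        if match:
            # A new report starts here.
            current = {"address": match.group(1), "function": None,
                       "comm": None, "kernel": None}
            reports.append(current)
            continue
        if current is None:
            continue
        match = IP_RE.search(line)
        if match and current["function"] is None:
            # Keep only the symbol name, drop the +offset/size suffix.
            current["function"] = match.group(1).split("+")[0]
        match = CPU_RE.search(line)
        if match and current["comm"] is None:
            current["comm"] = match.group(1)
            current["kernel"] = match.group(2)
        if "---[ end trace" in line:
            current = None
    return reports


# Three lines copied from the first trace in the question.
SAMPLE_LOG = "\n".join([
    "Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937302] BUG: unable to handle "
    "kernel paging request at ffff93cdf5ef2eaa",
    "Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937348] IP: "
    "kmem_cache_alloc+0x7a/0x1c0",
    "Jul 27 12:35:09 xxxxxxx kernel: [ 5505.937646] CPU: 0 PID: 1212 Comm: "
    "uwsgi Tainted: G D 4.15.0-151-generic #157-Ubuntu",
])

if __name__ == "__main__":
    for report in summarize_oopses(SAMPLE_LOG):
        print(report)
```

Run over the syslog from each affected unit, this makes it easy to see whether the faults share a kernel release, even when the faulting function differs as it does in the two traces above.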
I also frequently see this at boot:
[ FAILED ]Failed to start host name service
See systemctl status systemd-hostnamed.service for details
...
[ FAILED] Failed to start network name resolution
See systemctl status systemd-resolved.service for details
[ OK ]Stopped network name resolution
[ FAILED] Failed to start network name resolution
See systemctl status systemd-resolved.service for details
[ OK ]Stopped network name resolution
[ FAILED] Failed to start network name resolution
See systemctl status systemd-resolved.service for details
[ OK ]Stopped network name resolution
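We have seen this all over the country, and only within the past 5 days, so I don't think it is related to hardware or environment. We haven't released any software updates for several weeks (and some of our customers ignore our software updates anyway).
Does anyone know what could be causing this and how to prevent it? Thanks!
Edit 1: Output of ls -la /boot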
total 143024
drwxr-xr-x 3 root root 4096 Jul 23 06:35 .
drwxr-xr-x 24 root root 4096 Jul 22 06:57 ..
-rw-r--r-- 1 root root 217414 Jun 18 16:49 config-4.15.0-147-generic
-rw-r--r-- 1 root root 217414 Jul 9 20:19 config-4.15.0-151-generic
drwxr-xr-x 5 root root 4096 Jul 23 06:34 grub
-rw-r--r-- 1 root root 60458100 Jul 20 20:08 initrd.img-4.15.0-147-generic
-rw-r--r-- 1 root root 60462046 Jul 23 06:35 initrd.img-4.15.0-151-generic
-rw------- 1 root root 4082393 Jun 18 16:49 System.map-4.15.0-147-generic
-rw------- 1 root root 4082629 Jul 9 20:19 System.map-4.15.0-151-generic
-rw------- 1 root root 8449696 Jun 18 18:42 vmlinuz-4.15.0-147-generic
-rw------- 1 root root 8453792 Jul 9 20:23 vmlinuz-4.15.0-151-generic
Output of free -h
total used free shared buff/cache available
Mem: 3.7G 165M 3.2G 6.7M 435M 3.4G
Swap: 0B 0B 0B
swapon -s
(no output)
Output of sysctl vm.swappiness
vm.swappiness = 60
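Edit 2:
I found a bug report related to the -151 kernel: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1938013
I also pulled out an old device and tested it thoroughly on 4.15.0-142-generic. I then updated it to -151 and was able to trigger the error by attempting a wifi connection with nmcli. After rebooting into -142, I could no longer trigger the error. I still need to do more testing on the original device and will post when that is done.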
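To find which units in the field are running the suspect kernel, a check along these lines could be run on each machine. It is a minimal sketch: `kernel_abi` and `is_suspect_kernel` are hypothetical helper names, and flagging only ABI 151 is an assumption based on the testing described above, not a statement about which other releases are or are not affected.

```python
import platform
import re

# Assumption taken from the test above: the error reproduced on
# 4.15.0-151-generic and not on 4.15.0-142-generic.
SUSPECT_ABI = 151


def kernel_abi(release):
    """Extract the ABI number from a release such as '4.15.0-151-generic'.

    Returns None for anything that is not a 4.15.0-<abi>-<flavour> string.
    """
    match = re.match(r"^4\.15\.0-(\d+)-", release)
    return int(match.group(1)) if match else None


def is_suspect_kernel(release, suspect_abi=SUSPECT_ABI):
    """True only when the release matches the suspect ABI exactly."""
    return kernel_abi(release) == suspect_abi


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints
    status = "SUSPECT" if is_suspect_kernel(running) else "ok"
    print("{}: {}".format(running, status))
```

The /boot listing in Edit 1 shows that -147 was still installed next to -151, so an affected unit may be able to boot an earlier kernel without downloading anything.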
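Answer 1
I don't have definitive confirmation, but I do have quite a few observations confirming that this is a result of the Ubuntu -151 kernel release. I could easily reproduce the problem while running -151, but after downgrading to any earlier version I could not reproduce it. An unfortunate side effect is that the damage persists. The kernel crash itself is not the direct cause of the RO filesystem; that comes from the filesystem corruption (orphaned inodes, etc.) that the crash leaves behind. This means that even after rolling back to an earlier kernel, the filesystem damage may already have been done, so a unit can still go RO after the rollback. To address this, after rolling back the kernel I also enabled automatic fsck at boot. A few months have passed and the problem appears to be resolved. Thanks to @heynnema for helping me bounce ideas around!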
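The answer does not say how the automatic fsck was enabled. One common approach on Ubuntu is to add the `fsck.mode=force` and `fsck.repair=yes` kernel parameters to `GRUB_CMDLINE_LINUX_DEFAULT` in /etc/default/grub and then run `update-grub`. The sketch below only shows the text edit, as a pure function on the file contents; `add_fsck_params` is a hypothetical helper, and this particular method is my assumption rather than what was actually deployed.

```python
import re

# Kernel parameters read by systemd-fsck; using them here is an assumption,
# since the answer does not name the mechanism it used.
FSCK_PARAMS = ("fsck.mode=force", "fsck.repair=yes")

CMDLINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"(.*)"\s*$')


def add_fsck_params(grub_text, params=FSCK_PARAMS):
    """Return grub_text with params appended to GRUB_CMDLINE_LINUX_DEFAULT.

    Parameters that are already present are not added again, so the
    function is safe to apply more than once.
    """
    lines = []
    for line in grub_text.splitlines():
        match = CMDLINE_RE.match(line)
        if match:
            args = match.group(2).split()
            for param in params:
                if param not in args:
                    args.append(param)
            line = '{}"{}"'.format(match.group(1), " ".join(args))
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative file contents, not taken from an affected unit.
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_fsck_params(sample), end="")
```

Forcing a check on every boot lengthens startup, so it is worth trying on one unit before rolling it out to the rest.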