Massive data loss in a RAID 1 array

I recently suffered a massive data loss (everything from the past 5 months) after one of my disks crashed, even though the RAID array was still healthy.

The disk was an SSD holding my root partition (ext4). It was mirrored in an mdadm RAID 1 whose other member was a fake-hardware-RAID RAID 1 made of two HDDs (see below).

  [RAID_1]
      |
  +---+---+
  |       | 
[SSD]  [RAID_1]
          |
       +--+--+
       |     |
     [HDD] [HDD]

(write-mostly was used to speed up read IO)

It's an odd configuration, but I don't think the problem lies there.

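I noticed the SSD failure because the server stopped responding over ssh, so I plugged in a monitor and saw a lot of disk errors. I tried to reboot the server, but it would not come back up because the boot partition was only on the SSD (my fault...). So I used a live USB to install the boot directory on the root partition (the RAID 1 array mounted without problems), rebooted using a USB GRUB key, and everything seemed fine.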

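However, I soon noticed that some files were in a very old state (around 30 October 2018, when the current date was 10 April 2019). It looked as if the entire root partition had reverted to the end of October.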

The log files confirm this:

# journalctl --list-boots
-19 8ee3d21dfdd8447b9b13d02b939f0a57 Tue 2018-07-24 15:45:02 CEST—Fri 2018-09-14 12:45:38 CEST
...
-10 8738ed87a849441dbeffd8571d9ebae5 Sun 2018-10-28 19:32:09 CET—Tue 2018-10-30 16:52:39 CET
 -9 b21291a3607b4b4ba42e8d99ec4b2b40 Wed 2019-04-10 17:27:55 CEST—Thu 2019-04-11 14:19:19 CEST
 -8 fc347339334a465c91d3807d2ca06ee0 Thu 2019-04-11 14:23:41 CEST—Thu 2019-04-11 14:39:21 CEST
 -7 a59cf07431844cecaefb58de81737957 Thu 2019-04-11 14:41:45 CEST—Thu 2019-04-11 15:36:56 CEST
...
  0 e29340c8edc44b79863634a790968a93 Thu 2019-05-02 18:05:29 CEST—Mon 2019-05-13 16:57:19 CEST
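
The jump between boots -10 and -9 can also be picked out mechanically. The following is a minimal sketch, assuming the `journalctl --list-boots` layout shown above (start date in the fourth field, end date in the seventh); other journalctl versions print different columns, and the function name `boot_gaps` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `journalctl --list-boots` output on stdin and
# report places where the next boot starts more than one calendar month
# after the month in which the previous boot ended.
# Assumes the column layout shown in the post above.
boot_gaps() {
    awk '
    BEGIN { d = "^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$" }
    $4 ~ d && $7 ~ d {
        split($4, s, "-"); split($7, e, "-")
        start = s[1] * 12 + s[2]          # start month as a month count
        if (seen && start - prev_end > 1)
            print "gap: " prev_date " -> " $4
        prev_end = e[1] * 12 + e[2]       # end month as a month count
        prev_date = $7
        seen = 1
    }'
}

# Usage: journalctl --list-boots | boot_gaps
```

Lines that do not carry two dates, such as the elided `...` rows, are skipped.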

# journalctl -n 500000
...
oct. 30 16:52:31 kxkm-dev systemd[1]: Hardware watchdog 'INTCAMT', version 0
oct. 30 16:52:31 kxkm-dev systemd[1]: Set hardware watchdog to 10min.
oct. 30 16:52:39 kxkm-dev systemd-shutdown[1]: Sending SIGTERM to remaining processes...
oct. 30 16:52:39 kxkm-dev systemd-journal[323]: Journal stopped
-- Reboot --
avril 10 17:27:55 kxkm-dev systemd-journald[16674]: Missed 22960 kernel messages
avril 10 17:27:55 kxkm-dev kernel: usb 3-1: new low-speed USB device number 21 using xhci_hcd
avril 10 17:27:55 kxkm-dev kernel: usb 3-1: New USB device found, idVendor=0461, idProduct=4e22
avril 10 17:27:55 kxkm-dev kernel: usb 3-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
...
# journalctl -b -9 --dmesg 
-- Logs begin at Tue 2018-07-24 15:45:02 CEST, end at Mon 2019-05-13 18:55:21 CEST. --
avril 10 17:27:55 kxkm-dev systemd-journald[16674]: Missed 22960 kernel messages
avril 10 17:27:55 kxkm-dev kernel: usb 3-1: new low-speed USB device number 21 using xhci_hcd
avril 10 17:27:55 kxkm-dev kernel: usb 3-1: New USB device found, idVendor=0461, idProduct=4e22
avril 10 17:27:55 kxkm-dev kernel: usb 3-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
avril 10 17:27:55 kxkm-dev kernel: md: md0 stopped.
avril 10 17:27:55 kxkm-dev kernel:  md126:
avril 10 17:27:55 kxkm-dev kernel: md: md126 does not have a valid v1.2 superblock, not importing!
avril 10 17:27:55 kxkm-dev kernel:  md126:
avril 10 17:27:55 kxkm-dev kernel: md: md_import_device returned -22
avril 10 17:27:55 kxkm-dev kernel: md: md0 stopped.
avril 10 17:27:55 kxkm-dev kernel: md: md126 does not have a valid v1.2 superblock, not importing!
avril 10 17:27:55 kxkm-dev kernel: md: md_import_device returned -22
avril 10 17:27:55 kxkm-dev kernel: md: md0 stopped.
avril 10 17:27:55 kxkm-dev kernel: md: md126 does not have a valid v1.2 superblock, not importing!
avril 10 17:27:55 kxkm-dev kernel: md: md_import_device returned -22
avril 10 17:27:55 kxkm-dev kernel: md: md0 stopped.
avril 10 17:27:55 kxkm-dev kernel: md: md126 does not have a valid v1.2 superblock, not importing!
avril 10 17:27:55 kxkm-dev kernel: md: md_import_device returned -22
avril 10 17:27:55 kxkm-dev kernel: md: md0 stopped.
avril 10 17:27:55 kxkm-dev kernel: md: md126 does not have a valid v1.2 superblock, not importing!
avril 10 17:27:55 kxkm-dev kernel: md: md_import_device returned -22
avril 10 17:27:55 kxkm-dev kernel: md: md0 stopped.
avril 10 17:27:55 kxkm-dev kernel: md: md0 stopped.
avril 10 17:27:55 kxkm-dev kernel:  md126:
avril 10 17:27:55 kxkm-dev kernel: md: bind<md126>
avril 10 17:27:55 kxkm-dev kernel:  md126:
avril 10 17:27:55 kxkm-dev kernel: md/raid1:md0: active with 1 out of 1 mirrors
avril 10 17:27:55 kxkm-dev kernel: created bitmap (2 pages) for device md0
avril 10 17:27:55 kxkm-dev kernel: md0: bitmap initialized from disk: read 1 pages, set 81 of 3319 bits
avril 10 17:27:55 kxkm-dev kernel: md0: detected capacity change from 0 to 222722785280
avril 10 17:27:55 kxkm-dev kernel: EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
avril 10 17:27:55 kxkm-dev kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
avril 10 17:27:55 kxkm-dev systemd[1]: systemd 232 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
avril 10 17:27:55 kxkm-dev systemd[1]: Detected architecture x86-64.
avril 10 17:27:55 kxkm-dev systemd[1]: Set hostname to <kxkm-dev>.
avril 10 17:27:55 kxkm-dev systemd[1]: [/etc/systemd/system/docker.service.d/execWithDeviceMapper.conf:4] Missing '='.
avril 10 17:27:55 kxkm-dev systemd[1]: Listening on Journal Socket.
avril 10 17:27:55 kxkm-dev systemd[1]: Listening on LVM2 poll daemon socket.
avril 10 17:27:55 kxkm-dev systemd[1]: Listening on Device-mapper event daemon FIFOs.
avril 10 17:27:55 kxkm-dev systemd[1]: Created slice User and Session Slice.
avril 10 17:27:55 kxkm-dev systemd[1]: Listening on fsck to fsckd communication Socket.
avril 10 17:27:55 kxkm-dev systemd[1]: Listening on RPCbind Server Activation Socket.
avril 10 17:27:55 kxkm-dev kernel: EXT4-fs (md0): re-mounted. Opts: errors=remount-ro
avril 10 17:27:56 kxkm-dev kernel: RPC: Registered named UNIX socket transport module.
avril 10 17:27:56 kxkm-dev kernel: RPC: Registered udp transport module.
avril 10 17:27:56 kxkm-dev kernel: RPC: Registered tcp transport module.
avril 10 17:27:56 kxkm-dev kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
avril 10 17:27:56 kxkm-dev kernel: lp: driver loaded but no devices found
avril 10 17:28:06 kxkm-dev kernel: ppdev: user-space parallel port driver
avril 10 17:28:06 kxkm-dev kernel: parport_pc 00:05: reported by Plug and Play ACPI
avril 10 17:28:06 kxkm-dev kernel: parport0: PC-style at 0x378 (0x778), irq 5 [PCSPP,TRISTATE,EPP]
avril 10 17:28:06 kxkm-dev kernel: lp0: using parport0 (interrupt-driven).
avril 10 17:28:06 kxkm-dev kernel: Loading iSCSI transport class v2.0-870.
avril 10 17:28:06 kxkm-dev kernel: iscsi: registered transport (tcp)
avril 10 17:28:06 kxkm-dev kernel: iscsi: registered transport (iser)
avril 10 17:28:06 kxkm-dev kernel: shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
avril 10 17:28:06 kxkm-dev kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
avril 10 17:28:06 kxkm-dev kernel: sd 1:0:0:0: Attached scsi generic sg1 type 0
avril 10 17:28:06 kxkm-dev kernel: sd 6:0:0:0: Attached scsi generic sg2 type 0
avril 10 17:28:06 kxkm-dev kernel: sd 7:0:0:0: Attached scsi generic sg3 type 0
avril 10 17:28:06 kxkm-dev kernel: RAPL PMU: API unit is 2^-32 Joules, 4 fixed counters, 655360 ms ovfl timer
avril 10 17:28:06 kxkm-dev kernel: RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
avril 10 17:28:06 kxkm-dev kernel: RAPL PMU: hw unit of domain package 2^-14 Joules
avril 10 17:28:06 kxkm-dev kernel: RAPL PMU: hw unit of domain dram 2^-14 Joules
avril 10 17:28:06 kxkm-dev kernel: RAPL PMU: hw unit of domain pp1-gpu 2^-14 Joules
avril 10 17:28:06 kxkm-dev kernel: snd_hda_intel 0000:00:03.0: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0: autoconfig for ALC892: line_outs=3 (0x14/0x15/0x16/0x0/0x0) type:line
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0:    speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0:    hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0:    mono: mono_out=0x0
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0:    dig-out=0x1e/0x0
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0:    inputs:
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0:      Front Mic=0x19
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0:      Rear Mic=0x18
avril 10 17:28:06 kxkm-dev kernel: snd_hda_codec_realtek hdaudioC1D0:      Line=0x1a
avril 10 17:28:06 kxkm-dev kernel: input: HDA Intel HDMI HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.0/sound/card0/input2816
avril 10 17:28:06 kxkm-dev kernel: input: HDA Intel HDMI HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.0/sound/card0/input2817
avril 10 17:28:06 kxkm-dev kernel: input: HDA Intel HDMI HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/sound/card0/input2818
avril 10 17:28:06 kxkm-dev kernel: input: HDA Digital PCBeep as /devices/pci0000:00/0000:00:1b.0/sound/card1/input2815
avril 10 17:28:06 kxkm-dev kernel: input: HDA Intel PCH Rear Mic as /devices/pci0000:00/0000:00:1b.0/sound/card1/input2819
avril 10 17:28:06 kxkm-dev kernel: input: HDA Intel PCH Line as /devices/pci0000:00/0000:00:1b.0/sound/card1/input2820
avril 10 17:28:06 kxkm-dev kernel: input: HDA Intel PCH Line Out Front as /devices/pci0000:00/0000:00:1b.0/sound/card1/input2821
avril 10 17:28:06 kxkm-dev kernel: input: HDA Intel PCH Line Out Surround as /devices/pci0000:00/0000:00:1b.0/sound/card1/input2822
avril 10 17:28:06 kxkm-dev kernel: input: HDA Intel PCH Line Out CLFE as /devices/pci0000:00/0000:00:1b.0/sound/card1/input2823
avril 10 17:28:06 kxkm-dev kernel: input: PC Speaker as /devices/platform/pcspkr/input/input2824
avril 10 17:28:06 kxkm-dev kernel: intel_rapl: Found RAPL domain package
avril 10 17:28:06 kxkm-dev kernel: intel_rapl: Found RAPL domain core
avril 10 17:28:06 kxkm-dev kernel: intel_rapl: Found RAPL domain uncore
avril 10 17:28:06 kxkm-dev kernel: intel_rapl: Found RAPL domain dram
avril 10 17:28:06 kxkm-dev kernel: iTCO_vendor_support: vendor-support=0
avril 10 17:28:06 kxkm-dev kernel: device-mapper: table: 253:0: mirror: Device lookup failure
avril 10 17:28:06 kxkm-dev kernel: device-mapper: ioctl: error adding target to table
avril 10 17:28:06 kxkm-dev kernel: iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
avril 10 17:28:06 kxkm-dev kernel: iTCO_wdt: Found a 9 Series TCO device (Version=2, TCOBASE=0x1860)
avril 10 17:28:06 kxkm-dev kernel: iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
avril 10 17:28:06 kxkm-dev kernel: device-mapper: table: 253:0: mirror: Device lookup failure
avril 10 17:28:06 kxkm-dev kernel: device-mapper: ioctl: error adding target to table
avril 10 17:28:06 kxkm-dev kernel: device-mapper: table: 253:1: mirror: Device lookup failure
avril 10 17:28:06 kxkm-dev kernel: device-mapper: ioctl: error adding target to table
avril 10 17:28:06 kxkm-dev kernel: device-mapper: table: 253:0: mirror: Device lookup failure
avril 10 17:28:06 kxkm-dev kernel: device-mapper: ioctl: error adding target to table
avril 10 17:28:06 kxkm-dev kernel: EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null)
avril 10 17:28:06 kxkm-dev kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
avril 10 17:28:06 kxkm-dev kernel: IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
avril 10 17:28:07 kxkm-dev kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
avril 10 17:28:07 kxkm-dev kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
avril 10 17:28:14 kxkm-dev kernel: fuse init (API version 7.26)
avril 10 17:28:14 kxkm-dev kernel: FS-Cache: Loaded
avril 10 17:28:14 kxkm-dev kernel: Key type dns_resolver registered
avril 10 17:28:14 kxkm-dev kernel: FS-Cache: Netfs 'cifs' registered for caching
avril 10 17:28:14 kxkm-dev kernel: Key type cifs.spnego registered
avril 10 17:28:14 kxkm-dev kernel: Key type cifs.idmap registered
avril 10 17:28:20 kxkm-dev kernel: CIFS VFS: Error connecting to socket. Aborting operation.
avril 10 17:28:20 kxkm-dev kernel: CIFS VFS: Error connecting to socket. Aborting operation.
avril 10 17:28:20 kxkm-dev kernel: CIFS VFS: cifs_mount failed w/return code = -113
avril 10 17:28:20 kxkm-dev kernel: CIFS VFS: cifs_mount failed w/return code = -113
avril 10 17:29:27 kxkm-dev kernel: Netfilter messages via NETLINK v0.30.
avril 10 17:29:27 kxkm-dev kernel: nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
avril 10 17:29:27 kxkm-dev kernel: ctnetlink v0.93: registering with nfnetlink.
avril 10 17:29:28 kxkm-dev kernel: tun: Universal TUN/TAP device driver, 1.6
avril 10 17:29:28 kxkm-dev kernel: tun: (C) 1999-2004 Max Krasnyansky <[email protected]>
avril 10 17:29:31 kxkm-dev kernel: aufs: loading out-of-tree module taints kernel.
avril 10 17:29:31 kxkm-dev kernel: aufs 4.9-20161219
avril 10 17:29:32 kxkm-dev kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
avril 10 17:29:32 kxkm-dev kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
avril 10 17:29:37 kxkm-dev kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team
avril 10 17:29:37 kxkm-dev kernel: Ebtables v2.0 registered
avril 10 17:29:51 kxkm-dev kernel: bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
avril 10 17:29:51 kxkm-dev kernel: Bridge firewalling registered
avril 10 17:29:51 kxkm-dev kernel: Initializing XFRM netlink socket

My nginx logs show the same thing:

# zmore error.log.3.gz
2018/10/28 19:27:46 [error] 2570#0: *169 connect() failed (111: Connection refused) while connecting to upstream, client: 10.2.0.242, server: , request: "GET /node/metrics HTTP/1.0", upstream: "http://[::1]:8000/metrics", host: "kxkm-dev"
2018/10/30 16:46:31 [notice] 30592#30592: using inherited sockets from "8;9;10;11;12;13;"
2019/04/10 17:30:02 [emerg] 19588#19588: BIO_new_file("/var/lib/acme/live/***/fullchain") failed (SSL: error:02001002:system library:fopen:No such file or directory:fopen('/var/lib/acme/live/***/fullchain','r') error:2006D080:BIO routines:BIO_new_file:no such file)

Here is the state of the root partition managed by mdadm:

# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Apr 18 14:22:03 2016
     Raid Level : raid1
     Array Size : 217502720 (207.43 GiB 222.72 GB)
  Used Dev Size : 217502720 (207.43 GiB 222.72 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 13 18:42:23 2019
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : debian:0
           UUID : 57cd1ad8:b3979a16:2900c1b1:7dfa7649
         Events : 16011

    Number   Major   Minor   RaidDevice State
       1       9      126        0      active sync writemostly   /dev/md/System
       3       8       34        1      active sync   /dev/sdc2

# mdadm -D /dev/md/System
/dev/md/System:
      Container : /dev/md/imsm0, member 1
     Raid Level : raid1
     Array Size : 254816256 (243.01 GiB 260.93 GB)
  Used Dev Size : 254816388 (243.01 GiB 260.93 GB)
   Raid Devices : 2
  Total Devices : 2

          State : active 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0


           UUID : 7bf5b497:78188413:e395ae40:aa31e7b9
    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb
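
The `-D` output above describes the assembled array, not each member. One way to check whether a member had fallen behind is to compare the `Events` counter that `mdadm --examine` prints for each member's superblock. The sketch below only extracts that counter from text on stdin, so it can be tried on saved output. The helper name `events_of` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the "Events" counter found in
# `mdadm --examine` (or `mdadm -D`) output read from stdin.
events_of() {
    awk -F: '/^[ \t]*Events[ \t]*:/ {
        gsub(/[ \t]/, "", $2)   # strip padding around the number
        print $2
        exit
    }'
}

# Usage (needs root; device names taken from the post above):
#   mdadm --examine /dev/sdc2      | events_of
#   mdadm --examine /dev/md/System | events_of
```

A member whose counter is well behind the other had stopped receiving the array's updates at some point.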

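OK, so this could be a filesystem issue, with ext4 wiping a lot of data while trying to get back to a usable state, but that seems like far too much, doesn't it? On the other hand, I can't see any other explanation: the RAID 1 rebuild went smoothly, and I haven't noticed any major problem with the RAID configuration.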

Any ideas?

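PS: It is mainly a development server, so there aren't many backups, and I know RAID is not a backup, but I'm still trying to understand what could have happened.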

Answer 1

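If your top-level pair really were a mirror, all the disks would be identical, which means the boot sector would be present on the SSD and on both HDDs. The fact that your boot sector was missing is an important clue. The proper way to set up 3 disks in a mirror is to put them all at the top level. Unless you want to combine different RAID levels (such as 1+0 or 0+1), there is no benefit in building one set and then using the resulting device as a member of another set. You added two levels of RAID for no reason. I have to assume you made a mistake when trying to nest the RAID levels. The HDDs held the old data that existed before you broke the array. After breaking it, you must have been working from the SSD alone.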
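
As a rough illustration of the flat layout described above, the sketch below prints (rather than runs) an `mdadm --create` command for a single three-way RAID 1, keeping `--write-mostly` on the two HDDs as in the original setup. The function name and every device path are placeholders, and `mdadm --create` overwrites existing superblocks, so this is a dry run only.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) an mdadm command that puts one
# SSD and two HDDs into a single flat three-way RAID 1.
# All device paths are placeholders supplied by the caller.
flat_mirror_cmd() {
    array=$1; ssd=$2; hdd1=$3; hdd2=$4
    # --write-mostly applies to the devices listed after it, so reads
    # are served from the SSD whenever possible.
    echo "mdadm --create $array --level=1 --raid-devices=3 $ssd --write-mostly $hdd1 $hdd2"
}

flat_mirror_cmd /dev/md0 /dev/sdX2 /dev/sdY1 /dev/sdZ1
```

With all three devices in one array, md tracks a single event counter per member and there is no second RAID layer underneath.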

If you want to use software RAID, you should start over and use ZFS.

Answer 2

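Without digging deep into your post, I can see that "fake raid" and "ssd raid" are both mistakes. Fake RAID is bound to lead to failure. As far as I know, SSD RAID does not allow TRIM.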
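
Whether discard (TRIM) requests pass through a given RAID device depends on the kernel and driver, so it may be worth checking rather than assuming. A minimal sketch: Linux exposes `queue/discard_max_bytes` in sysfs for each block device, and a value of 0 means the device does not support discard. The helper name `supports_discard` and its optional sysfs-root argument are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether a block device advertises discard
# (TRIM) support, based on queue/discard_max_bytes in sysfs.
# $1 = device name such as md0; $2 = sysfs root (defaults to /sys).
supports_discard() {
    f="${2:-/sys}/block/$1/queue/discard_max_bytes"
    [ -r "$f" ] || { echo unknown; return; }
    if [ "$(cat "$f")" -gt 0 ]; then
        echo yes
    else
        echo no
    fi
}

# Usage: supports_discard md0
```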
