Rescuing a RAID 1 from a dedicated server

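I have a dedicated server (Hetzner EX4). One day I rebooted it and it would not come back up. Support told me that one of the hard drives was faulty and booted the rescue system (Linux). My server has two 3 TB hard drives in RAID 1 (almost certainly!).

So I assume at least one drive is still usable, but I don't know how to get the data off the server. I did some research and tried partimage (and the partimage server), but because I don't understand how disks and partitions work in Linux, I got nowhere.

I can't even tell whether I'm looking at drives, partitions, or something else!

I'm not sure, but I think this may be useful: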

root@rescue /dev # cd dev/
-bash: cd: dev/: No such file or directory
root@rescue /dev # dir
1-1            cpu              full   loop4         mouse1              ptyp2  ptype  ram4    sda3  stderr  tty16  tty27  tty38  tty49  tty6   ttyp4  ttyS0    vcs2   vga_arbiter
2-1            cpu_dma_latency  fuse   loop5         net                 ptyp3  ptypf  ram5    sda4  stdin   tty17  tty28  tty39  tty5   tty60  ttyp5  ttyS1    vcs3   vhost-net
2-1.4          disk             hpet   loop6         network_latency     ptyp4  ram0   ram6    sda5  stdout  tty18  tty29  tty4   tty50  tty61  ttyp6  ttyS2    vcs4   watchdog
2-1.6          event0           input  loop7         network_throughput  ptyp5  ram1   ram7    sdb   tty     tty19  tty3   tty40  tty51  tty62  ttyp7  ttyS3    vcs5   watchdog0
autofs         event1           kmem   loop-control  null                ptyp6  ram10  ram8    sdb1  tty0    tty2   tty30  tty41  tty52  tty63  ttyp8  urandom  vcs6   xconsole
block          event2           kmsg   MAKEDEV       port                ptyp7  ram11  ram9    sdb2  tty1    tty20  tty31  tty42  tty53  tty7   ttyp9  usbmon0  vcsa   zero
bsg            event3           kvm    mapper        ppp                 ptyp8  ram12  random  sdb3  tty10   tty21  tty32  tty43  tty54  tty8   ttypa  usbmon1  vcsa1
btrfs-control  event4           log    md            psaux               ptyp9  ram13  rtc     sdb4  tty11   tty22  tty33  tty44  tty55  tty9   ttypb  usbmon2  vcsa2
bus            event5           loop0  mem           ptmx                ptypa  ram14  rtc0    sdb5  tty12   tty23  tty34  tty45  tty56  ttyp0  ttypc  usbmon3  vcsa3
char           event6           loop1  mice          pts                 ptypb  ram15  sda     sg0   tty13   tty24  tty35  tty46  tty57  ttyp1  ttypd  usbmon4  vcsa4
console        fb0              loop2  microcode     ptyp0               ptypc  ram2   sda1    sg1   tty14   tty25  tty36  tty47  tty58  ttyp2  ttype  vcs      vcsa5
core           fd               loop3  mouse0        ptyp1               ptypd  ram3   sda2    shm   tty15   tty26  tty37  tty48  tty59  ttyp3  ttypf  vcs1     vcsa6
root@rescue /dev # fdisk -l

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdb: 3000.6 GB, 3000592982016 bytes
256 heads, 63 sectors/track, 363376 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x8ab49420

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sda: 3000.6 GB, 3000592982016 bytes
256 heads, 63 sectors/track, 363376 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.
root@rescue /dev #

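Can anyone give me some advice or point me in the right direction? Maybe I'm going about this the wrong way and should take another approach, or maybe it is simply not possible :/

Update 1: First of all, thank you for all the suggestions. I tried a few things, but I'm not sure what the results mean.

You have already seen the output of fdisk -l (I'm not sure whether /dev/sda having disk identifier 0x00000000 is a clue).

I tried mounting /dev/sda1 and it worked. However, when I go into that directory, all I see is an EFI folder. Is that normal?

Also, if I try to mount /dev/sdb1, I get "mount: you must specify the filesystem type".

If I run cat /proc/mdstat I get this: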

Personalities : [raid1]
unused devices: <none>

Update 2: As Cristian Ciupitu suggested, I ran smartctl on both drives, with the following results:

sdb:

root@rescue / # smartctl -l error /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.10.36] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 242 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 242 occurred at disk power-on lifetime: 20101 hours (837 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 8c fe a2 0b  Error: UNC at LBA = 0x0ba2fe8c = 195231372

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 8c fe a2 4b 00   7d+19:32:38.593  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00   7d+19:32:38.559  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00   7d+19:32:38.559  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+19:32:38.559  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+19:32:38.559  SET FEATURES [Set transfer mode]

Error 241 occurred at disk power-on lifetime: 20101 hours (837 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 8b fe a2 0b  Error: UNC at LBA = 0x0ba2fe8b = 195231371

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 8c fe a2 4b 00   7d+19:32:35.600  READ FPDMA QUEUED
  60 00 01 8b fe a2 4b 00   7d+19:32:35.600  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00   7d+19:32:35.567  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00   7d+19:32:35.567  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+19:32:35.566  IDENTIFY DEVICE

Error 240 occurred at disk power-on lifetime: 20101 hours (837 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 8d fe a2 0b  Error: UNC at LBA = 0x0ba2fe8d = 195231373

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 8b fe a2 4b 00   7d+19:32:32.607  READ FPDMA QUEUED
  60 00 01 8c fe a2 4b 00   7d+19:32:32.606  READ FPDMA QUEUED
  60 00 01 8d fe a2 4b 00   7d+19:32:32.606  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00   7d+19:32:32.574  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00   7d+19:32:32.573  READ NATIVE MAX ADDRESS EXT

Error 239 occurred at disk power-on lifetime: 20101 hours (837 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 8a fe a2 0b  Error: UNC at LBA = 0x0ba2fe8a = 195231370

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 8d fe a2 4b 00   7d+19:32:29.563  READ FPDMA QUEUED
  60 00 01 8c fe a2 4b 00   7d+19:32:29.563  READ FPDMA QUEUED
  60 00 01 8b fe a2 4b 00   7d+19:32:29.563  READ FPDMA QUEUED
  60 00 01 8a fe a2 4b 00   7d+19:32:29.563  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00   7d+19:32:29.531  SET FEATURES [Reserved for Serial ATA]

Error 238 occurred at disk power-on lifetime: 20101 hours (837 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 8e fe a2 0b  Error: UNC at LBA = 0x0ba2fe8e = 195231374

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 8a fe a2 4b 00   7d+19:32:26.521  READ FPDMA QUEUED
  60 00 01 8b fe a2 4b 00   7d+19:32:26.521  READ FPDMA QUEUED
  60 00 01 8c fe a2 4b 00   7d+19:32:26.521  READ FPDMA QUEUED
  60 00 01 8d fe a2 4b 00   7d+19:32:26.521  READ FPDMA QUEUED
  60 00 01 8e fe a2 4b 00   7d+19:32:26.520  READ FPDMA QUEUED

sda:

root@rescue / # smartctl -t short /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.10.36] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Update 3: I ran lsblk to find out which partitions hold the data:

root@rescue / # lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   2.7T  0 disk
├─sda1   8:1    0   200M  0 part
├─sda2   8:2    0     1M  0 part
├─sda3   8:3    0   127M  0 part
├─sda4   8:4    0   2.7T  0 part /mnt
└─sda5   8:5    0 455.5K  0 part
sdb      8:16   0   2.7T  0 disk
├─sdb1   8:17   0     1M  0 part
├─sdb2   8:18   0   127M  0 part
├─sdb3   8:19   0   200M  0 part
├─sdb4   8:20   0   2.7T  0 part
└─sdb5   8:21   0 455.5K  0 part
loop0    7:0    0   1.5G  1 loop

Then I mounted sda4. I can see the file system (the C: drive in Windows), but when I go into a directory (say, "Program Files") and try to list the files, I get an I/O error:

dir: reading directory .: Input/output error

I did try to send the whole file system over FTP with ncftpput, but most files raised I/O errors.

If I try to mount sdb4, I get this error:

root@rescue / # mount /dev/sdb4 /mnt
ntfs_attr_pread_i: ntfs_pread failed: Input/output error
Failed to read vcn 0x28: Input/output error
Failed to mount '/dev/sdb4': Input/output error
NTFS is either inconsistent, or there is a hardware fault, or it's a
SoftRAID/FakeRAID hardware. In the first case run chkdsk /f on Windows
then reboot into Windows twice. The usage of the /f parameter is very
important! If the device is a SoftRAID/FakeRAID then first activate
it and mount a different device under the /dev/mapper/ directory, (e.g.
/dev/mapper/nvidia_eahaabcc1). Please see the 'dmraid' documentation
for more details.

Update 4

I tried ntfsfix, without success:

root@rescue / # ntfsfix /dev/sdb4
Mounting volume... ntfs_attr_pread_i: ntfs_pread failed: Input/output error
Failed to read vcn 0x28: Input/output error
FAILED
Attempting to correct errors...
Processing $MFT and $MFTMirr...
Reading $MFT... OK
Reading $MFTMirr... OK
Comparing $MFTMirr to $MFT... OK
Processing of $MFT and $MFTMirr completed successfully.
Setting required flags on partition... OK
Going to empty the journal ($LogFile)... OK
ntfs_attr_pread_i: ntfs_pread failed: Input/output error
Failed to read vcn 0x28: Input/output error
Remount failed: Input/output error

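Answer 1

First, check your backups. If all goes well you will not need them, but it helps a lot (emotionally) to know that your data is safe and that you can take risks without fully understanding them.

Next, find out which kind of RAID you are using. It could be hardware RAID, but it could also be software RAID such as mdadm. Unless you remember paying for hardware RAID, mdadm is likely. Confirm this and read the mdadm man page.

Next, find out which disk is broken and which one is still OK.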
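
One way to compare the two disks is the SMART error log, such as the smartctl -l error output posted in the question. The following is only a minimal sketch: the function name summarize_smart_errors is made up for illustration, and it assumes the log has the same layout as the one shown in the question. It reads a log on standard input and reports the total ATA error count and how many distinct LBAs returned an uncorrectable (UNC) read error.

```shell
#!/bin/sh
# Hypothetical helper: summarize a `smartctl -l error` log read from stdin.
summarize_smart_errors() {
    awk '
        /^ATA Error Count:/ { total = $4 }       # e.g. "ATA Error Count: 242 (...)"
        /Error: UNC at LBA/ { lbas[$NF] = 1 }    # last field is the decimal LBA
        END {
            n = 0
            for (l in lbas) n++
            printf "total_errors=%d distinct_unc_lbas=%d\n", total, n
        }'
}

# Demo on three lines copied from the log in the question.
summarize_smart_errors <<'EOF'
ATA Error Count: 242 (device log contains only the most recent five errors)
  40 51 00 8c fe a2 0b  Error: UNC at LBA = 0x0ba2fe8c = 195231372
  40 51 00 8b fe a2 0b  Error: UNC at LBA = 0x0ba2fe8b = 195231371
EOF
```

On the rescue system the function would be fed from the real tool, for example smartctl -l error /dev/sdb | summarize_smart_errors. A disk with a large error count, or one that does not answer SMART commands at all, is a candidate for the faulty one.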

Bring up the disk that is OK to get a degraded RAID 1. Check the result with cat /proc/mdstat. If you are lucky, you will get output similar to this:

cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[2](F)
      24418688 blocks [2/1] [U_]

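In this example mdadm is loaded and has recognized that the second disk has failed. If you do not get this output, check whether your kernel supports md RAID and whether the right modules are loaded. Optionally, create a new md device. From memory, the command is mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb2. (Check this! As written, the command keeps /dev/sdb2 and marks the other member as missing, so it fits a failed first drive. If the second drive is the one that failed, name the sda partition instead and leave the sdb slot missing.)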
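
Reading /proc/mdstat by eye gets harder with several arrays. As a hedged sketch (list_degraded_arrays is a made-up name, and the md1 lines in the demo are invented sample data, not from the original post), this shell function prints every array whose status field contains an underscore, which marks a missing member:

```shell
#!/bin/sh
# Hypothetical helper: print md arrays that have at least one missing member.
# Reads /proc/mdstat-formatted text from stdin.
list_degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                # remember the array name
        name != "" && /\[[U_]+\]$/ {               # status line ends in e.g. [U_]
            if (index($NF, "_") > 0) print name    # "_" marks a missing member
            name = ""
        }'
}

# Demo: md0 is taken from the example output, md1 is invented sample data.
list_degraded_arrays <<'EOF'
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[2](F)
      24418688 blocks [2/1] [U_]
md1 : active raid1 sda2[0] sdb2[1]
      1048576 blocks [2/2] [UU]
EOF
```

On a live system the call would be list_degraded_arrays < /proc/mdstat. If no array shows up at all, mdadm --examine on a member partition reports whether it carries md metadata, and mdadm --assemble --run --readonly starts an existing array with a member missing. Unlike --create, assembling does not write a fresh set of RAID metadata, so it is the safer first attempt.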

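Once you have a working /dev/mdX device, copy all data to a spare location. You probably will not need it, but you want to be sure your backups are fully up to date. Then ask your provider to swap the broken disk for a new one, and add the new disk to the RAID array.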
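
When the source device throws read errors, as in the I/O errors described in the question, a plain file copy tends to stall on the bad areas. One option is to image the device with a tool that skips and later retries unreadable sectors. The sketch below assumes GNU ddrescue is available in the rescue system, which the original post does not state. The function name print_rescue_plan and the destination /mnt/backup are hypothetical, and the function only prints the commands for review rather than running them.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a two-pass GNU ddrescue plan.
# $1 = source block device, $2 = destination directory with enough free space.
print_rescue_plan() {
    src=$1
    dest=$2
    base=$(basename "$src")
    # Pass 1: copy everything that reads cleanly, skip the slow scraping phase.
    echo "ddrescue -n $src $dest/$base.img $dest/$base.map"
    # Pass 2: direct disc access, retry the remaining bad areas up to 3 times.
    echo "ddrescue -d -r3 $src $dest/$base.img $dest/$base.map"
}

# Demo: /mnt/backup is a placeholder for wherever the spare storage is mounted.
print_rescue_plan /dev/md0 /mnt/backup
```

Both passes share the same map file, so the second pass only revisits the areas the first one could not read. The destination needs at least as much free space as the source device.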

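Caveats:

  1. This assumes that you have backups, or that you can read the manual.
  2. This assumes that the disks are partitioned as one big RAID 1. That is likely, but not guaranteed. (It is possible to split the disks into several partitions and build RAID arrays from those.) However, one big RAID 1 is the least work, so it is the default most of the time.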

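Answer 2

You need to replace the disk. Now.

To Linux, your array will appear as a single drive unless it is partitioned differently or split into separate virtual disks by the controller. You should be able to use a serial console (if your host provides one) to boot into the server's controller and see how it is configured.

But either replace the disk or be sorry. Depending on the age of the drives, the second drive may be close to failing too, and you really want the rebuild to finish before it does.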

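Answer 3

You probably want to replace the bad drive with a good one, and the array should then repair itself. However, your RAID implementation may differ, and you may need to consult the vendor's documentation for further guidance. Note that if the file system was corrupted by software rather than by the bad drive, RAID cannot protect against that (for example, a bad write that landed on both drives).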

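Answer 4

You should be able to mount the drive that is still online.

Try the command mount /dev/sda1 /mnt

If this command works, it mounts the file system at /mnt. Navigate to that folder and you should see your files and folders. Just back up the information you need and follow Jonathan's advice.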
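
As the question's first update shows, the first partition is not always the one with the data: /dev/sda1 there held only an EFI folder. A rough sketch for choosing a better candidate is to take the largest partition on the disk. The function name largest_partition is made up, the sizes in the demo are illustrative rather than taken from the post, and "largest partition holds the data" is an assumption that fits this layout but not every disk.

```shell
#!/bin/sh
# Hypothetical helper: pick the largest partition from "NAME SIZE_BYTES TYPE" lines.
largest_partition() {
    awk '
        $3 == "part" && $2 + 0 > max { max = $2 + 0; name = $1 }
        END { if (name != "") print "/dev/" name }'
}

# Demo with illustrative sizes shaped like the sda layout in the question.
largest_partition <<'EOF'
sda 3000592982016 disk
sda1 209715200 part
sda2 1048576 part
sda3 133169152 part
sda4 2999999999488 part
EOF
```

On the rescue system the input would come from lsblk -rbno NAME,SIZE,TYPE /dev/sda, and the result can then be mounted read-only with mount -o ro so that nothing is written to a disk that may be failing.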
