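Problem
I have 4 SSDs in a RAID0 array. I've recently noticed that whenever I do anything on the array's mount point, I get brief hiccups/pauses. For example, say I'm editing a file in vim. I'm typing but nothing shows up, and then a bunch of characters fill in the line all at once (a short freeze). Or I move the cursor in vim and have to wait a few seconds for it to catch up. As a last example, I'm scp'ing a file to the system, and after it finishes writing the file it sits at 99% for 5-10 minutes doing nothing.
mdstat output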
Personalities : [raid0]
md127 : active raid0 sdc1[2] sdd1[3] sdb1[1] sda1[0]
3750236160 blocks super 1.2 512k chunks
unused devices: <none>
mount output
/dev/md127 on /mnt/r0 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=4096,noquota)
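As a sanity check on the geometry shown above, the XFS stripe settings can be compared against the md layout. This is only a sketch of the arithmetic; it relies on the mount options reporting sunit and swidth in 512-byte sectors, and the variable names are illustrative.

```shell
# Values copied from the mdstat and mount output above.
chunk_kib=512         # md chunk size ("512k chunks")
members=4             # sda1, sdb1, sdc1, sdd1
sunit_sectors=1024    # XFS sunit from the mount options (512-byte sectors)
swidth_sectors=4096   # XFS swidth from the mount options (512-byte sectors)

# Convert the XFS values from 512-byte sectors to KiB.
sunit_kib=$(( sunit_sectors * 512 / 1024 ))
swidth_kib=$(( swidth_sectors * 512 / 1024 ))

echo "sunit  = ${sunit_kib} KiB (md chunk = ${chunk_kib} KiB)"
echo "swidth = ${swidth_kib} KiB (chunk x members = $(( chunk_kib * members )) KiB)"
```

Both pairs match (512 KiB and 2048 KiB), so the filesystem stripe geometry agrees with the array layout and does not look like the source of the slowdown.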
Write with dd
So I decided to run dd to see what my write speeds look like:
# dd if=/dev/zero of=foo.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 78.7336 s, 13.6 MB/s
On a server that does not have the problem (same drives, same configuration):
# dd if=/dev/zero of=foo.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.17901 s, 174 MB/s
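To put the two dd runs side by side, the throughput and slowdown can be recomputed from the byte count and elapsed times that dd printed. This is a small sketch using awk for the floating-point math; the variable names are illustrative.

```shell
# Byte count and elapsed times copied from the two dd runs above.
bytes=1073741824
slow_s=78.7336    # affected server
fast_s=6.17901    # healthy server

# dd reports MB/s in decimal megabytes (10^6 bytes).
slow_mbs=$(LC_ALL=C awk -v b="$bytes" -v s="$slow_s" 'BEGIN { printf "%.1f", b / s / 1e6 }')
fast_mbs=$(LC_ALL=C awk -v b="$bytes" -v s="$fast_s" 'BEGIN { printf "%.1f", b / s / 1e6 }')
slowdown=$(LC_ALL=C awk -v a="$slow_s" -v b="$fast_s" 'BEGIN { printf "%.1f", a / b }')

echo "affected: ${slow_mbs} MB/s, healthy: ${fast_mbs} MB/s, slowdown: ${slowdown}x"
```

The affected server takes roughly 12.7 times as long as the healthy one for the same synchronous 1 GiB write.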
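strace output of the rclone process
Below is the tail end of an strace attached to an rclone process that was copying a file from a remote location to the array. I've noticed that in every process that hangs, I see a run of <... futex resumed>) = ? lines: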
[pid 17120] read(8, <unfinished ...>
[pid 17118] <... fadvise64 resumed>) = 0
[pid 17120] <... read resumed>"", 32768) = 0
[pid 17118] futex(0xc0000c4948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 17120] futex(0xc0000c4948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17118] <... futex resumed>) = 0
[pid 17120] fadvise64(8, 14665777152, 34827, POSIX_FADV_DONTNEED <unfinished ...>
[pid 17116] <... nanosleep resumed>NULL) = 0
[pid 17118] futex(0xc0000c4948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 17120] <... fadvise64 resumed>) = 0
[pid 17116] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 17120] futex(0xc0000c4948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17118] <... futex resumed>) = 0
[pid 17120] close(8 <unfinished ...>
[pid 17118] futex(0xc0000c4948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 17120] <... close resumed>) = 0
[pid 17116] <... nanosleep resumed>NULL) = 0
[pid 17120] futex(0xc0000c4948, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 17116] getpid( <unfinished ...>
[pid 17120] <... futex resumed>) = 1
[pid 17118] <... futex resumed>) = 0
[pid 17116] <... getpid resumed>) = 17115
[pid 17118] futex(0xc000ab3148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 17126] <... futex resumed>) = 0
[pid 17120] newfstatat(AT_FDCWD, "/mnt/r0/tmp/test/foo.img", <unfinished ...>
[pid 17118] <... futex resumed>) = 1
[pid 17126] futex(0xc000ab3148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 17116] tgkill(17115, 17120, SIGURG <unfinished ...>
[pid 17118] futex(0xc0000c4948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 17116] <... tgkill resumed>) = 0
[pid 17120] <... newfstatat resumed>0xc000b222a8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
[pid 17116] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 17120] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=17115, si_uid=0} ---
[pid 17120] rt_sigreturn({mask=[]}) = -1 ENOENT (No such file or directory)
[pid 17120] newfstatat(AT_FDCWD, "/mnt/r0/tmp/test", <unfinished ...>
[pid 17116] <... nanosleep resumed>NULL) = 0
[pid 17120] <... newfstatat resumed>{st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
[pid 17116] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 17120] newfstatat(AT_FDCWD, "/mnt/r0/tmp/test/foo.img", 0xc000b22448, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
[pid 17120] renameat(AT_FDCWD, "/mnt/r0/tmp/test/foo.img.partial", AT_FDCWD, "foo.img" <unfinished ...>
[pid 17116] <... nanosleep resumed>NULL) = 0
[pid 17120] <... renameat resumed>) = 0
[pid 17116] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 17120] newfstatat(AT_FDCWD, "foo.img", {st_mode=S_IFREG|0644, st_size=14665811979, ...}, AT_SYMLINK_NOFOLLOW) = 0
[pid 17120] futex(0xc0000c4948, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 17118] <... futex resumed>) = 0
[pid 17116] <... nanosleep resumed>NULL) = 0
[pid 17118] futex(0xc000ab3148, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 17116] getpid( <unfinished ...>
[pid 17120] exit_group(0 <unfinished ...>
[pid 17118] <... futex resumed>) = 1
[pid 17120] <... exit_group resumed>) = ?
[pid 17144] <... futex resumed>) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 17115] <... futex resumed>) = ?
[pid 17149] <... futex resumed>) = ?
[pid 17145] <... futex resumed>) = ?
[pid 17143] <... futex resumed>) = ?
[pid 17130] <... futex resumed>) = ?
[pid 17128] <... futex resumed>) = ?
[pid 17127] <... futex resumed>) = ?
[pid 17126] <... futex resumed>) = ?
[pid 17125] <... futex resumed>) = ?
[pid 17124] <... futex resumed>) = ?
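Other troubleshooting steps
I have done all of the following troubleshooting:
- Checked all drives for SMART errors with smartctl -a /dev/<device>: no errors, and no attributes showing any problem (e.g. Reallocated_Sector_Ct).
- Checked for potential memory problems: cat /sys/devices/system/edac/mc/mc*/ue_count and cat /sys/devices/system/edac/mc/mc*/ce_count both report 0.
- Checked CPU load: everything is basically idle.
- Watched iostat/iotop output to see whether anything else was writing: no activity.
- Checked dmesg output for kernel errors: none.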
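The checks above could be bundled into a small script so they can be rerun on both servers and compared. This is a rough sketch rather than a complete diagnostic: the function names are hypothetical, the device list is taken from the mdstat output, and the EDAC base path is passed in as a parameter.

```shell
#!/bin/sh
# Sum the EDAC counters of one kind under a sysfs-style directory.
#   $1: base directory, e.g. /sys/devices/system/edac/mc
#   $2: "ue" (uncorrected errors) or "ce" (corrected errors)
edac_total() {
    base=$1
    kind=$2
    total=0
    for f in "$base"/mc*/"${kind}_count"; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        total=$(( total + $(cat "$f") ))
    done
    echo "$total"
}

# Print SMART data for each array member (device names from mdstat above).
smart_check() {
    for dev in sda sdb sdc sdd; do
        echo "== /dev/$dev =="
        smartctl -a "/dev/$dev"
    done
}

# Run every check in one pass; needs root for smartctl and dmesg.
run_all() {
    smart_check
    echo "EDAC ue total: $(edac_total /sys/devices/system/edac/mc ue)"
    echo "EDAC ce total: $(edac_total /sys/devices/system/edac/mc ce)"
    dmesg | tail -n 50
}
```

Sourcing the file and calling run_all as root prints everything at once; redirecting the output to a file on each server makes it easy to diff the two.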
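Question
Is there anything missing here in terms of troubleshooting? Is there anything else that can be done? When drives are slow but there are no kernel errors or SMART errors, do you just replace the drives and assume it is some kind of hardware failure, or what? If it is not the drives, what other factors could be involved besides ECC errors, CPU usage, and so on?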