So I have a RAID 6 array (monitored), mounted into my filesystem (at /mnt/mounting_point in this example). After some writes, for example:
sudo fio --name=seqwrite --filename=/test/seqwrite.0.0 --rw=write --bs=1M --numjobs=8 --time_based --runtime=60 --size=10G --ioengine=libaio --direct=1
I can only read from the filesystem: the command above fails after about 25% and nothing more gets written (before that point it runs fine at ~2000 MB/s). This doesn't only happen when writing with fio; it also happens when I let my database server run on it for a while. Eventually it stops writing and becomes effectively read-only (so queries run, but INSERTs fail).
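When it wedges like this, one quick check (nothing below is specific to my setup, just standard procps) is whether the stuck writers are sitting in uninterruptible sleep and which kernel wait channel they are parked in:

```shell
# List processes in uninterruptible sleep (STAT starting with 'D'),
# together with the kernel wait channel (WCHAN) they are blocked in.
# NR == 1 keeps the header row.
ps -eo pid,stat,wchan:32,cmd | awk 'NR == 1 || $2 ~ /^D/'
```

A pile of D-state processes whose wait channels point at md or ext4 code would implicate the block layer rather than fio itself.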
It happily reports that the RAID is working fine:
Version : 1.2
Creation Time : Thu Aug 17 09:24:45 2023
Raid Level : raid6
Array Size : 7813772288 (7.28 TiB 8.00 TB)
Used Dev Size : 3906886144 (3.64 TiB 4.00 TB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Oct 4 06:10:46 2023
State : active
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : dh-n1:0
UUID : 915e7624:bb97c5f3:b9441100:47906eac
Events : 48569
Number Major Minor RaidDevice State
0 259 3 0 active sync /dev/nvme3n1
1 259 1 1 active sync /dev/nvme2n1
4 259 2 2 active sync /dev/nvme1n1
3 259 0 3 active sync /dev/nvme0n1
dmesg | grep md0
only shows:
[ 15.248686] md/raid:md0: device nvme0n1 operational as raid disk 3
[ 15.258967] md/raid:md0: device nvme2n1 operational as raid disk 1
[ 15.271897] md/raid:md0: device nvme1n1 operational as raid disk 2
[ 15.285967] md/raid:md0: device nvme3n1 operational as raid disk 0
[ 15.353946] md/raid:md0: raid level 6 active with 4 out of 4 devices, algorithm 2
[ 15.377809] md0: detected capacity change from 0 to 15627544576
[ 34.665129] EXT4-fs (md0): mounted filesystem with ordered data mode. Quota mode: none.
fsck.ext4 -y /dev/md0
does not fail and says everything is fine.
When it fails,
ps aux
shows only half of the process list and I have to cancel it with CTRL+C.
Likewise,
top
gets stuck loading the process list and cannot be cancelled or closed.
rm /mnt/mounting_point/file.img
runs forever.
nano /mnt/mounting_point/file.txt
just shows the file's contents.
dmesg shows no other errors.
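dmesg stays quiet on its own, but the kernel can be asked to log the stacks of all blocked tasks, which should show exactly where the stuck rm/fio processes are sleeping; a sketch (assumes root and a kernel built with magic SysRq support):

```shell
# Enable all SysRq functions, then ask the kernel to dump every task in
# uninterruptible (D) state into the kernel ring buffer.
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
# The stack traces of the blocked tasks now appear in dmesg.
dmesg | tail -n 100
```

If the writers are stuck inside md/raid456 or the ext4 journal, the dumped stacks should say so.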
/proc/cmdline:
boot.img root=/dev/ram0 console=tty1 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all iommu.passthrough=1
top
says this (and then hangs):
Mem: 32029768K used, 231099124K free, 1188620K shrd, 492976K buff, 16871968K cached
CPU: 0% usr 0% sys 0% nic 91% idle 7% io 0% irq 0% sirq
[And shows some non-relevant processes]
Linux version 6.1.55
Edit:
/proc/mdstat
during the failure:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 nvme1n1[4] nvme0n1[3] nvme2n1[1] nvme3n1[0]
7813772288 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 5/30 pages [20KB], 65536KB chunk
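mdstat says [UUUU], but the md sysfs tree exposes more state that could disagree with that during the stall; a sketch of what might be worth capturing (paths assume the array is md0, as above):

```shell
# Overall array state: 'active'/'clean' is normal; 'write-pending' would
# mean writes are suspended waiting for a superblock update.
cat /sys/block/md0/md/array_state

# Any background operation in progress (idle, check, resync, recover...).
cat /sys/block/md0/md/sync_action

# Per-member device state (in_sync, faulty, ...).
for d in /sys/block/md0/md/dev-*/state; do
    printf '%s: %s\n' "$d" "$(cat "$d")"
done
```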
mount
shows:
rootfs on / type rootfs (rw)
devtmpfs on /dev type devtmpfs (rw,relatime,size=131532284k,nr_inodes=32883071,mode=755)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=52625756k,nr_inodes=819200,mode=755)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,nosuid,nodev,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=131564384k,nr_inodes=1048576)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
/dev/md0 on /mnt/k3s type ext4 (rw,relatime,stripe=256)
[...] removed the Kubernetes mounts
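Since stalled raid6 writers can end up waiting for stripe-cache entries, one knob I plan to watch during the next hang is the stripe cache (a sketch, assuming md0; the sysfs files below are specific to raid4/5/6 arrays):

```shell
# Stripe cache size in pages per member device; the md default is 256.
cat /sys/block/md0/md/stripe_cache_size

# Number of stripes currently in use; if this stays pinned at the cache
# size while writes hang, the stripe cache is a suspect.
cat /sys/block/md0/md/stripe_cache_active

# Experiment (needs root): enlarge the cache to 4096 pages
# (4096 x 4 KiB = 16 MiB per member device).
echo 4096 > /sys/block/md0/md/stripe_cache_size
```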
How long is "forever":
ps aux
: waited 30 minutes, it never produced the full list.
fio
completing: waited 30 minutes, it simply never finished.
cat test.img
(to /dev/null): finishes within seconds; it's a 1 GB file.
rm test.img
: waited 30 minutes without any output.
vi test.txt
: added the word "test" and :wq, which saves the file, but
rm test.txt
hangs (at least 30 minutes).
Does anyone have any suggestions?