So I have a RAID 6 array (monitored), mounted into my filesystem (at /mnt/mounting_point in this example). After some writes, for example:
sudo fio --name=seqwrite --filename=/test/seqwrite.0.0 --rw=write --bs=1M --numjobs=8 --time_based --runtime=60 --size=10G --ioengine=libaio --direct=1
I can only read from the filesystem: the command above fails after about 25% and nothing more gets written (before that point it runs fine at ~2000 MB/s). This doesn't only happen when writing with fio; it also happens when I let my database server run on it for a while. Eventually it stops writing and becomes effectively read-only (so queries run, but INSERTs fail).
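When it wedges like this, one quick check (nothing below is specific to my setup, just standard procps) is whether the stuck writers are sitting in uninterruptible sleep and which kernel wait channel they are parked in:

```shell
# List processes in uninterruptible sleep (STAT starting with 'D'),
# together with the kernel wait channel (WCHAN) they are blocked in.
# NR == 1 keeps the header row.
ps -eo pid,stat,wchan:32,cmd | awk 'NR == 1 || $2 ~ /^D/'
```

A pile of D-state processes whose wait channels point at md or ext4 code would implicate the block layer rather than fio itself.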
It happily reports that the RAID is working fine:
Version : 1.2
Creation Time : Thu Aug 17 09:24:45 2023
Raid Level : raid6
Array Size : 7813772288 (7.28 TiB 8.00 TB)
Used Dev Size : 3906886144 (3.64 TiB 4.00 TB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Oct 4 06:10:46 2023
State : active
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : dh-n1:0
UUID : 915e7624:bb97c5f3:b9441100:47906eac
Events : 48569
Number Major Minor RaidDevice State
0 259 3 0 active sync /dev/nvme3n1
1 259 1 1 active sync /dev/nvme2n1
4 259 2 2 active sync /dev/nvme1n1
3 259 0 3 active sync /dev/nvme0n1
dmesg | grep md0
only shows:
[ 15.248686] md/raid:md0: device nvme0n1 operational as raid disk 3
[ 15.258967] md/raid:md0: device nvme2n1 operational as raid disk 1
[ 15.271897] md/raid:md0: device nvme1n1 operational as raid disk 2
[ 15.285967] md/raid:md0: device nvme3n1 operational as raid disk 0
[ 15.353946] md/raid:md0: raid level 6 active with 4 out of 4 devices, algorithm 2
[ 15.377809] md0: detected capacity change from 0 to 15627544576
[ 34.665129] EXT4-fs (md0): mounted filesystem with ordered data mode. Quota mode: none.
fsck.ext4 -y /dev/md0
does not fail and says everything is fine.
When it fails,
ps aux
shows only half of the process list and I have to cancel it with CTRL+C.
Likewise,
top
gets stuck loading the process list and cannot be cancelled or closed.
rm /mnt/mounting_point/file.img
runs forever.
nano /mnt/mounting_point/file.txt
just shows the file's contents.
dmesg shows no other errors.
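dmesg stays quiet on its own, but the kernel can be asked to log the stacks of all blocked tasks, which should show exactly where the stuck rm/fio processes are sleeping; a sketch (assumes root and a kernel built with magic SysRq support):

```shell
# Enable all SysRq functions, then ask the kernel to dump every task in
# uninterruptible (D) state into the kernel ring buffer.
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
# The stack traces of the blocked tasks now appear in dmesg.
dmesg | tail -n 100
```

If the writers are stuck inside md/raid456 or the ext4 journal, the dumped stacks should say so.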
/proc/cmdline:
boot.img root=/dev/ram0 console=tty1 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all iommu.passthrough=1
top
says this (and then hangs):
Mem: 32029768K used, 231099124K free, 1188620K shrd, 492976K buff, 16871968K cached
CPU: 0% usr 0% sys 0% nic 91% idle 7% io 0% irq 0% sirq
[And shows some non-relevant processes]
Linux version 6.1.55
Edit:
/proc/mdstat
during the failure:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 nvme1n1[4] nvme0n1[3] nvme2n1[1] nvme3n1[0]
7813772288 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 5/30 pages [20KB], 65536KB chunk
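mdstat says [UUUU], but the md sysfs tree exposes more state that could disagree with that during the stall; a sketch of what might be worth capturing (paths assume the array is md0, as above):

```shell
# Overall array state: 'active'/'clean' is normal; 'write-pending' would
# mean writes are suspended waiting for a superblock update.
cat /sys/block/md0/md/array_state

# Any background operation in progress (idle, check, resync, recover...).
cat /sys/block/md0/md/sync_action

# Per-member device state (in_sync, faulty, ...).
for d in /sys/block/md0/md/dev-*/state; do
    printf '%s: %s\n' "$d" "$(cat "$d")"
done
```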
mount
shows:
rootfs on / type rootfs (rw)
devtmpfs on /dev type devtmpfs (rw,relatime,size=131532284k,nr_inodes=32883071,mode=755)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=52625756k,nr_inodes=819200,mode=755)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,nosuid,nodev,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=131564384k,nr_inodes=1048576)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
/dev/md0 on /mnt/k3s type ext4 (rw,relatime,stripe=256)
[...] removed the Kubernetes mounts
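Since stalled raid6 writers can end up waiting for stripe-cache entries, one knob I plan to watch during the next hang is the stripe cache (a sketch, assuming md0; the sysfs files below are specific to raid4/5/6 arrays):

```shell
# Stripe cache size in pages per member device; the md default is 256.
cat /sys/block/md0/md/stripe_cache_size

# Number of stripes currently in use; if this stays pinned at the cache
# size while writes hang, the stripe cache is a suspect.
cat /sys/block/md0/md/stripe_cache_active

# Experiment (needs root): enlarge the cache to 4096 pages
# (4096 x 4 KiB = 16 MiB per member device).
echo 4096 > /sys/block/md0/md/stripe_cache_size
```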
How long is "forever":
ps aux
: waited 30 minutes, it never produced the full list.
fio
completing: waited 30 minutes, it simply never finished.
cat test.img
(to /dev/null): finishes within seconds; it's a 1 GB file.
rm test.img
: waited 30 minutes without any output.
vi test.txt
: added the word "test" and :wq, which saves the file, but
rm test.txt
hangs (at least 30 minutes).
Does anyone have any suggestions?