Overview

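We are currently scrubbing a ZFS pool that has 12 RAID-Z1 vdevs with 12 drives each. Each vdev corresponds to one enclosure. The hardware is a Dell PowerEdge 730xd with two Dell 12Gbps SAS (LSI SAS3008) controllers and 12 Dell MD1400 enclosures. The operating system is CentOS 7.6.1810.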

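We cannot complete a scrub of the pool because, after a while, drives become FAULTED in ZFS and we have to run zpool clear to continue. The drives that become FAULTED appear to be random, and smartctl reports their SMART status as OK.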

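The only common factor is that, before a drive is marked FAULTED, the error message mpt3sas_scsih_issue_tm: timeout appears in dmesg. The controller then resets, and a large number of ZED events and read errors follow.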
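
To make that pattern easier to see, the controller-level messages can be pulled out of the journal and timed against each other. This is only a rough sketch: the function name controller_events is made up for illustration, the matched strings are simply the mpt3sas messages quoted in this question, and it assumes the short journalctl timestamp format and a log that does not cross midnight.

```python
import re

# Matches the three controller-level mpt3sas messages quoted in this question.
EVENT_RE = re.compile(
    r"(\d\d):(\d\d):(\d\d) kernel: (mpt3sas_cm\d+): "
    r"(mpt3sas_scsih_issue_tm: timeout|sending diag reset|diag reset: SUCCESS)"
)

def controller_events(lines):
    """Return (seconds_since_midnight, controller, event) for each matching line."""
    events = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            hours, minutes, seconds = (int(m.group(i)) for i in (1, 2, 3))
            stamp = hours * 3600 + minutes * 60 + seconds
            events.append((stamp, m.group(4), m.group(5)))
    return events
```

Fed the journal excerpt from this question, it shows the diag reset on mpt3sas_cm1 starting 10 seconds after the task management timeout.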

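I am currently stuck on the following questions:

  • Is this a software problem or a hardware problem?
  • If it is software, is there a configuration change or patch that would prevent the errors?
  • If it is hardware, how can I narrow the problem down?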

What we have tried

We have tried the following:

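  • Increasing the per-disk timeout value in /sys/block/*/device/timeout
  • Replacing all SAS cables
  • Upgrading all firmware
  • Running SMART background long tests on the FAULTED disks
  • Rebooting (3 times so far)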
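
For the timeout change in particular, it is worth confirming that the new value actually applies to every disk, since the sysfs setting is not persistent across reboots. A minimal sketch, assuming the standard /sys/block/<dev>/device/timeout layout; the helper names read_disk_timeouts and disks_below are hypothetical:

```python
from pathlib import Path

def read_disk_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_seconds} for every disk exposing device/timeout."""
    timeouts = {}
    for path in sorted(Path(sys_block).glob("*/device/timeout")):
        device = path.parent.parent.name  # e.g. "sdac"
        timeouts[device] = int(path.read_text().strip())
    return timeouts

def disks_below(timeouts, minimum):
    """Return the devices whose SCSI command timeout is below `minimum` seconds."""
    return sorted(dev for dev, secs in timeouts.items() if secs < minimum)
```

Calling disks_below(read_disk_timeouts(), 60) would list any disk still using a timeout shorter than 60 seconds; the threshold is an arbitrary example, not a recommendation.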

I also looked at this answer, but it did not help.

Details

This is what journalctl shows when the activity starts:

Apr 12 04:42:07 kernel: sd 5:0:18:0: attempting task abort! scmd(ffff8d36c295a4c0)
Apr 12 04:42:07 kernel: sd 5:0:4:0: attempting task abort! scmd(ffff8d3745b20540)
Apr 12 04:42:07 kernel: sd 5:0:4:0: [sdac] CDB: Read(32)
Apr 12 04:42:07 kernel: sd 5:0:4:0: [sdac] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Apr 12 04:42:07 kernel: sd 5:0:4:0: [sdac] CDB[10]: 60 2a b8 c8 60 2a b8 c8 00 00 00 00 00 00 00 08
Apr 12 04:42:07 kernel: scsi target5:0:4: handle(0x000e), sas_address(0x5000c500a6bb846e), phy(4)
Apr 12 04:42:07 kernel: scsi target5:0:4: enclosure logical id(0x5204747299f56500), slot(4) 
Apr 12 04:42:07 kernel: scsi target5:0:4: enclosure level(0x0000), connector name( 1   )
Apr 12 04:42:07 kernel: sd 5:0:18:0: [sdap] CDB: Read(32)
Apr 12 04:42:07 kernel: sd 5:0:18:0: [sdap] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Apr 12 04:42:07 kernel: sd 5:0:18:0: [sdap] CDB[10]: 60 2b f7 f8 60 2b f7 f8 00 00 00 00 00 00 00 08
Apr 12 04:42:07 kernel: scsi target5:0:18: handle(0x001d), sas_address(0x5000c500a6bb68ce), phy(5)
Apr 12 04:42:07 kernel: scsi target5:0:18: enclosure logical id(0x5204747299f5dd00), slot(0) 
Apr 12 04:42:07 kernel: scsi target5:0:18: enclosure level(0x0001), connector name( 1   )
Apr 12 04:42:37 kernel: mpt3sas_cm1: mpt3sas_scsih_issue_tm: timeout
Apr 12 04:42:37 kernel: mf:

Apr 12 04:42:37 kernel: 0100000e 
Apr 12 04:42:37 kernel: 00000100 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 

Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 00000000 
Apr 12 04:42:37 kernel: 000000b6 
Apr 12 04:42:37 kernel: 
Apr 12 04:42:47 kernel: mpt3sas_cm1: sending diag reset !!
Apr 12 04:42:48 kernel: mpt3sas_cm1: diag reset: SUCCESS
Apr 12 04:42:48 kernel: mpt3sas_cm1: LSISAS3008: FWVersion(16.00.04.00), ChipRevision(0x02), BiosVersion(18.00.00.00)
Apr 12 04:42:48 kernel: mpt3sas_cm1: Protocol=(
Apr 12 04:42:48 kernel: Initiator
Apr 12 04:42:48 kernel: ,Target
Apr 12 04:42:48 kernel: ), 
Apr 12 04:42:48 kernel: Capabilities=(
Apr 12 04:42:48 kernel: TLR
Apr 12 04:42:48 kernel: ,EEDP
Apr 12 04:42:48 kernel: ,Snapshot Buffer
Apr 12 04:42:48 kernel: ,Diag Trace Buffer
Apr 12 04:42:48 kernel: ,Task Set Full
Apr 12 04:42:48 kernel: ,NCQ
Apr 12 04:42:48 kernel: )
Apr 12 04:42:48 kernel: mpt3sas_cm1: sending port enable !!
Apr 12 04:42:55 kernel: mpt3sas_cm1: port enable: SUCCESS
Apr 12 04:42:55 kernel: mpt3sas_cm1: search for end-devices: start
Apr 12 04:42:55 kernel: scsi target5:0:0: handle(0x000a), sas_addr(0x5000c500a6bc5ef6)
Apr 12 04:42:55 kernel: scsi target5:0:0: enclosure logical id(0x5204747299f56500), slot(9)
Apr 12 04:42:55 kernel: scsi target5:0:1: handle(0x000b), sas_addr(0x5000c500a6bc6e66)
Apr 12 04:42:55 kernel: scsi target5:0:1: enclosure logical id(0x5204747299f56500), slot(5)
Apr 12 04:42:55 kernel: scsi target5:0:2: handle(0x000c), sas_addr(0x5000c500a6bbd86e)
Apr 12 04:42:55 kernel: scsi target5:0:2: enclosure logical id(0x5204747299f56500), slot(1)

The handle and enclosure lines repeat for every drive connected to the controller.
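
Since the kernel logs an enclosure logical id and slot for each target, the aborted commands can be grouped by enclosure to check whether the failures really are spread at random. A rough sketch, assuming the message formats shown in the journal excerpt; aborts_by_enclosure is a made-up name:

```python
import re
from collections import Counter

ABORT_RE = re.compile(r"sd (\d+:\d+:\d+):\d+: attempting task abort!")
ENCL_RE = re.compile(
    r"scsi target(\d+:\d+:\d+): enclosure logical id\((0x[0-9a-f]+)\), slot\((\d+)\)"
)

def aborts_by_enclosure(lines):
    """Count task aborts per (enclosure logical id, slot)."""
    location = {}  # "host:channel:target" -> (enclosure id, slot)
    aborted = []   # targets that logged "attempting task abort!"
    for line in lines:
        m = ENCL_RE.search(line)
        if m:
            location[m.group(1)] = (m.group(2), int(m.group(3)))
            continue
        m = ABORT_RE.search(line)
        if m:
            aborted.append(m.group(1))
    # Targets with no enclosure line in the input are reported as unknown.
    return Counter(location.get(target, ("unknown", -1)) for target in aborted)
```

In the excerpt from this question, the two aborts at 04:42:07 land in two different enclosures (0x5204747299f56500 slot 4 and 0x5204747299f5dd00 slot 0).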

This is followed by:

Apr 12 04:42:57 kernel: mpt3sas_cm1: search for end-devices: complete
Apr 12 04:42:57 kernel: mpt3sas_cm1: search for expanders: start
Apr 12 04:42:57 kernel:         expander present: handle(0x0009), sas_addr(0x5204747299f565ff)
Apr 12 04:42:57 kernel:         expander present: handle(0x0016), sas_addr(0x5204747299f5ddff)
Apr 12 04:42:57 kernel:         expander present: handle(0x0024), sas_addr(0x520474729a0a68ff)
Apr 12 04:42:57 kernel:         expander present: handle(0x0032), sas_addr(0x520474729a0b61ff)
Apr 12 04:42:57 kernel:         expander present: handle(0x0040), sas_addr(0x520474729a09f1ff)
Apr 12 04:42:57 kernel: mpt3sas_cm1: search for expanders: complete
Apr 12 04:42:57 kernel: sd 5:0:4:0: task abort: SUCCESS scmd(ffff8d3745b20540)
Apr 12 04:42:57 kernel: mpt3sas_cm1: removing unresponding devices: start
Apr 12 04:42:57 kernel: mpt3sas_cm1: removing unresponding devices: end-devices
Apr 12 04:42:57 kernel: mpt3sas_cm1: removing unresponding devices: expanders
Apr 12 04:42:57 kernel: mpt3sas_cm1: removing unresponding devices: complete
Apr 12 04:42:57 kernel: mpt3sas_cm1: scan devices: start
Apr 12 04:42:57 kernel: sd 5:0:18:0: task abort: SUCCESS scmd(ffff8d36c295a4c0)
Apr 12 04:42:57 kernel: scsi_io_completion: 13 callbacks suppressed
Apr 12 04:42:57 kernel: sd 5:0:18:0: [sdap] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Apr 12 04:42:57 kernel: sd 5:0:18:0: [sdap] CDB: Read(32)
Apr 12 04:42:57 kernel: sd 5:0:18:0: [sdap] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 00
Apr 12 04:42:57 kernel: sd 5:0:18:0: [sdap] CDB[10]: 60 2b f7 f8 60 2b f7 f8 00 00 00 00 00 00 00 08
Apr 12 04:42:57 kernel: blk_update_request: 13 callbacks suppressed
Apr 12 04:42:57 kernel: blk_update_request: I/O error, dev sdap, sector 1613494264
Apr 12 04:42:57 kernel: sd 5:0:21:0: attempting task abort! scmd(ffff8d3acfef0540)
Apr 12 04:42:57 kernel: sd 5:0:21:0: [sdas] CDB: Read(32)
Apr 12 04:42:57 kernel: sd 5:0:21:0: [sdas] CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 03
Apr 12 04:42:57 kernel: sd 5:0:21:0: [sdas] CDB[10]: 01 af 8c b0 01 af 8c b0 00 00 00 00 00 00 00 08
Apr 12 04:42:57 kernel: scsi target5:0:21: handle(0x0020), sas_address(0x5000c500a6bc5f82), phy(8)

This is followed by more read timeouts. Then we see a lot of zed events:

Apr 12 04:42:57 zed[137074]: eid=2425 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137076]: eid=2426 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137078]: eid=2427 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137080]: eid=2428 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc59bb-part1
Apr 12 04:42:57 zed[137082]: eid=2429 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc4337-part1
Apr 12 04:42:57 zed[137084]: eid=2430 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc4337-part1
Apr 12 04:42:57 zed[137086]: eid=2431 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc4337-part1
Apr 12 04:42:57 zed[137088]: eid=2432 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc4337-part1
Apr 12 04:42:57 zed[137090]: eid=2433 class=io pool_guid=0x3317CEBDDE480DA0
Apr 12 04:42:57 zed[137092]: eid=2434 class=io pool_guid=0x3317CEBDDE480DA0
Apr 12 04:42:57 zed[137094]: eid=2435 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc5f83-part1
Apr 12 04:42:57 zed[137096]: eid=2436 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc5f83-part1
Apr 12 04:42:57 zed[137098]: eid=2437 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc5f83-part1
Apr 12 04:42:57 zed[137100]: eid=2438 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bc5f83-part1
Apr 12 04:42:57 zed[137102]: eid=2439 class=delay pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bb68cf-part1
Apr 12 04:42:57 zed[137104]: eid=2440 class=io pool_guid=0x3317CEBDDE480DA0 vdev_path=/dev/disk/by-id/scsi-35000c500a6bb68cf-part1
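
With this many events it helps to count them per device and class instead of reading them line by line. A minimal sketch that assumes the zed log format shown here; tally_zed_events is a hypothetical helper, and events without a vdev_path are grouped under a placeholder label:

```python
import re
from collections import Counter

ZED_RE = re.compile(
    r"zed\[\d+\]: eid=\d+ class=(\S+) pool_guid=\S+(?: vdev_path=(\S+))?"
)

def tally_zed_events(lines):
    """Count zed events per (vdev_path, class)."""
    counts = Counter()
    for line in lines:
        m = ZED_RE.search(line)
        if m:
            event_class, vdev_path = m.groups()
            counts[(vdev_path or "(no vdev_path)", event_class)] += 1
    return counts
```

Sorting the result with counts.most_common() puts the noisiest device and event class first.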

After this, the drives are marked as either DEGRADED or FAULTED. Here is some additional information that may be useful.

Here is the zpool status output for two vdevs that contain FAULTED devices:

    raidz1-4                                         DEGRADED     0     0     0
      scsi-35000cca2513f78b8                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca25157bfd0                         ONLINE       0     0     0  (repairing)
      scsi-35000cca251597aa4                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2515de7b0                         FAULTED      0     0     0  too many errors
      scsi-35000cca2516278c8                         DEGRADED     0     0     0  too many errors
      scsi-35000cca25163ea64                         ONLINE       0     0     0  (repairing)
      scsi-35000cca251644664                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2516576a0                         DEGRADED     0     0     0  too many errors
      scsi-35000cca251699f68                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca25169bd10                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca25169be5c                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca25169c09c                         DEGRADED     0     0     0  too many errors  (repairing)
    raidz1-5                                         DEGRADED     0     0     0
      scsi-35000cca2516bc234                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2516bc26c                         ONLINE       0     0     0
      scsi-35000cca2516c8e78                         ONLINE       0     0     0
      scsi-35000cca2516ca244                         ONLINE       0     0     0
      scsi-35000cca2516ca334                         ONLINE       0     0     0  (repairing)
      scsi-35000cca2516ca848                         ONLINE       0     0     0  (repairing)
      scsi-35000cca2516cb3e0                         ONLINE       0     0     0  (repairing)
      scsi-35000cca2516cb420                         DEGRADED     0     0     0  too many errors  (repairing)
      scsi-35000cca2516cc210                         ONLINE       0     0     0
      scsi-35000cca2516ce390                         FAULTED      0     0     0  too many errors  (repairing)
      scsi-35000cca2516ce8e4                         ONLINE       0     0     0
      scsi-35000cca2516cf224                         ONLINE       0     0     0
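
To compare vdevs across the whole pool, the per-disk states can be summarised per RAID-Z group. A rough sketch that assumes the column layout shown in this zpool status output and vdev names starting with "raidz"; states_per_vdev is a made-up name, and the parser ignores anything it does not recognise:

```python
from collections import Counter, defaultdict

STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

def states_per_vdev(status_text):
    """Return {vdev_name: {state: disk_count}} from zpool status text."""
    summary = defaultdict(Counter)
    current = None
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[1] not in STATES:
            continue  # not a device row
        name, state = fields[0], fields[1]
        if name.startswith("raidz"):
            current = name  # a new top-level vdev starts here
        elif current is not None:
            summary[current][state] += 1
    return {vdev: dict(counts) for vdev, counts in summary.items()}
```

For the two vdevs shown here it gives 9 DEGRADED, 2 ONLINE and 1 FAULTED disks in raidz1-4, and 9 ONLINE, 2 DEGRADED and 1 FAULTED in raidz1-5.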

Here is the smartctl -a output for the FAULTED drive in raidz1-4:

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721010AL5200
Revision:             LS15
Compliance:           SPC-4
User Capacity:        9,796,820,402,176 bytes [9.79 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2515de7b0
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Apr 12 13:40:57 2019 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        50 C

Manufactured in week 02 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  5
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  889
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 30677043943309312

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       40         0       294   10394513     118610.223           0
write:         0        0         0         0     239773      43528.082           0
verify:        0        0         0         0      18403        101.563           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  96   18243                 - [-   -    -]
# 2  Background short  Completed                  96   16753                 - [-   -    -]
# 3  Reserved(7)       Completed                  64       2                 - [-   -    -]

Long (extended) Self Test duration: 64033 seconds [1067.2 minutes]
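
The drive reports "Formatted with type 2 protection", which fits the Read(32) commands in the journal, since 32-byte CDBs are what carry the expected reference tag for that protection type. The CDB bytes from the log can be decoded to see exactly which blocks timed out. A minimal sketch, assuming the READ(32) field layout from the SCSI block command set (opcode 0x7f, service action 0x0009, LBA in bytes 12 to 19, expected initial reference tag in bytes 20 to 23, transfer length in bytes 28 to 31); decode_read32 is a hypothetical helper:

```python
def decode_read32(cdb_hex):
    """Decode a READ(32) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 32 or cdb[0] != 0x7F:
        raise ValueError("not a 32-byte variable-length CDB")
    service_action = int.from_bytes(cdb[8:10], "big")
    if service_action != 0x0009:
        raise ValueError("service action is not READ(32)")
    return {
        "rdprotect": cdb[10] >> 5,  # top three bits of byte 10
        "lba": int.from_bytes(cdb[12:20], "big"),
        "expected_ref_tag": int.from_bytes(cdb[20:24], "big"),
        "transfer_blocks": int.from_bytes(cdb[28:32], "big"),
    }
```

For the sdap command in the journal (the CDB[00] and CDB[10] rows joined together), the LBA decodes to 1613494264, which is the same sector the kernel reports in its I/O error, and the transfer length is 8 blocks.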

Output of sysctl -a | grep -v 'net.' | grep -v 'kernel.sched_domain.':

abi.vsyscall32 = 1
crypto.fips_enabled = 0
debug.exception-trace = 1
debug.kprobes-optimization = 1
debug.panic_on_rcu_stall = 0
dev.hpet.max-user-freq = 64
dev.mac_hid.mouse_button2_keycode = 97
dev.mac_hid.mouse_button3_keycode = 100
dev.mac_hid.mouse_button_emulation = 0
dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000
dev.scsi.logging_level = 0
fs.aio-max-nr = 65536
fs.aio-nr = 0
fs.binfmt_misc.status = enabled
fs.dentry-state = 235028  190450  45  0 0 0
fs.dir-notify-enable = 1
fs.epoll.max_user_watches = 108185722
fs.file-max = 52384239
fs.file-nr = 2080 0 52384239
fs.inode-nr = 102807  662
fs.inode-state = 102807 662 0 0 0 0 0
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192
fs.lease-break-time = 45
fs.leases-enable = 1
fs.may_detach_mounts = 0
fs.mount-max = 100000
fs.mqueue.msg_default = 10
fs.mqueue.msg_max = 10
fs.mqueue.msgsize_default = 8192
fs.mqueue.msgsize_max = 8192
fs.mqueue.queues_max = 256
fs.nfs.nlm_grace_period = 0
fs.nfs.nlm_tcpport = 0
fs.nfs.nlm_timeout = 10
fs.nfs.nlm_udpport = 0
fs.nfs.nsm_local_state = 3
fs.nfs.nsm_use_hostnames = 0
fs.nr_open = 1048576
fs.overflowgid = 65534
fs.overflowuid = 65534
fs.pipe-max-size = 1048576
fs.pipe-user-pages-hard = 0
fs.pipe-user-pages-soft = 16384
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
fs.quota.allocated_dquots = 0
fs.quota.cache_hits = 0
fs.quota.drops = 0
fs.quota.free_dquots = 0
fs.quota.lookups = 0
fs.quota.reads = 0
fs.quota.syncs = 0
fs.quota.warnings = 1
fs.quota.writes = 0
fs.suid_dumpable = 0
fs.xfs.age_buffer_centisecs = 1500
fs.xfs.error_level = 3
fs.xfs.filestream_centisecs = 3000
fs.xfs.inherit_noatime = 1
fs.xfs.inherit_nodefrag = 1
fs.xfs.inherit_nodump = 1
fs.xfs.inherit_nosymlinks = 0
fs.xfs.inherit_sync = 1
fs.xfs.irix_sgid_inherit = 0
fs.xfs.irix_symlink_mode = 0
fs.xfs.panic_mask = 0
fs.xfs.rotorstep = 1
fs.xfs.speculative_prealloc_lifetime = 300
fs.xfs.stats_clear = 0
fs.xfs.xfsbufd_centisecs = 100
fs.xfs.xfssyncd_centisecs = 3000
kernel.acct = 4 2 30
kernel.acpi_video_flags = 0
kernel.auto_msgmni = 1
kernel.bootloader_type = 114
kernel.bootloader_version = 2
kernel.cad_pid = 1
kernel.cap_last_cap = 36
kernel.compat-log = 1
kernel.core_pattern = core
kernel.core_pipe_limit = 0
kernel.core_uses_pid = 1
kernel.ctrl-alt-del = 0
kernel.dmesg_restrict = 0
kernel.domainname = (none)
kernel.ftrace_dump_on_oops = 0
kernel.ftrace_enabled = 1
kernel.hardlockup_all_cpu_backtrace = 0
kernel.hardlockup_panic = 1
kernel.hostname = htc-sblock-node197
kernel.hotplug = 
kernel.hung_task_check_count = 4194304
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 120
kernel.hung_task_warnings = 0
kernel.io_delay_type = 0
kernel.kexec_load_disabled = 0
kernel.keys.gc_delay = 300
kernel.keys.maxbytes = 20000
kernel.keys.maxkeys = 200
kernel.keys.persistent_keyring_expiry = 259200
kernel.keys.root_maxbytes = 25000000
kernel.keys.root_maxkeys = 1000000
kernel.kptr_restrict = 0
kernel.max_lock_depth = 1024
kernel.modprobe = /sbin/modprobe
kernel.modules_disabled = 0
kernel.msg_next_id = -1
kernel.msgmax = 8192
kernel.msgmnb = 16384
kernel.msgmni = 32768
kernel.ngroups_max = 65536
kernel.nmi_watchdog = 1
kernel.ns_last_pid = 176562
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
kernel.numa_balancing_settle_count = 4
kernel.osrelease = 3.10.0-957.5.1.el7.x86_64
kernel.ostype = Linux
kernel.overflowgid = 65534
kernel.overflowuid = 65534
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_stackoverflow = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.perf_cpu_time_max_percent = 25
kernel.perf_event_max_sample_rate = 32000
kernel.perf_event_mlock_kb = 516
kernel.perf_event_paranoid = 2
kernel.pid_max = 196608
kernel.poweroff_cmd = /sbin/poweroff
kernel.print-fatal-signals = 0
kernel.printk = 7 4 1 7
kernel.printk_delay = 0
kernel.printk_ratelimit = 5
kernel.printk_ratelimit_burst = 10
kernel.pty.max = 4096
kernel.pty.nr = 4
kernel.pty.reserve = 1024
kernel.random.boot_id = 5bd2b4ab-221e-4157-98ad-fe4a81da7784
kernel.random.entropy_avail = 4034
kernel.random.poolsize = 4096
kernel.random.read_wakeup_threshold = 64
kernel.random.urandom_min_reseed_secs = 60
kernel.random.uuid = 4f4a6d22-d974-452d-b550-0e19b7a3c74e
kernel.random.write_wakeup_threshold = 896
kernel.randomize_va_space = 2
kernel.real-root-dev = 0
kernel.sched_autogroup_enabled = 0
kernel.sched_cfs_bandwidth_slice_us = 5000
kernel.sched_child_runs_first = 0
kernel.sched_latency_ns = 24000000
kernel.sched_migration_cost_ns = 500000
kernel.sched_min_granularity_ns = 3000000
kernel.sched_nr_migrate = 32
kernel.sched_rr_timeslice_ms = 100
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_schedstats = 0
kernel.sched_shares_window_ns = 10000000
kernel.sched_time_avg_ms = 1000
kernel.sched_tunable_scaling = 1
kernel.sched_wakeup_granularity_ns = 4000000
kernel.seccomp.actions_avail = kill trap errno trace allow
kernel.seccomp.actions_logged = kill trap errno trace
kernel.sem = 250  32000 32  128
kernel.sem_next_id = -1
kernel.shm_next_id = -1
kernel.shm_rmid_forced = 0
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096
kernel.softlockup_all_cpu_backtrace = 0
kernel.softlockup_panic = 0
kernel.spl.hostid = 0
kernel.spl.kmem.slab_kmem_alloc = 0
kernel.spl.kmem.slab_kmem_max = 0
kernel.spl.kmem.slab_kmem_total = 0
kernel.spl.kmem.slab_vmem_alloc = 305947392
kernel.spl.kmem.slab_vmem_max = 732324608
kernel.spl.kmem.slab_vmem_total = 347979264
kernel.spl.version = SPL v0.7.12-1
kernel.stack_tracer_enabled = 0
kernel.sysctl_writes_strict = 1
kernel.sysrq = 16
kernel.tainted = 12289
kernel.threads-max = 4126958
kernel.timer_migration = 1
kernel.traceoff_on_warning = 0
kernel.unknown_nmi_panic = 0
kernel.usermodehelper.bset = 4294967295 31
kernel.usermodehelper.inheritable = 4294967295  31
kernel.version = #1 SMP Fri Feb 1 14:54:57 UTC 2019
kernel.watchdog = 1
kernel.watchdog_cpumask = 0-191
kernel.watchdog_thresh = 10
kernel.yama.ptrace_scope = 0
sunrpc.max_resvport = 1023
sunrpc.min_resvport = 665
sunrpc.nfs_debug = 0x0000
sunrpc.nfsd_debug = 0x0000
sunrpc.nlm_debug = 0x0000
sunrpc.rpc_debug = 0x0000
sunrpc.tcp_fin_timeout = 15
sunrpc.tcp_max_slot_table_entries = 65536
sunrpc.tcp_slot_table_entries = 2
sunrpc.transports = tcp 1048576
sunrpc.transports = udp 32768
sunrpc.transports = tcp-bc 1048576
sunrpc.udp_slot_table_entries = 16
user.max_ipc_namespaces = 2063479
user.max_mnt_namespaces = 2063479
user.max_pid_namespaces = 2063479
user.max_user_namespaces = 0
user.max_uts_namespaces = 2063479
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 90112
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.mmap_rnd_bits = 28
vm.mmap_rnd_compat_bits = 8
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

Please let me know if there is anything else useful I can add.

Answer 1

This is a freebie, because I think the scope of the work extends into paid ZFS consulting.

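  • How are your enclosures cabled?
  • You have 12 external JBODs, but there is no indication that multipath is enabled.
  • Think about how the disks that go offline relate to the disk enclosures and to the zpool.
  • With this many enclosures, I always advocate a SAS cabling ring topology.
  • If you don't have one, I would work toward it.
  • In that case, your pool should also be built from multipath devices under /dev/mapper.
  • Can you show your /etc/modprobe.d/zfs.conf?
  • Are all of the disks SAS?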
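
As a quick check on the multipath point: with dual-path cabling each disk should show up as two block devices sharing one WWID, and the pool should reference the /dev/mapper device rather than a single path. A rough sketch only; it takes a plain {device: wwid} mapping that you would collect yourself, and the function names are invented for illustration:

```python
from collections import defaultdict

def paths_per_wwid(device_wwids):
    """Group block devices by WWID: {wwid: [devices that expose it]}."""
    paths = defaultdict(list)
    for device, wwid in device_wwids.items():
        paths[wwid].append(device)
    return {wwid: sorted(devices) for wwid, devices in paths.items()}

def single_path_wwids(device_wwids):
    """Return WWIDs that are reachable through fewer than two block devices."""
    grouped = paths_per_wwid(device_wwids)
    return sorted(wwid for wwid, devices in grouped.items() if len(devices) < 2)

def non_multipath_vdevs(vdev_paths):
    """Return pool member paths that are not under /dev/mapper."""
    return sorted(p for p in vdev_paths if not p.startswith("/dev/mapper/"))
```

Any WWID returned by single_path_wwids has only one path, and any path returned by non_multipath_vdevs is a pool member that is not using a multipath device.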

Example of SAS multipath cabling:

[Image: SAS multipath cabling diagram]
