What could prevent hard drive hot-plugging with AHCI on Linux?

I've been tearing my hair out over this problem.

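I want to add a hot-swap bay to my home server so I can easily add and remove drives, for example to rotate off-site backups. The motherboard in question is an ASRock J4105-ITX with four native SATA ports, split between an ASM1062 and the Intel processor's SATA controller. Both work fine and use the ahci kernel module. There is a hot-plug option in the BIOS, but it doesn't seem to have any effect.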

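If a drive is disconnected (either with echo 1 > /sys/block/sdX/device/delete or by simply yanking it out), no new device is detected after it is reconnected. I've tried forcing a rescan (echo "- - -" > /sys/class/scsi_host/host<n>/scan) to no avail: the SATA port is effectively unusable until the next reboot. I've also tried some more drastic commands, none of which had any effect: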

echo 1 > /sys/class/scsi_device/2:0:0:0/device/reset
echo 1 > /sys/devices/pci0000:00/0000:00:1f.2/rescan
echo 1 > /sys/devices/pci0000:00/0000:00:1f.2/reset

(Taken from "How can I make Linux recognize a new SATA /dev/sda drive I hot-swapped in without rebooting?")
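The <n> in host<n>/scan is not the same number as the ataN port, and with several controllers it is easy to poke the wrong host. One way to see which SCSI host belongs to which controller is to resolve the /sys/class/scsi_host symlinks into the device tree. This is a minimal sketch, assuming bash and GNU readlink; the function name list_ata_hosts and the SYSFS_ROOT override (defaulting to /sys, only there so the function can be tried on a fake tree) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print "hostN ataM PCI-address" for every ATA-backed SCSI host,
# to tell which /sys/class/scsi_host/host<n>/scan belongs to which controller.
# SYSFS_ROOT is a hypothetical override (defaults to /sys), not a kernel setting.

list_ata_hosts() {
    local root="${SYSFS_ROOT:-/sys}"
    local host_link target pci_path rest ata
    for host_link in "$root"/class/scsi_host/host*; do
        [ -e "$host_link" ] || continue            # glob matched nothing
        # Resolves to e.g. /sys/devices/pci0000:00/0000:00:12.0/ata1/host0/scsi_host/host0
        target=$(readlink -f "$host_link")
        # Cut everything from the first "/ataN" component: what is left is the PCI device path
        pci_path=${target%%/ata[0-9]*}
        [ "$pci_path" = "$target" ] && continue     # no ataN component: not an ATA host (e.g. USB)
        rest=${target#"$pci_path"/}                 # "ata1/host0/scsi_host/host0"
        ata=${rest%%/*}                             # "ata1"
        printf '%s %s %s\n' "${host_link##*/}" "$ata" "${pci_path##*/}"
    done
}
```

Calling list_ata_hosts on the machine itself (no root needed, it only reads sysfs) should list one line per SATA port, with the PCI address matching what lspci reports for each controller.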

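"Well, maybe the chipset doesn't support hot-plugging, or the BIOS is buggy." So I ordered two PCIe SATA controllers (one based on the ASM1064, the other on the Marvell 88SE9215). Both show exactly the same problem, even though other buyers report that hot-plugging works for them, so I suspect the problem is on the software side (my installation? I'm running Arch Linux, kept fully up to date).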

Some hopefully useful information:

$ uname -a
Linux servername 5.14.14-arch1-1 #1 SMP PREEMPT Wed, 20 Oct 2021 21:35:18 +0000 x86_64 GNU/Linux

$ dmesg | grep ahci
[    0.447450] ahci 0000:00:12.0: version 3.0
[    0.447842] ahci 0000:00:12.0: SSS flag set, parallel bus scan disabled
[    0.457970] ahci 0000:00:12.0: AHCI 0001.0301 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
[    0.457981] ahci 0000:00:12.0: flags: 64bit ncq sntf stag pm clo only pmp pio slum part sxs deso sadm sds apst 
[    0.458750] scsi host0: ahci
[    0.459204] scsi host1: ahci
[    0.469788] ahci 0000:01:00.0: AHCI 0001.0000 32 slots 4 ports 6 Gbps 0xf impl SATA mode
[    0.469801] ahci 0000:01:00.0: flags: 64bit ncq sntf led only pmp fbs pio slum part sxs 
[    0.470767] scsi host2: ahci
[    0.471203] scsi host3: ahci
[    0.471562] scsi host4: ahci
[    0.471904] scsi host5: ahci
[    0.472341] ahci 0000:04:00.0: SSS flag set, parallel bus scan disabled
[    0.472376] ahci 0000:04:00.0: AHCI 0001.0200 32 slots 2 ports 6 Gbps 0x3 impl SATA mode
[    0.472382] ahci 0000:04:00.0: flags: 64bit ncq sntf stag led clo pmp pio slum part ccc 
[    0.472803] scsi host6: ahci
[    0.473011] scsi host7: ahci

$ lspci -v
[...]
01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller (rev 11) (prog-if 01 [AHCI 1.0])
    Subsystem: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
    Flags: bus master, fast devsel, latency 0, IRQ 127
    I/O ports at e050 [size=8]
    I/O ports at e040 [size=4]
    I/O ports at e030 [size=8]
    I/O ports at e020 [size=4]
    I/O ports at e000 [size=32]
    Memory at a1340000 (32-bit, non-prefetchable) [size=2K]
    Expansion ROM at a1300000 [disabled] [size=256K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [70] Express Legacy Endpoint, MSI 00
    Capabilities: [e0] SATA HBA v0.0
    Capabilities: [100] Advanced Error Reporting
    Kernel driver in use: ahci
[...]

Answer

I finally found the cause: my powertop tuning was too aggressive!

Because this server runs 24/7 and electricity is fairly expensive here, I had added a systemd service that automatically applies all of powertop's tunings:

$ cat /etc/systemd/system/powertop.service
[Unit]
Description=Powertop tunings

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/powertop --auto-tune

[Install]
WantedBy=multi-user.target

This is equivalent to opening the powertop TUI and setting every option to "Good". The key part is the four lines about Runtime PM for port ataX:

   Good          Runtime PM for port ata3 of PCI device: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
   Bad           Runtime PM for port ata4 of PCI device: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
   Good          Runtime PM for port ata5 of PCI device: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
>> Good          Runtime PM for port ata6 of PCI device: Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller
   Good          Runtime PM for PCI Device Marvell Technology Group Ltd. 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller

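For each port, powertop runs a command like echo 'auto' > '/sys/bus/pci/devices/0000:01:00.0/ata4/power/control' (this one is for ata4), which apparently prevents the SATA card from ever detecting new devices on that port!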
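To check which ports powertop (or anything else) has left under runtime PM, the power/control attribute can be read for every ataN directory below every PCI device. A rough sketch, assuming bash and the /sys/bus/pci/devices/<address>/ataN/power/control layout quoted above; show_ata_runtime_pm and the SYSFS_ROOT override are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: print "PCI-address ataN setting" for every ATA port under every PCI device.
# "auto" = runtime PM allowed (powertop's "Good"); "on" = kept powered (powertop's "Bad").
# SYSFS_ROOT is a hypothetical override (defaults to /sys) for trying it on a fake tree.

show_ata_runtime_pm() {
    local root="${SYSFS_ROOT:-/sys}"
    local ctrl port pci
    for ctrl in "$root"/bus/pci/devices/*/ata*/power/control; do
        [ -r "$ctrl" ] || continue          # glob matched nothing
        port=${ctrl%/power/control}         # .../0000:01:00.0/ata4
        pci=${port%/*}                      # .../0000:01:00.0
        printf '%s %s %s\n' "${pci##*/}" "${port##*/}" "$(cat "$ctrl")"
    done
}
```

It only reads sysfs, so it can be run without root; any port reporting auto is one where runtime PM is enabled.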

Only after setting power/control to on (the "Bad" option in powertop) does the card find new devices when I run echo 0 0 0 | sudo tee /sys/class/scsi_host/host*/scan.
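The fix described here (power/control set to on, then a rescan) can be wrapped in a small helper. This is only a sketch, assuming bash, root privileges, and that each ataN directory contains its hostM/scsi_host/hostM node as on a typical AHCI setup; pin_ports_and_rescan and SYSFS_ROOT are hypothetical names, and it writes the wildcard "- - -" scan string rather than 0 0 0.

```shell
#!/usr/bin/env bash
# Sketch: keep every ATA port of one PCI controller powered ("on", powertop's "Bad"),
# then ask each port's SCSI host to rescan for devices.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) for trying it on a fake tree.

pin_ports_and_rescan() {
    local pci_addr="$1"                     # e.g. 0000:01:00.0 from lspci
    local root="${SYSFS_ROOT:-/sys}"
    local dev="$root/bus/pci/devices/$pci_addr"
    local port host
    [ -d "$dev" ] || { echo "no such PCI device: $pci_addr" >&2; return 1; }
    for port in "$dev"/ata*; do
        [ -w "$port/power/control" ] || continue
        echo on > "$port/power/control"     # disable runtime PM for this port
        # Each ataN port has its SCSI host directly beneath it
        for host in "$port"/host*/scsi_host/host*; do
            [ -w "$host/scan" ] || continue
            echo "- - -" > "$host/scan"     # wildcard channel, target and LUN
        done
    done
}
```

For example, sudo bash -c '. ./pin-ports.sh && pin_ports_and_rescan 0000:01:00.0', where pin-ports.sh is a hypothetical file holding the function and 0000:01:00.0 is the Marvell card's address from the lspci output. Calling something like it from an ExecStartPost= line in the powertop.service unit would be one (untested) way to keep the remaining auto-tune settings while exempting the SATA ports.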

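The only thing I'm still missing is automatic rescanning: my desktop PC picks up new devices on its own without writing to hostX/scan, but I can live with that for now. This was a really frustrating experience, so I hope this helps anyone facing the same problem.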
