Many disk errors, but no hardware alerts. Is this a hardware or a software problem?

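Summary: Random disks on one data node in a Hadoop cluster keep going read-only. Jobs fail, but there are no hardware alerts on the server.

Hello,

I am administering a Hadoop cluster running on CentOS 7 (7.4.1708).

The data science team has been dealing with failed jobs for a long time. Over the same period, storage disks on one particular data node have been going read-only.

Because the initial exceptions we received were misleading, we could not connect the two (in fact, we could find no evidence that they were related). Every time a disk went read-only I ran fsck (with the -a flag for automatic repair), but it only fixed logical blocks and found no hardware errors.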
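For anyone who wants to catch the read-only remounts as they happen rather than after a job fails, here is a minimal sketch. It assumes the data disks are ext4 (as the dmesg output in this post suggests) and that the text passed in has the format of /proc/mounts; the function name is my own.

```python
def read_only_mounts(mounts_text, fs_types=("ext4",)):
    """Return (device, mount_point) pairs that are currently mounted read-only.

    mounts_text is expected to look like the contents of /proc/mounts:
    device, mount point, filesystem type, comma-separated options, ...
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        # "ro" must be a whole option; "errors=remount-ro" does not count.
        if fs_type in fs_types and "ro" in options.split(","):
            result.append((device, mount_point))
    return result
```

On the node itself this could be called as read_only_mounts(open("/proc/mounts").read()) from a cron job or a monitoring check.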

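We have now established a link between the two problems: every failed job was using that particular node as its application master.

Although there are many disk errors at the OS level, the server reports no hardware errors or alerts (LED signals / hardware interface). Is such a hardware report a necessary condition for calling this a hardware problem?

Thanks in advance.

OS: CentOS 7.4.1708

Hardware: HPE Apollo 4530

Hard disk: HPE MB6000GEFNB 765251-002 (6TB 6G hot-plug SATA 7.2K 3.5in 512e MDL LP HDD) - (reported as not supporting SMART)

Details from the application and system logs follow.

We found the following exceptions in the YARN NodeManager logs on the problem node: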

2018-06-04 06:54:27,390 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[LocalizerRunner for container_e77_1527963665893_4250_01_000009,5,main] threw an Exception.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:259)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1138)
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:251)
        ... 1 more
2018-06-04 06:54:27,394 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1134)) - Localizer failed
java.lang.RuntimeException: Error while running command to get file permissions : java.io.InterruptedIOException: java.lang.InterruptedException
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:947)
        at org.apache.hadoop.util.Shell.run(Shell.java:848)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:1236)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:1218)
        at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1077)
        at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:686)
        at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:661)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1440)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1404)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:141)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1111)
Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.UNIXProcess.waitFor(UNIXProcess.java:396)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
        ... 11 more

        at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:726)
        at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:661)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1440)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1404)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:141)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1111)

There are also some rarer exceptions in the node's HDFS logs, such as this one:

2018-06-10 06:55:27,280 ERROR datanode.DataNode (DataXceiver.java:run(278)) - dnode003.mycompany.local:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.0.0.17:50095 dst: /10.0.0.13:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:203)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:929)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:817)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
        at java.lang.Thread.run(Thread.java:745)

Linux system (dmesg) logs:

[  +0.000108] Buffer I/O error on device sdn1, logical block 174931199
[  +0.756448] JBD2: Detected IO errors while flushing file data on sdn1-8
[Jun11 14:57] hpsa 0000:07:00.0: scsi 1:0:0:2: resetting Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap- En- Exp=3
[Jun11 14:58] hpsa 0000:07:00.0: scsi 1:0:0:2: reset completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap- En- Exp=3
[  +0.000176] hpsa 0000:07:00.0: scsi 1:0:0:4: resetting Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap- En- Exp=3
[  +0.000424] hpsa 0000:07:00.0: scsi 1:0:0:4: reset completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap- En- Exp=3
[Jun11 15:24] EXT4-fs error (device sdo1): ext4_mb_generate_buddy:757: group 32577, block bitmap and bg descriptor inconsistent: 31238 vs 31241 free clusters
[  +0.013631] JBD2: Spotted dirty metadata buffer (dev = sdo1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
...
...

[Jun12 04:56] sd 1:0:0:11: [sdm] tag#163 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  +0.000016] sd 1:0:0:11: [sdm] tag#163 Sense Key : Medium Error [current]
[  +0.000019] sd 1:0:0:11: [sdm] tag#163 Add. Sense: Unrecovered read error
[  +0.000004] sd 1:0:0:11: [sdm] tag#163 CDB: Write(16) 8a 00 00 00 00 00 44 1f a4 00 00 00 04 00 00 00
[  +0.000002] blk_update_request: critical medium error, dev sdm, sector 1142924288
[  +0.000459] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865537)
[  +0.000004] Buffer I/O error on device sdm1, logical block 142865280
[  +0.000216] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865538)
[  +0.000003] Buffer I/O error on device sdm1, logical block 142865281
[  +0.000228] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865539)
[  +0.000002] Buffer I/O error on device sdm1, logical block 142865282
[  +0.000247] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865540)
[  +0.000002] Buffer I/O error on device sdm1, logical block 142865283
[  +0.000297] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865541)
[  +0.000003] Buffer I/O error on device sdm1, logical block 142865284
[  +0.000235] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865542)
[  +0.000003] Buffer I/O error on device sdm1, logical block 142865285
[  +0.000241] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865543)
[  +0.000002] Buffer I/O error on device sdm1, logical block 142865286
[  +0.000223] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865544)
[  +0.000002] Buffer I/O error on device sdm1, logical block 142865287
[  +0.000210] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865545)
[  +0.000003] Buffer I/O error on device sdm1, logical block 142865288
[  +0.000227] EXT4-fs warning (device sdm1): ext4_end_bio:332: I/O error -61 writing to inode 61451821 (offset 0 size 0 starting block 142865546)
[  +0.000002] Buffer I/O error on device sdm1, logical block 142865289
[  +0.000192] Buffer I/O error on device sdm1, logical block 142865290
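The dmesg excerpt above mentions sdm, sdn and sdo, along with resets of hpsa logical volumes. A quick tally of errors per disk can show whether the problem is confined to one drive or spread across several drives behind the same controller. The sketch below is only a rough aid: the patterns cover just the message types quoted in this post, and the function name is hypothetical.

```python
import re
from collections import Counter

# Patterns for the kernel messages quoted in this post; other kernels or
# drivers may word these messages differently.
PATTERNS = [
    re.compile(r"Buffer I/O error on device (sd[a-z]+)\d*"),
    re.compile(r"EXT4-fs (?:error|warning) \(device (sd[a-z]+)\d*\)"),
    re.compile(r"blk_update_request: .* error, dev (sd[a-z]+)"),
    re.compile(r"JBD2: Detected IO errors while flushing file data on (sd[a-z]+)"),
]


def errors_per_disk(dmesg_text):
    """Count error lines per whole disk (partition numbers are dropped)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each line at most once
    return counts
```

Errors spread over many disks at once would be more consistent with a controller, cabling or backplane issue than with a single failing drive, although a tally like this cannot prove either.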

Answer 1

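Some HPE Smart Array controllers have firmware bugs that can lock up the controller, and they may or may not log any error to the Integrated Management Log.

You may be affected by this advisory.

The fix is to upgrade the Smart Array controller firmware. Here is the resolution text copied from the advisory:

Smart Array/HBA firmware version 4.02 (or later) resolves the issue.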

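Perform the following steps to obtain the latest Smart Array/HBA firmware:

  1. Click the following link: http://h20566.www2.hpe.com/portal/site/hpsc?ac.admitted=1447799799154.125225703.1938120508

  2. In the "Enter a product name or number" box, enter the name of the controller.

  3. Select "Get drivers, software & firmware".

  4. Select the appropriate operating system.

  5. Select the category "Firmware - Storage Controller".

  6. Locate, download, and install Smart Array firmware version 4.02 (or later).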
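Before and after the upgrade it is worth confirming which firmware the controller is actually running. The sketch below assumes you already have a text report containing lines of the form "Firmware Version: X.YY" (for example, output saved from HPE's ssacli tool; that exact output format is my assumption, so adjust the pattern if yours differs). The function names are hypothetical.

```python
import re


def parse_version(text):
    """Turn a dotted version such as '4.02' into a comparable tuple (4, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def firmware_versions(report_text):
    """Extract version strings from 'Firmware Version: X.YY' lines (assumed format)."""
    return re.findall(r"Firmware Version:\s*([0-9]+(?:\.[0-9]+)*)", report_text)


def needs_update(version, minimum="4.02"):
    """True if the version is older than the minimum named in the advisory."""
    return parse_version(version) < parse_version(minimum)
```

For example, needs_update("3.56") returns True, while needs_update("4.52") returns False.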

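If you have trouble following the instructions above, you can find the firmware more quickly by searching the web for your Smart Array model plus "firmware" and choosing the "drivers" result, for example:

Google search for "Smart Array P440 firmware"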
