Kafka: what does "Log directory is offline" mean?


Our production Kafka cluster consists of 23 broker machines, and each broker has 35 JBOD disks.

The broker version is Apache Kafka 2.7, and the cluster includes 5 ZooKeeper servers.

In /var/log/kafka/server.log we see many warnings like the following:

Log directory /var/kafka/broker_logsXX is offline

Here are some more examples from server.log:

[2023-09-22 02:10:34,583] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs3 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs3 is offline
[2023-09-22 02:10:34,588] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs24 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs24 is offline
[2023-09-22 02:10:34,588] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs9 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs9 is offline
[2023-09-22 02:10:34,594] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs39 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs39 is offline





[2023-11-03 09:52:46,390] INFO Updated cache from existing <empty> to latest FinalizedFeaturesAndEpoch(features=Features{}, epoch=1). (kafka.server.FinalizedFeatureCache)
[2023-11-03 09:52:46,393] INFO Cluster ID = XXXXXXXXXXX (kafka.server.KafkaServer)
[2023-11-03 09:52:46,724] ERROR Fail to read meta.properties under log directory /var/kafka/broker_logs24 (kafka.server.KafkaServer)
java.nio.file.FileSystemException: /var/kafka/broker_logs24/meta.properties.tmp: Input/output error
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
        at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
        at java.nio.file.Files.deleteIfExists(Files.java:1165)
        at kafka.server.BrokerMetadataCheckpoint.read(BrokerMetadataCheckpoint.scala:69)
        at kafka.server.KafkaServer.$anonfun$getBrokerMetadataAndOfflineDirs$1(KafkaServer.scala:784)
        at kafka.server.KafkaServer.$anonfun$getBrokerMetadataAndOfflineDirs$1$adapted(KafkaServer.scala:782)
        at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
        at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
        at kafka.server.KafkaServer.getBrokerMetadataAndOfflineDirs(KafkaServer.scala:782)
        at kafka.server.KafkaServer.startup(KafkaServer.scala:246)
        at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
        at kafka.Kafka$.main(Kafka.scala:82)
        at kafka.Kafka.main(Kafka.scala)

It is not clear why the Kafka broker complains that the log directory /var/kafka/.... is offline.

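Is it because of a disk problem? Is the load on the disks too high? Or could it be that the broker's replicas are out of sync? And so on.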

In the last example we can see:

java.nio.file.FileSystemException: /var/kafka/broker_logs24/meta.properties.tmp: Input/output error

As I understand it, an Input/output error is typically a sign of a disk problem.
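One way to test whether a given log directory is hitting I/O errors at the filesystem level is to try a small write, fsync, and read inside it. The following is a minimal Python sketch, not part of Kafka: `probe_dir` and `is_io_error` are hypothetical helper names, and it assumes the script runs as a user with write access to the directory being checked. An `EIO` result would suggest a problem at the disk or filesystem level rather than in Kafka itself.

```python
import errno
import os
import tempfile


def probe_dir(path):
    """Write, fsync, and read back a small temp file in `path`.

    Returns None on success, or the OSError that was raised.
    The temp file is deleted automatically when it is closed.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=path, prefix=".probe-") as tmp:
            tmp.write(b"probe")
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data down to the device
            tmp.seek(0)
            tmp.read()
        return None
    except OSError as exc:
        return exc


def is_io_error(exc):
    """True if the error is EIO, i.e. the 'Input/output error' seen in the log."""
    return exc is not None and exc.errno == errno.EIO


if __name__ == "__main__":
    # Demo on the system temp directory; on a broker, pass each log directory
    # (for example /var/kafka/broker_logs24) instead.
    result = probe_dir(tempfile.gettempdir())
    print("ok" if result is None else f"failed: {result}")
```

Running `probe_dir` once per log directory on a broker gives a quick list of which mount points fail and with which error code. Note that an error other than `EIO` (for example a read-only filesystem or a missing mount) would also make a directory unusable for Kafka.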

But when I look across all 35 broker machines, we see about 4-6 "Log directory /var/kafka/... is offline" warnings on each.

It doesn't seem plausible that every broker has 4-6 faulty disks, does it?
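To see whether those warnings really correspond to 4-6 different disks, or to the same few directories reported repeatedly, the offline messages can be grouped by directory. Below is a minimal Python sketch that assumes the message format shown in the excerpts above; `offline_dirs` is a hypothetical helper, and the sample lines are copied from this post.

```python
import re
from collections import Counter

# Matches the KafkaStorageException message format shown in the excerpts.
OFFLINE_RE = re.compile(r"Log directory (\S+) is offline")

# Sample lines copied from the server.log excerpt in this post.
SAMPLE = [
    "[2023-09-22 02:10:34,583] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs3 (kafka.server.ReplicaManager)",
    "org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs3 is offline",
    "org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs24 is offline",
    "org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs9 is offline",
    "org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs39 is offline",
]


def offline_dirs(lines):
    """Return a dict mapping each offline log directory to its message count."""
    counts = Counter()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


if __name__ == "__main__":
    counts = offline_dirs(SAMPLE)
    for directory, n in sorted(counts.items()):
        print(f"{directory}\t{n}")
    print(f"{len(counts)} distinct offline log directories")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `offline_dirs(open("/var/log/kafka/server.log", errors="replace"))`. The number of distinct directories per broker, rather than the raw warning count, is what indicates how many disks are affected.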
