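Our Kafka production cluster consists of 23 broker machines, and each broker has 35 JBOD disks.
The brokers run Apache Kafka 2.7, and the cluster includes 5 ZooKeeper servers.
In /var/log/kafka/server.log
we see many warnings like the following: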
Log directory /var/kafka/broker_logsXX is offline
Here are some more examples from server.log:
[2023-09-22 02:10:34,583] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs3 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs3 is offline
[2023-09-22 02:10:34,588] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs24 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs24 is offline
[2023-09-22 02:10:34,588] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs9 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs9 is offline
[2023-09-22 02:10:34,594] WARN [ReplicaManager broker=1012] Unable to describe replica dirs for /var/kafka/broker_logs39 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Log directory /var/kafka/broker_logs39 is offline
[2023-11-03 09:52:46,390] INFO Updated cache from existing <empty> to latest FinalizedFeaturesAndEpoch(features=Features{}, epoch=1). (kafka.server.FinalizedFeatureCache)
[2023-11-03 09:52:46,393] INFO Cluster ID = XXXXXXXXXXX (kafka.server.KafkaServer)
[2023-11-03 09:52:46,724] ERROR Fail to read meta.properties under log directory /var/kafka/broker_logs24 (kafka.server.KafkaServer)
java.nio.file.FileSystemException: /var/kafka/broker_logs24/meta.properties.tmp: Input/output error
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
at java.nio.file.Files.deleteIfExists(Files.java:1165)
at kafka.server.BrokerMetadataCheckpoint.read(BrokerMetadataCheckpoint.scala:69)
at kafka.server.KafkaServer.$anonfun$getBrokerMetadataAndOfflineDirs$1(KafkaServer.scala:784)
at kafka.server.KafkaServer.$anonfun$getBrokerMetadataAndOfflineDirs$1$adapted(KafkaServer.scala:782)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:553)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:551)
at scala.collection.AbstractIterable.foreach(Iterable.scala:920)
at kafka.server.KafkaServer.getBrokerMetadataAndOfflineDirs(KafkaServer.scala:782)
at kafka.server.KafkaServer.startup(KafkaServer.scala:246)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:44)
at kafka.Kafka$.main(Kafka.scala:82)
at kafka.Kafka.main(Kafka.scala)
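To see which directories a broker reports as offline, and how often, the log can be summarized with a small script. This is only a sketch: it assumes the message text matches the excerpts above exactly, and the function name offline_dirs is my own, not part of Kafka.

```python
import re
from collections import Counter

# Matches the message text seen in the server.log excerpts above.
OFFLINE_RE = re.compile(r"Log directory (\S+) is offline")


def offline_dirs(log_lines):
    """Count 'is offline' messages per log directory.

    log_lines is any iterable of strings, for example an open file.
    Returns a Counter mapping directory path to number of messages.
    """
    counts = Counter()
    for line in log_lines:
        match = OFFLINE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example usage (path taken from the question):
# with open("/var/log/kafka/server.log", errors="replace") as fh:
#     for directory, n in sorted(offline_dirs(fh).items()):
#         print(directory, n)
```

Running this on each broker shows whether the same few directories repeat or whether the set changes over time.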
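It is not clear to me why the Kafka broker complains that the log directory /var/kafka/....
is offline.
Is it because of a disk problem? Or because the disk load is too high? Or could it be that the broker is not in sync with the target replicas? And so on.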
In the last example we can see
java.nio.file.FileSystemException: /var/kafka/broker_logs24/meta.properties.tmp: Input/output error
As I understand it, an Input/output
error is typically a sign of a disk problem.
But when I look across all 35 broker machines, we see about 4-6 warnings of the form Log directory /var/kafka/... is offline
Surely it does not make sense for every broker to have 4-6 failed disks?
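One way I could test the disk hypothesis outside of Kafka is to try reading meta.properties in each log directory and look at the errno. The sketch below is an assumption-laden probe, not what Kafka does internally: probe_log_dir is a name I made up, it only reads (the stack trace above failed on a delete of meta.properties.tmp), and a read served from the page cache can succeed even on a bad disk.

```python
import errno
import os


def probe_log_dir(log_dir):
    """Try to read meta.properties in log_dir and classify the result.

    Returns "ok", "io_error" (EIO, which usually points at the device
    or filesystem), or "other:<errno name>" for any other OSError.
    """
    meta = os.path.join(log_dir, "meta.properties")
    try:
        with open(meta, "rb") as fh:
            fh.read()
        return "ok"
    except OSError as exc:
        if exc.errno == errno.EIO:
            return "io_error"
        return "other:" + errno.errorcode.get(exc.errno, str(exc.errno))


# Example usage (directory names taken from the log excerpts):
# for d in ("/var/kafka/broker_logs3", "/var/kafka/broker_logs24"):
#     print(d, probe_log_dir(d))
```

If the directories reported offline return "ok" here while the kernel log shows no errors for those devices, that would suggest the cause is something other than 4-6 physically failed disks per broker.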