Kafka brokers go offline due to heap OOM

Asked by: Gavin Gu  Asked: 11/8/2018  Last edited by: Gavin Gu  Updated: 11/9/2018  Views: 882

Q:

We recently found that our Kafka cluster went offline in production. The cluster has four brokers, the replication factor is 2, and KAFKA_HEAP_OPTS is -Xmx30G -Xms30G.

server.log:

[2018-10-18 14:12:01,340] WARN [ReplicaFetcherThread-0-3], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@785b3ea5 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
        at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
        at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2018-10-18 14:12:03,959] INFO [GroupCoordinator 4]: Preparing to restabilize group 8017 with old generation 11 (kafka.coordinator.GroupCoordinator)
[2018-10-18 14:12:03,958] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Java heap space
[2018-10-18 14:12:02,156] ERROR [KafkaApi-4] Error when handling request {replica_id=-1,max_wait_time=100,min_bytes=1,topics=[{topic=**********,partitions=[{partition=9,fetch_offset=3610091552,max_bytes=1048576},{partition=13,fetch_offset=3673665102,max_bytes=1048576},{partition=1,fetch_offset=3685463160,max_bytes=1048576},{partition=10,fetch_offset=3628517926,max_bytes=1048576},{partition=5,fetch_offset=3653905643,max_bytes=1048576}]}]} (kafka.server.KafkaApis)
java.lang.OutOfMemoryError: Java heap space
[2018-10-18 14:12:02,155] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Java heap space
[2018-10-18 14:12:01,342] ERROR [ExpirationReaper-4], Error due to  (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
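
The entries in this excerpt are not in timestamp order, which makes it harder to see which error came first. A minimal Python sketch, assuming the log keeps the "[timestamp] LEVEL message" prefix shown above, that orders the entries and picks the earliest ERROR. The names ENTRY, parse_entries, first_error and SAMPLE are illustrative only and are not part of Kafka.

```python
import re
from datetime import datetime

# Prefix of a server.log entry: "[2018-10-18 14:12:01,340] LEVEL message".
ENTRY = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] (\w+) (.*)$")


def parse_entries(lines):
    """Return (timestamp, level, message) tuples; stack-trace lines are skipped."""
    entries = []
    for line in lines:
        match = ENTRY.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            entries.append((ts, match.group(2), match.group(3)))
    return entries


def first_error(lines):
    """Return the earliest ERROR entry by timestamp, or None if there is none."""
    errors = [e for e in parse_entries(lines) if e[1] == "ERROR"]
    return min(errors, key=lambda e: e[0]) if errors else None


# A subset of the lines quoted above, in the order they appear in the excerpt.
SAMPLE = [
    "[2018-10-18 14:12:03,959] INFO [GroupCoordinator 4]: Preparing to restabilize group 8017 with old generation 11 (kafka.coordinator.GroupCoordinator)",
    "[2018-10-18 14:12:03,958] ERROR Processor got uncaught exception. (kafka.network.Processor)",
    "java.lang.OutOfMemoryError: Java heap space",
    "[2018-10-18 14:12:02,155] ERROR Processor got uncaught exception. (kafka.network.Processor)",
    "[2018-10-18 14:12:01,342] ERROR [ExpirationReaper-4], Error due to  (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)",
]

# Print the entries in chronological order.
for ts, level, message in sorted(parse_entries(SAMPLE)):
    print(ts.isoformat(sep=" ", timespec="milliseconds"), level, message)
```

Running the same functions over the full server.log, rather than this short sample, would show what the broker logged in the seconds before the first OutOfMemoryError.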

Before that, over the preceding few hours, the cluster had also logged many ZooKeeper session expirations:

./controller.log.2018-10-18-12:[2018-10-18 12:05:51,300] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-12:[2018-10-18 12:42:43,576] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:00:54,919] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:12:26,598] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:24:22,851] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:29:09,095] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:33:14,948] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:37:45,249] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:43:55,640] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:48:53,711] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:51:29,411] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:57:27,588] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-14:[2018-10-18 14:03:20,452] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-14:[2018-10-18 14:06:14,026] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
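
The spacing between these expirations may be a useful signal, since sessions that expire more and more often can point to growing pauses on the broker (long GC pauses are one commonly cited cause, though nothing here confirms that). A minimal Python sketch, assuming grep-style lines like the ones above, that extracts the "ZK expired" timestamps and computes the gap between consecutive events. The names EXPIRED, expiration_times, gaps_in_seconds and SAMPLE are illustrative only.

```python
import re
from datetime import datetime

# Bracketed timestamp followed, later on the same line, by the "ZK expired" message.
EXPIRED = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\].*ZK expired")


def expiration_times(lines):
    """Return the timestamps of all 'ZK expired' entries, in chronological order."""
    times = []
    for line in lines:
        match = EXPIRED.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return sorted(times)


def gaps_in_seconds(times):
    """Return the seconds elapsed between each pair of consecutive timestamps."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# The first three lines of the listing above.
SAMPLE = [
    "./controller.log.2018-10-18-12:[2018-10-18 12:05:51,300] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)",
    "./controller.log.2018-10-18-12:[2018-10-18 12:42:43,576] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)",
    "./controller.log.2018-10-18-13:[2018-10-18 13:00:54,919] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)",
]

for gap in gaps_in_seconds(expiration_times(SAMPLE)):
    print(f"{gap / 60:.1f} minutes since the previous expiration")
```

Applied to the whole listing, the gaps shrink from roughly 37 minutes at 12:42 to a few minutes shortly before the OOM at 14:12. It may be worth comparing that trend with GC logs and with the broker's zookeeper.session.timeout.ms setting.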

Can anyone help look into this?

==============================================================

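More details, with a screenshot: something abnormal happened that day, as the Zabbix monitoring of Kafka per-topic producer volume below shows.

The incoming volume for some topics was ten thousand times the usual level. However, when we checked the logs of the upstream producers, they showed a normal production volume.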

apache-kafka

Comments

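0 votes Gavin Gu 11/8/2018
The Kafka version is 0.10.
0 votes OneCricketeer 11/8/2018
The replication factor generally affects only disk, not heap. So how many topics do you have in total, and how many partitions does each broker maintain? Also, the most useful information is actually a heap dump, not the logs. See about adding HeapDumpOnOutOfMemoryError.
0 votes Gavin Gu 11/8/2018
@cricket, thanks for your reply. We have 49 topics, most of them with 16 partitions. When the OOM occurred, we tried to get a dump with the command jmap -dump:format=b,file=dump, but got an "unable to attach to process" error. We have now added the HeapDumpOnOutOfMemoryError parameter. I am just trying to find out the possible causes of this problem. I have also searched many documents about Kafka OOM, but so far have not found anything useful that resembles our situation.
0 votes Giorgos Myrianthous 11/8/2018
Did you recently enable SSL?
0 votes Gavin Gu 11/8/2018
No, we have not enabled SSL.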
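
For readers wondering what the HeapDumpOnOutOfMemoryError suggestion in the comments looks like in practice, here is a minimal Python sketch that prepares the environment for a broker start. It assumes the stock Kafka start scripts, which pass KAFKA_HEAP_OPTS and KAFKA_OPTS on to the JVM. The function name broker_env and the dump directory /var/log/kafka/heapdumps are hypothetical, and the heap size simply repeats the value from the question.

```python
import os


def broker_env(heap="30G", dump_dir="/var/log/kafka/heapdumps"):
    """Return a copy of the current environment with heap-dump-on-OOM flags added.

    dump_dir is a hypothetical path; it should have free space on the order of
    the heap size, because the dump can be about as large as the used heap.
    """
    env = dict(os.environ)
    env["KAFKA_HEAP_OPTS"] = f"-Xmx{heap} -Xms{heap}"
    extra = f"-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath={dump_dir}"
    # Append to any KAFKA_OPTS that is already set instead of overwriting it.
    env["KAFKA_OPTS"] = f"{env.get('KAFKA_OPTS', '')} {extra}".strip()
    return env


env = broker_env()
print("KAFKA_HEAP_OPTS =", env["KAFKA_HEAP_OPTS"])
print("KAFKA_OPTS      =", env["KAFKA_OPTS"])

# To start a broker with these settings (paths assume the standard Kafka layout):
# import subprocess
# subprocess.Popen(["bin/kafka-server-start.sh", "config/server.properties"], env=env)
```

With these flags the JVM writes the dump itself at the moment of the OutOfMemoryError, so it does not depend on attaching jmap to a process that may no longer respond.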

A: No answers yet