Asked by: Gavin Gu · Asked: 11/8/2018 · Last edited by: Gavin Gu · Updated: 11/9/2018 · Views: 882
Kafka broker goes offline due to heap OOM
Q:
We recently found that our Kafka cluster in production went offline. The cluster has four brokers, the replicationFactor is 2, and KAFKA_HEAP_OPTS is -Xmx30G -Xms30G.
server.log:
[2018-10-18 14:12:01,340] WARN [ReplicaFetcherThread-0-3], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@785b3ea5 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2018-10-18 14:12:03,959] INFO [GroupCoordinator 4]: Preparing to restabilize group 8017 with old generation 11 (kafka.coordinator.GroupCoordinator)
[2018-10-18 14:12:03,958] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Java heap space
[2018-10-18 14:12:02,156] ERROR [KafkaApi-4] Error when handling request {replica_id=-1,max_wait_time=100,min_bytes=1,topics=[{topic=**********,partitions=[{partition=9,fetch_offset=3610091552,max_bytes=1048576},{partition=13,fetch_offset=3673665102,max_bytes=1048576},{partition=1,fetch_offset=3685463160,max_bytes=1048576},{partition=10,fetch_offset=3628517926,max_bytes=1048576},{partition=5,fetch_offset=3653905643,max_bytes=1048576}]}]} (kafka.server.KafkaApis)
java.lang.OutOfMemoryError: Java heap space
[2018-10-18 14:12:02,155] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Java heap space
[2018-10-18 14:12:01,342] ERROR [ExpirationReaper-4], Error due to (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
Before that, the cluster had also logged many ZK session expirations over the preceding few hours:
./controller.log.2018-10-18-12:[2018-10-18 12:05:51,300] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-12:[2018-10-18 12:42:43,576] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:00:54,919] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:12:26,598] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:24:22,851] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:29:09,095] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:33:14,948] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:37:45,249] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:43:55,640] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:48:53,711] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:51:29,411] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-13:[2018-10-18 13:57:27,588] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-14:[2018-10-18 14:03:20,452] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
./controller.log.2018-10-18-14:[2018-10-18 14:06:14,026] INFO [SessionExpirationListener on 4], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
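To see how often the ZK session was expiring in the run-up to the OOM, the controller log lines above can be reduced to the gaps between consecutive expirations. The following is only a minimal sketch: the function names `zk_expiry_times` and `gaps_in_minutes` are hypothetical helpers, and the regular expression assumes the log format shown in this post (with or without the file-name prefix in front of each line).

```python
import re
from datetime import datetime

# Assumption: lines look like the controller.log excerpts in this post, e.g.
# "[2018-10-18 12:05:51,300] INFO [SessionExpirationListener on 4], ZK expired; ..."
# An optional "./controller.log.<date>:" prefix is tolerated because we use search().
ZK_EXPIRED = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] INFO .*ZK expired"
)


def zk_expiry_times(lines):
    """Return the timestamps of all 'ZK expired' lines, sorted ascending."""
    times = []
    for line in lines:
        match = ZK_EXPIRED.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return sorted(times)


def gaps_in_minutes(times):
    """Return the minutes elapsed between each pair of consecutive timestamps."""
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python zk_gaps.py < controller.log
    expiries = zk_expiry_times(sys.stdin)
    for when, gap in zip(expiries[1:], gaps_in_minutes(expiries)):
        print(f"{when}  +{gap:.1f} min since previous expiration")
```

If the gaps shrink steadily before the crash, that would be consistent with the broker stalling for longer and longer (for example in GC pauses) as the heap fills, but the script only measures the intervals; it does not establish the cause.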
Could anyone help look into this?
==============================================================
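More details, with a screenshot: something abnormal happened that day, as the Zabbix monitoring of Kafka topic producer volume showed.
The incoming volume for some topics was ten thousand times higher than usual. However, when we checked the logs of the upstream producers, they looked fine and showed a normal production volume.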
A: No answers yet
Comments
HeapDumpOnOutOfMemoryError