YARN ResourceManager stores a large number of znodes for running/old applications in ZooKeeper

Asked by: jessica Asked: 11/16/2023 Last edited by: jessica Updated: 11/17/2023 Views: 58

Q:

We have a production HDP Hadoop cluster with two ResourceManager services in an active/standby setup.

Some details about our cluster:

  1. HDP version - 2.6.5

  2. OS version on the Linux machines - 7.9

  3. Number of NodeManager/DataNode machines - 487

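From the RM logs we can see that the ResourceManager has connection problems with the ZooKeeper server.

After digging into this issue, we saw the following from the ZooKeeper CLI: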

[zk: localhost:2181(CONNECTED) 19] ls /rmstore/ZKRMStateRoot/RMAppRoot

[application_1700036808155_0199, application_1700036808155_0198, application_1700036808155_0195, application_1700036808155_0194, application_1700036808155_0197, application_1700036808155_0196, application_1700036808155_0191, application_1700036808155_0190, application_1700036808155_0193, application_1700036808155_0192, application_1699507654048_0002, application_1699507654048_0001, application_1700036808155_0126, application_1700036808155_0125, application_1700036808155_0128, application_1700036808155_0127, application_1700036808155_0122, application_1700036808155_0121, application_1700036808155_0124, application_1700036808155_0123, application_1698104640063_4149, application_1698104640063_4147, application_1698104640063_4148, application_1700036808155_0129, application_1698104640063_4145, application_1698104640063_4146, application_1698104640063_4154, application_1698104640063_4155, application_1698104640063_4152, application_1698104640063_4153, application_1698104640063_4150, application_1698104640063_4151, application_1700036808155_0120, application_1700036808155_0115, application_1700036808155_0114, application_1700036808155_0117, application_1700036808155_0116, application_1700036808155_0111, application_1700036808155_0110, application_1700036808155_0113, application_1700036808155_0112, application_1698104640063_4158, application_1700036808155_0119, application_1698104640063_4159, application_1700036808155_0118, application_1698104640063_4156, application_1698104640063_4157, application_1698104640063_4165, application_1698104640063_4166, application_1698104640063_4163, application_1698104640063_4164, application_1698104640063_4161, application_1698104640063_4162, application_1698104640063_4160, application_1700036808155_0148, application_1700036808155_0147, application_1700036808155_0149, application_1700036808155_0144, application_1700036808155_0143, application_1700036808155_0146, application_1700036808155_0145, application_1698104640063_4129, 
application_1698104640063_4127, application_1698104640063_4128, application_1698104640063_4125, application_1698104640063_4126, application_1698104640063_4123, application_1698104640063_4124, application_1698104640063_4132, application_1698104640063_4133, application_1698104640063_4130, application_1698104640063_4131, application_1700036808155_0140, application_1700036808155_0142, application_1700036808155_0141, application_1700036808155_0137, application_1700036808155_0136, application_1700036808155_0139, application_1700036808155_0138, application_1700036808155_0133, application_1700036808155_0132, application_1700036808155_0135, application_1700036808155_0134, application_1698104640063_4138, application_1698104640063_4139, application_1698104640063_4136, application_1698104640063_4137, application_1698104640063_4134, application_1698104640063_4135, application_1698104640063_4143, application_1698104640063_4144, application_1698104640063_4141, application_1698104640063_4142, application_1698104640063_4140, application_1700036808155_0131, application_1700036808155_0130, .......

When we ran the ZooKeeper stat command on the same path, we found:

[zk: localhost:2181(CONNECTED) 20] stat /rmstore/ZKRMStateRoot/RMAppRoot
cZxid = 0x10000006b
ctime = Mon Jan 18 20:03:47 UTC 2021
mZxid = 0x10000006b
mtime = Mon Jan 18 20:03:47 UTC 2021
pZxid = 0x44f00082a60
cversion = 1916163
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 10009  <== each of these znodes also has children related to app attempts
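To get a rough picture of how old these entries are, the names from the `ls` output can be grouped by their first numeric field. This is a minimal sketch, assuming the usual YARN application ID layout `application_<clusterTimestamp>_<sequence>`, where the cluster timestamp is the ResourceManager start time in milliseconds. The function names and the shortened sample listing are illustrative only; the full `ls` output would be pasted in instead.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed layout: application_<clusterTimestamp in ms>_<sequence number>
APP_ID = re.compile(r"application_(\d+)_(\d+)")


def group_by_cluster_timestamp(listing):
    """Count application znodes per cluster timestamp found in the listing text."""
    return Counter(int(m.group(1)) for m in APP_ID.finditer(listing))


def describe(counts):
    """Return one summary line per cluster timestamp, oldest first."""
    lines = []
    for ts_ms, n in sorted(counts.items()):
        started = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        stamp = started.strftime("%Y-%m-%d %H:%M")
        lines.append(f"{ts_ms} (RM start {stamp} UTC): {n} znodes")
    return lines


# Shortened sample taken from the listing above.
sample = (
    "[application_1700036808155_0199, application_1700036808155_0198, "
    "application_1699507654048_0002, application_1698104640063_4149]"
)

counts = group_by_cluster_timestamp(sample)
for line in describe(counts):
    print(line)
```

Entries that belong to an earlier cluster timestamp come from a previous ResourceManager start, which may help to tell how much of the 10009 children is historical data.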

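First question:

Given this bad situation, why doesn't ZooKeeper clean up or purge the old data, either by timestamp or perhaps by old RM application ID?

My feeling is that when there is a huge amount of data under /rmstore/ZKRMStateRoot/RMAppRoot, the RM high-availability cluster is unable to read the data under the RMAppRoot znode.

I would appreciate any ideas on how to clean up the old data in ZooKeeper, or on what to set in the ZooKeeper configuration so that old data that is no longer used gets deleted or purged.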
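One thing that may be worth checking is whether the retention of completed applications is limited on the YARN side rather than in ZooKeeper. As far as I understand, Hadoop's yarn-default.xml defines `yarn.resourcemanager.state-store.max-completed-applications`, which falls back to `yarn.resourcemanager.max-completed-applications` when unset; this should be verified against the yarn-default.xml of the Hadoop version in use. The sketch below only reads those two properties from a Hadoop-style configuration file. The helper names and the sample XML value are hypothetical.

```python
import xml.etree.ElementTree as ET

# Property names as I understand them from yarn-default.xml (verify for your version).
STATE_STORE_LIMIT = "yarn.resourcemanager.state-store.max-completed-applications"
MEMORY_LIMIT = "yarn.resourcemanager.max-completed-applications"


def read_properties(xml_text):
    """Parse Hadoop-style <configuration><property> XML into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def explicit_completed_app_limit(props):
    """Return the explicitly configured limit, or None if neither property is set."""
    for key in (STATE_STORE_LIMIT, MEMORY_LIMIT):
        if key in props:
            return int(props[key])
    return None


# Hypothetical yarn-site.xml content; the value 1000 is only an example.
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.max-completed-applications</name>
    <value>1000</value>
  </property>
</configuration>"""

limit = explicit_completed_app_limit(read_properties(SAMPLE))
print(limit)
```

If the function returns None, neither property is set in the file and the built-in default of the installed Hadoop version applies.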

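Second question:

What would be the consequences if I deleted all znodes under /rmstore/ZKRMStateRoot/RMAppRoot/? Can this deletion be done without affecting the functionality of the YARN ResourceManager?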

[zk: localhost:2181(CONNECTED) 10] rmr /rmstore/ZKRMStateRoot/RMAppRoot/*

Other documents that may be related:

https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/content/ref-c2ececdf-c68e-4095-99b5-15b4c31701ba.1.html

https://community.cloudera.com/t5/Support-Questions/How-To-Best-Resolve-RMStateStore-FENCED/td-p/96032

https://blog.csdn.net/qq_42264264/article/details/130827532

linux hadoop hadoop-yarn apache-zookeeper hdp



A: No answers yet