Asked by: Марсель Абдуллин · Asked: 11/17/2023 · Updated: 11/17/2023 · Views: 20
What is the reason for GC overhead in this simple PySpark query?
Q:
I'm running queries against tables stored in the Apache Iceberg format.
DDL of the table (similar for raw and ods):
CREATE TABLE ods.kafka_trbMetaEventTopic_v1 (
  objectId LONG,
  hasSign STRING,
  fileName STRING,
  fileExt STRING,
  created TIMESTAMP,
  tech_timestamp TIMESTAMP,
  tech_raw_timestamp TIMESTAMP,
  tech_date DATE,
  tech_raw_date DATE,
  schema_v_num INT
)
USING iceberg
PARTITIONED BY (tech_date, days(created));
- First I search for the max value; it executes immediately:
spark.sql("SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1").show()
- I substituted this value into the query by hand, and it also executes quickly:
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = '2023-11-13'").show()
- I combined the two queries into one; it hangs on the driver and GC eats up the time (red on the screenshot):
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = (SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1)").count()
It seems like it should first find the max value and then put that value into the filter query (those two operations execute quickly on their own).
[screenshot: Spark UI for the third query, GC time highlighted in red]
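For comparison, the two-step approach described above can be spelled out by hand: fetch the max value to the driver first, then inline it as a literal so the filter value is known at planning time. A minimal sketch, assuming an existing SparkSession named spark with both catalogs configured:

# Step 1: fetch the max partition value to the driver (fast in isolation).
max_date = spark.sql(
    "SELECT coalesce(max(tech_raw_date), '1970-01-01') AS d "
    "FROM ods.kafka_trbMetaEventTopic_v1"
).collect()[0]["d"]

# Step 2: inline the value as a literal so the filter on tech_date
# is a constant at planning time and partitions can be pruned.
df = spark.sql(
    "select objectId, objectId_new, hasSign, fileName, fileExt, created, "
    "tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date "
    f"from raw.kafka_trbMetaEventTopic_v1 where tech_date = '{max_date}'"
)
print(df.count())

Whether the scalar-subquery form gets the same partition pruning depends on the Spark and Iceberg versions in use, so materializing the value on the driver is a common workaround.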
A: No answers yet