What is the reason for GC overhead in a simple PySpark query?

Asked by Марсель Абдуллин · Asked 11/17/2023 · Modified 11/17/2023 · Viewed 20 times

Q:

I am running queries with Apache Iceberg as the table format.

DDL of the table (the raw and ods tables are similar):

CREATE TABLE ods.kafka_trbMetaEventTopic_v1 (
        objectId LONG,
        hasSign STRING,
        fileName STRING,
        fileExt STRING,
        created TIMESTAMP,
        tech_timestamp TIMESTAMP,
        tech_raw_timestamp TIMESTAMP,
        tech_date DATE,
        tech_raw_date DATE,
        schema_v_num INT
    )
    USING iceberg
    PARTITIONED BY (tech_date, days(created));
  1. First I query the max value; it executes instantly:
spark.sql("SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1").show()
  2. I substitute that value into the query by hand, and it also executes quickly:
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = '2023-11-13'").show()
  3. I combine the two queries into one, and it hangs on the driver: GC eats up the time (red on the screenshot):
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = (SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1)").count()

It seems like it should first find the max value and then put that value into the filter query (those two operations execute quickly on their own).

[screenshot: pyspark terminal]

[screenshot: 3 query am]

[screenshot: 3 launch master]
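One common workaround for this pattern (a sketch, not a confirmed fix for this exact case) is to do explicitly what the asker describes: run the cheap aggregate first, pull the scalar to the driver with `collect()`, and inline it into the second query as a literal, so the `tech_date` filter becomes a plain partition predicate instead of a scalar subquery. The `build_filter_query` helper below is a hypothetical name introduced for illustration; `spark` is assumed to be an existing `SparkSession`, and the table names are taken from the question.

```python
def build_filter_query(max_date: str) -> str:
    """Build the second query with the driver-side scalar already
    inlined as a literal, matching the hand-substituted query that
    the asker reports executes quickly."""
    return (
        "SELECT objectId, objectId_new, hasSign, fileName, fileExt, created, "
        "tech_timestamp AS tech_raw_timestamp, tech_date AS tech_raw_date "
        "FROM raw.kafka_trbMetaEventTopic_v1 "
        f"WHERE tech_date = '{max_date}'"
    )

# Step 1: cheap aggregate; returns a single row to the driver.
# max_date = spark.sql(
#     "SELECT coalesce(max(tech_raw_date), '1970-01-01') AS d "
#     "FROM ods.kafka_trbMetaEventTopic_v1"
# ).collect()[0]["d"]
#
# Step 2: the filter is now a literal, so Iceberg can prune the
# tech_date partitions at planning time.
# spark.sql(build_filter_query(str(max_date))).count()
```

This mirrors the two fast standalone queries from steps 1 and 2, just glued together in driver code rather than in a single SQL statement.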

pyspark iceberg



A: No answers yet.