What is the reason for GC overhead in a simple PySpark query?

Asked by Марсель Абдуллин · Asked 11/17/2023 · Modified 11/17/2023 · Viewed 20 times

Q:

I am running queries with Apache Iceberg as the table format.

DDL of the table (the raw and ods tables are similar):

CREATE TABLE ods.kafka_trbMetaEventTopic_v1 (
        objectId LONG,
        hasSign STRING,
        fileName STRING,
        fileExt STRING,
        created TIMESTAMP,
        tech_timestamp TIMESTAMP,
        tech_raw_timestamp TIMESTAMP,
        tech_date DATE,
        tech_raw_date DATE,
        schema_v_num INT
    )
    USING iceberg
    PARTITIONED BY (tech_date, days(created));
  1. First I query the max value; it executes instantly:
spark.sql("SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1").show()
  2. I substitute that value into the query by hand, and it also executes quickly:
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = '2023-11-13'").show()
  3. I combine the two queries into one, and it hangs on the driver: GC eats up the time (red on the screenshot):
spark.sql("select objectId, objectId_new, hasSign, fileName, fileExt, created, tech_timestamp as tech_raw_timestamp, tech_date as tech_raw_date from raw.kafka_trbMetaEventTopic_v1 where tech_date = (SELECT coalesce(max(tech_raw_date), '1970-01-01') FROM ods.kafka_trbMetaEventTopic_v1)").count()

It seems like it should first find the max value and then put that value into the filter query (those two operations execute quickly on their own).

[screenshot: pyspark terminal]

[screenshot: 3 query am]

[screenshot: 3 launch master]
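One common workaround for this pattern (a sketch, not a confirmed fix for this exact case) is to do explicitly what the asker describes: run the cheap aggregate first, pull the scalar to the driver with `collect()`, and inline it into the second query as a literal, so the `tech_date` filter becomes a plain partition predicate instead of a scalar subquery. The `build_filter_query` helper below is a hypothetical name introduced for illustration; `spark` is assumed to be an existing `SparkSession`, and the table names are taken from the question.

```python
def build_filter_query(max_date: str) -> str:
    """Build the second query with the driver-side scalar already
    inlined as a literal, matching the hand-substituted query that
    the asker reports executes quickly."""
    return (
        "SELECT objectId, objectId_new, hasSign, fileName, fileExt, created, "
        "tech_timestamp AS tech_raw_timestamp, tech_date AS tech_raw_date "
        "FROM raw.kafka_trbMetaEventTopic_v1 "
        f"WHERE tech_date = '{max_date}'"
    )

# Step 1: cheap aggregate; returns a single row to the driver.
# max_date = spark.sql(
#     "SELECT coalesce(max(tech_raw_date), '1970-01-01') AS d "
#     "FROM ods.kafka_trbMetaEventTopic_v1"
# ).collect()[0]["d"]
#
# Step 2: the filter is now a literal, so Iceberg can prune the
# tech_date partitions at planning time.
# spark.sql(build_filter_query(str(max_date))).count()
```

This mirrors the two fast standalone queries from steps 1 and 2, just glued together in driver code rather than in a single SQL statement.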

pyspark iceberg



A: No answers yet.