
UserWarning: createDataFrame attempted Arrow optimization in pyspark createDataFrame

Asked by: RunTheGauntlet · Asked: 11/16/2023 · Last edited by: RunTheGauntlet · Updated: 11/16/2023 · Views: 80

Q:

In Azure Databricks with runtime 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12), I am trying to run the following script:

import pandas as pd
example = pd.DataFrame([{'a':[{'b':'c'}]}])
from pyspark.sql.types import *
schema = StructType([
    StructField("a", ArrayType(MapType(StringType(),StringType())), True),
    ])
query_df = spark.createDataFrame(example, schema)
display(query_df)

Executing the code returns the following warning:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:467: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert {'b': 'c'} with type dict: was not a sequence or recognized null for conversion to list type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

Despite many attempts with various schema variants, I keep getting this warning.

a) Why is it trying to convert the dict to a list?

b) How can I make this work with the optimization while keeping a "list of dicts" as the data type in this column?

[Edit] Cluster configuration: Personal Compute, 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12), Standard_DS3_v2. Spark config: spark.databricks.cluster.profile singleNode, spark.master local[*, 4]

azure apache-spark pyspark azure-databricks

Comments

1 · DileeprajnarayanThumula · 11/16/2023
Could you share an image of the cluster configuration?
0 · RunTheGauntlet · 11/16/2023
@DileeprajnarayanThumula Could you link to or describe where to find it? The first page of Google results for "databricks cluster configuration image" is not helpful.
0 · DileeprajnarayanThumula · 11/16/2023
You can find the cluster configuration under Compute, where you can see the clusters.

A:

1 · JayashankarGS · 11/16/2023 · #1

The error you are getting is due to the data types.

These are the data types not supported by the Arrow-based conversion:

MapType, ArrayType of TimestampType, and nested StructType.

So if you provide a MapType and use it for the conversion, you get the error.


The same happens even for a nested StructType:

import pandas as pd
example = pd.DataFrame([{'a':[{'b':{'c':'d'}}]}])
from pyspark.sql.types import *

schema = StructType([
    StructField("a", StructType([
        StructField("b", StructType([
            StructField("c", StringType(), nullable=True)
        ]), nullable=True)
    ]), nullable=True)
])


query_df2 = spark.createDataFrame(example, schema)
query_df2.toPandas()

Error:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:467: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unable to convert the field a. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: Nested StructType not supported in conversion to Arrow
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
/databricks/spark/python/pyspark/sql/pandas/conversion.py:122: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unable to convert the field a. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: Nested StructType not supported in conversion to Arrow
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
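The two configuration keys named in these warnings can also be read or toggled directly on the session. This is only a configuration sketch (it assumes a live `spark` session): disabling the first key avoids the Arrow attempt and the warning entirely, at the cost of always using the slower conversion path.

```python
# Inspect the flags mentioned in the warning (assumes a live Spark session)
spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
spark.conf.get("spark.sql.execution.arrow.pyspark.fallback.enabled")

# Opting out of Arrow silences the warning but uses the slower path
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
```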

Here are the resulting schemas with and without the MapType:

query_df1 = spark.createDataFrame(example)
query_df2 = spark.createDataFrame(example, schema)
query_df2.printSchema()
query_df1.printSchema()

Output:

root
 |-- a: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)

root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: string (nullable = true)

So use a supported schema. For more information, refer to this documentation.