
UserWarning: createDataFrame attempted Arrow optimization in pyspark createDataFrame

Asked by: RunTheGauntlet · Asked: 11/16/2023 · Last edited by: RunTheGauntlet · Updated: 11/16/2023 · Views: 80

Q:

In Azure Databricks with runtime 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12), I am trying to run the following script:

import pandas as pd
example = pd.DataFrame([{'a':[{'b':'c'}]}])
from pyspark.sql.types import *
schema = StructType([
    StructField("a", ArrayType(MapType(StringType(),StringType())), True),
    ])
query_df = spark.createDataFrame(example, schema)
display(query_df)

Executing the code returns the following warning:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:467: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert {'b': 'c'} with type dict: was not a sequence or recognized null for conversion to list type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

Despite many attempts with various schema variants, I keep getting this warning.

a) Why is it trying to convert the dict to a list?

b) How can I make this work with the optimization while keeping a "list of dicts" as the data type in this column?

[Edit] Cluster configuration: Personal Compute, 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12), Standard_DS3_v2. Spark config: spark.databricks.cluster.profile singleNode, spark.master local[*, 4]

azure apache-spark pyspark azure-databricks

Comments

1 · DileeprajnarayanThumula · 11/16/2023
Could you share an image of the cluster configuration?
0 · RunTheGauntlet · 11/16/2023
@DileeprajnarayanThumula Could you link to or describe where to find it? The first page of Google results for "databricks cluster configuration image" is not helpful.
0 · DileeprajnarayanThumula · 11/16/2023
You can find the cluster configuration under Compute, where you can see the clusters.

A:

1 · JayashankarGS · 11/16/2023 · #1

The error you are getting is due to the data types.

These are the data types not supported by the Arrow-based conversion:

MapType, ArrayType of TimestampType, and nested StructType.

So if you provide a MapType and use it for the conversion, you get the error.


The same happens even for a nested StructType:

import pandas as pd
example = pd.DataFrame([{'a':[{'b':{'c':'d'}}]}])
from pyspark.sql.types import *

schema = StructType([
    StructField("a", StructType([
        StructField("b", StructType([
            StructField("c", StringType(), nullable=True)
        ]), nullable=True)
    ]), nullable=True)
])


query_df2 = spark.createDataFrame(example, schema)
query_df2.toPandas()

Error:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:467: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unable to convert the field a. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: Nested StructType not supported in conversion to Arrow
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
/databricks/spark/python/pyspark/sql/pandas/conversion.py:122: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unable to convert the field a. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: Nested StructType not supported in conversion to Arrow
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
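The two configuration keys named in these warnings can also be read or toggled directly on the session. This is only a configuration sketch (it assumes a live `spark` session): disabling the first key avoids the Arrow attempt and the warning entirely, at the cost of always using the slower conversion path.

```python
# Inspect the flags mentioned in the warning (assumes a live Spark session)
spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
spark.conf.get("spark.sql.execution.arrow.pyspark.fallback.enabled")

# Opting out of Arrow silences the warning but uses the slower path
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
```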

Here are the resulting schemas with and without the MapType:

query_df1 = spark.createDataFrame(example)
query_df2 = spark.createDataFrame(example, schema)
query_df2.printSchema()
query_df1.printSchema()

Output:

root
 |-- a: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)

root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: string (nullable = true)

So use a supported schema. For more information, refer to this documentation.