Asked by: RunTheGauntlet Asked: 11/16/2023 Last edited by: RunTheGauntlet Updated: 11/16/2023 Views: 80
UserWarning: createDataFrame attempted Arrow optimization in pyspark createDataFrame
Q:
On Azure Databricks with runtime 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12), I am trying to run the following script:
import pandas as pd
example = pd.DataFrame([{'a':[{'b':'c'}]}])
from pyspark.sql.types import *
schema = StructType([
    StructField("a", ArrayType(MapType(StringType(), StringType())), True),
])
query_df = spark.createDataFrame(example, schema)
display(query_df)
Running the code produces the following warning:
/databricks/spark/python/pyspark/sql/pandas/conversion.py:467: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Could not convert {'b': 'c'} with type dict: was not a sequence or recognized null for conversion to list type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
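Despite many attempts with different schema variants, I keep getting this warning.
a) Why is it trying to convert a dict to a list?
b) How can I make this work with the optimization enabled while keeping a "list of dicts" as the data type of this column?
[Edit] Cluster configuration: Personal Compute, 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12), Standard_DS3_v2. Spark config: spark.databricks.cluster.profile singleNode, spark.master local[*, 4]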
A:
1 upvote
JayashankarGS
11/16/2023
#1
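The warning you are getting is caused by the data types in the schema.
The following data types are not supported by Arrow-based conversion: MapType, ArrayType of TimestampType, and nested StructType.
So if you supply a MapType and convert with it, the Arrow path fails.
The same happens for a nested StructType: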
import pandas as pd
example = pd.DataFrame([{'a':[{'b':{'c':'d'}}]}])
from pyspark.sql.types import *
schema = StructType([
    StructField("a", StructType([
        StructField("b", StructType([
            StructField("c", StringType(), nullable=True)
        ]), nullable=True)
    ]), nullable=True)
])
query_df2 = spark.createDataFrame(example, schema)
query_df2.toPandas()
Warnings:
/databricks/spark/python/pyspark/sql/pandas/conversion.py:467: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Unable to convert the field a. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: Nested StructType not supported in conversion to Arrow
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warn(msg)
/databricks/spark/python/pyspark/sql/pandas/conversion.py:122: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Unable to convert the field a. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Direct cause: Nested StructType not supported in conversion to Arrow
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
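The warnings above name the setting that triggers the Arrow attempt. If the goal is only to avoid the failed attempt for one frame, a minimal sketch is to switch `spark.sql.execution.arrow.pyspark.enabled` off around the call and restore it afterwards. This does not make the conversion Arrow-optimized; it takes the same non-Arrow path that the fallback already uses. The helper name `create_without_arrow` is made up for illustration, and `spark` is assumed to be an active SparkSession.

```python
ARROW_ENABLED = "spark.sql.execution.arrow.pyspark.enabled"


def create_without_arrow(spark, pdf, schema=None):
    """Build a Spark DataFrame from a pandas one with Arrow switched off.

    Hypothetical helper: `spark` is assumed to be an active SparkSession.
    """
    previous = spark.conf.get(ARROW_ENABLED)  # remember the current setting
    spark.conf.set(ARROW_ENABLED, "false")    # skip the Arrow attempt entirely
    try:
        return spark.createDataFrame(pdf, schema)
    finally:
        spark.conf.set(ARROW_ENABLED, previous)  # restore it for other code
```

With the question's objects this would be called as `create_without_arrow(spark, example, schema)`. Restoring the setting in `finally` keeps Arrow enabled for other conversions that do support it.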
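Below are the schemas with MapType (query_df2, using the schema from the question) and without it (query_df1, inferred):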
query_df1 = spark.createDataFrame(example)
query_df2 = spark.createDataFrame(example, schema)
query_df2.printSchema()
query_df1.printSchema()
Output:
root
 |-- a: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: string (nullable = true)
So, use a supported schema. For more information, see this documentation.
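For part (b) of the question, one possible workaround, sketched here and not taken from the answer above, is to keep only Arrow-friendly types on the pandas side: serialize the nested column to a JSON string before `createDataFrame`, then parse it back in Spark with `from_json`. The helper names `encode_nested_column` and `to_spark` are hypothetical, and `spark` is assumed to be an active SparkSession. Whether this is faster than the non-Arrow fallback depends on the data, so it is worth measuring.

```python
import json


def encode_nested_column(records, column):
    """Return copies of `records` with `column` serialized to a JSON string."""
    return [{**row, column: json.dumps(row[column])} for row in records]


def to_spark(spark, encoded_rows, column="a"):
    """Load JSON-encoded rows into Spark and rebuild the nested column.

    Imports are local so encode_nested_column works without Spark installed.
    """
    import pandas as pd
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import ArrayType, MapType, StringType

    pdf = pd.DataFrame(encoded_rows)
    sdf = spark.createDataFrame(pdf)  # the column is a plain string here
    nested = ArrayType(MapType(StringType(), StringType()))
    return sdf.withColumn(column, from_json(col(column), nested))


rows = [{"a": [{"b": "c"}]}]
encoded = encode_nested_column(rows, "a")
# query_df = to_spark(spark, encoded)  # needs a running Spark session
```

The target type passed to `from_json` mirrors the schema from the question, so the resulting column should again be an array of string-to-string maps.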