Asked by: SunflowerParty Asked: 3/22/2023 Last edited by: Dipanjan Mallick, SunflowerParty Updated: 3/24/2023 Views: 219
Parse pyspark array into columns using automated select statement
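Q:
I want to parse the array_col column of my pyspark dataframe into the columns in the list below. I have two dataframes: a schema dataframe that contains the column names I will use, and a second one whose data is formatted as array rows. For example: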
schema:
cols = ['Brand', 'Price', 'Sales', 'Timestamp']
DF:
+----------------------------------------------+
|array_col |
+----------------------------------------------+
|["volvo", "$20K", "2", "12/6/2022 5:15:46 PM"]|
|["bmw", "$40K", "1", "11/6/2022 6:15:76 PM"] |
+----------------------------------------------+
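I can parse the array into its respective columns by hard-coding the number of columns and the column names, as below, but I want to automate my script so that it builds the select statement below for me. That way I can reuse the script for other dataframes without manually coding the number of columns and their names.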
newdf = df.select(df.array_col[0], df.array_col[1], df.array_col[2], df.array_col[3]).toDF(*['Brand', 'Price', 'Sales', 'Timestamp']).display()
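I tried building a string with an f-string to substitute into the df.select statement, but the statement is not a string, so this does not work. Is there a way to generate these values automatically, or is there a better way to parse the array into columns?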
newdf_str = []
for i in range(len(cols)):
    newdf_str.append('newdf.array_col['+str(i)+']')
newdf_str = ', '.join(newdf_str)
Thanks in advance for your help.
A:
0 votes
Hoang Minh Quang FX15045
3/22/2023
#1
You can use "*" to unpack the elements of a list, as shown below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create a SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()
# Original list containing the column names
cols = ['Brand', 'Price', 'Sales', 'Timestamp']
# Create the original dataframe with a single column named `array_col`
df = spark.createDataFrame(
    [
        (["volvo", "$20K", "2", "12/6/2022 5:15:46 PM"],),
        (["bmw", "$40K", "1", "11/6/2022 6:15:76 PM"],),
    ],
    ["array_col"],
)
df.show(truncate=False)
# +--------------------------------------+
# |array_col |
# +--------------------------------------+
# |[volvo, $20K, 2, 12/6/2022 5:15:46 PM]|
# |[bmw, $40K, 1, 11/6/2022 6:15:76 PM] |
# +--------------------------------------+
# Dynamically build one aliased column per name in cols, then unpack the list into select
newdf = df.select(*[col("array_col")[i].alias(name) for i, name in enumerate(cols)])
newdf.show(truncate=False)
# +-----+-----+-----+--------------------+
# |Brand|Price|Sales|Timestamp |
# +-----+-----+-----+--------------------+
# |volvo|$20K |2 |12/6/2022 5:15:46 PM|
# |bmw |$40K |1 |11/6/2022 6:15:76 PM|
# +-----+-----+-----+--------------------+
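A string-based variant, closer to what the question attempted, may also work: DataFrame.selectExpr accepts SQL expression strings, so the expressions can be built in plain Python and unpacked with "*" in the same way. The sketch below is only an illustration. The helper name build_select_exprs is hypothetical (it is not part of PySpark), and it assumes that every array holds at least as many elements as there are names and that the names themselves contain no backticks.

```python
def build_select_exprs(array_col, names):
    """Return one Spark SQL expression string per target column.

    Hypothetical helper for illustration; not a PySpark API.
    Backticks quote the identifiers so that names with spaces still parse.
    """
    return [
        f"`{array_col}`[{i}] AS `{name}`"
        for i, name in enumerate(names)
    ]


cols = ['Brand', 'Price', 'Sales', 'Timestamp']
exprs = build_select_exprs("array_col", cols)
print(exprs[0])  # `array_col`[0] AS `Brand`

# With the df created above, the strings can be handed to Spark:
# newdf = df.selectExpr(*exprs)
```

Because the helper takes the array column name and the list of names as arguments, the same call can be reused for other dataframes by passing a different list.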
Comments
schema dataframe
cols = ['Brand', 'Price', 'Sales', 'Timestamp']