提问人:zenzo 提问时间:11/1/2023 更新时间:11/1/2023 访问量:57
Pyspark - 如何读取带有嵌套数组的 json 作为“column-row”或“key-value”
Pyspark - How to read json with nested arrays as "column-row" or "key-value"
问:
我有一个如下所示的json文件,我需要读取它并生成一个包含人员属性的表。
{
"person":[
[
"name",
"Guy"
],
[
"age",
"25"
],
[
"height",
"2.00"
]
]
}
名字 | 年龄 | 高度 |
---|---|---|
家伙 | 25 | 2.00 |
读取此 json 并输出表的最简单方法和性能方法是什么?
我正在考虑将列表转换为键值对,但由于我正在处理大量数据,因此信息量不足。
由于数据帧中的其他数据,我无法分解它。
答:
1赞
ARCrow
11/1/2023
#1
试试这个:
import pyspark.sql.functions as f
# get the fields that are going to show up for person
# './test_json.json' is the path for the json file.
fields = (
spark.read.option('multiLine', True).json('./test_json.json')
.select(f.expr('transform(person, element -> element[0])').alias('fields'))
.take(1)[0]['fields']
)
print(fields)
df = (
spark.read.option('multiLine', True).json('./test_json.json')
.withColumn('json_string', f.concat(
f.lit('{'),
f.concat_ws(',', f.expr("""transform(person, element -> concat_ws(":", concat("'", element[0], "'"), concat("'", element[1], "'")))""")),
f.lit('}')
)
)
.withColumn('json_content', f.from_json(f.col('json_string'), StructType([StructField(element, StringType(), True) for element in fields])))
.select('json_content.*')
)
df.show(truncate=False)
0赞
user238607
11/1/2023
#2
您可以在分解列后使用透视,如下所示。person
from pyspark import SQLContext
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
sc = SparkContext('local')
sqlContext = SQLContext(sc)
inputFile = "../data/jsonFlattenNested.json"
initialDF = sqlContext.read.json(inputFile)
initialDF.show(n=100, truncate=False)
initialDF = initialDF.withColumn("id", monotonically_increasing_id())
df_exploded = initialDF.withColumn("person_data", explode("person"))
df_exploded.show(n=100, truncate=False)
df_kv = df_exploded.select("id",
col("person_data").getItem(0).alias("attribute"),
col("person_data").getItem(1).alias("value")
)
df_kv.show(n=100, truncate=False)
df_final = df_kv.groupBy("id").pivot("attribute").agg(first("value"))
df_final.show(n=100, truncate=False)
输出 :
+----------------------------------------+
|person |
+----------------------------------------+
|[[name, Guy], [age, 25], [height, 2.00]]|
|[[name, Guy], [age, 25], [height, 2.00]]|
|[[name, Guy], [age, 25], [height, 2.00]]|
+----------------------------------------+
+----------------------------------------+---+--------------+
|person |id |person_data |
+----------------------------------------+---+--------------+
|[[name, Guy], [age, 25], [height, 2.00]]|0 |[name, Guy] |
|[[name, Guy], [age, 25], [height, 2.00]]|0 |[age, 25] |
|[[name, Guy], [age, 25], [height, 2.00]]|0 |[height, 2.00]|
|[[name, Guy], [age, 25], [height, 2.00]]|1 |[name, Guy] |
|[[name, Guy], [age, 25], [height, 2.00]]|1 |[age, 25] |
|[[name, Guy], [age, 25], [height, 2.00]]|1 |[height, 2.00]|
|[[name, Guy], [age, 25], [height, 2.00]]|2 |[name, Guy] |
|[[name, Guy], [age, 25], [height, 2.00]]|2 |[age, 25] |
|[[name, Guy], [age, 25], [height, 2.00]]|2 |[height, 2.00]|
+----------------------------------------+---+--------------+
+---+---------+-----+
|id |attribute|value|
+---+---------+-----+
|0 |name |Guy |
|0 |age |25 |
|0 |height |2.00 |
|1 |name |Guy |
|1 |age |25 |
|1 |height |2.00 |
|2 |name |Guy |
|2 |age |25 |
|2 |height |2.00 |
+---+---------+-----+
+---+---+------+----+
|id |age|height|name|
+---+---+------+----+
|0 |25 |2.00 |Guy |
|1 |25 |2.00 |Guy |
|2 |25 |2.00 |Guy |
+---+---+------+----+
jsonFlattenNested.json使用的输入文件包含以下内容。
{ "person":[ [ "name", "Guy" ], [ "age", "25" ], [ "height", "2.00" ] ]}
{ "person":[ [ "name", "Guy" ], [ "age", "25" ], [ "height", "2.00" ] ]}
{ "person":[ [ "name", "Guy" ], [ "age", "25" ], [ "height", "2.00" ] ]}
评论
0赞
zenzo
11/14/2023
谢谢,我选择了与此类似的解决方案!
评论