Pyspark - 如何读取带有嵌套数组的 json 作为“column-row”或“key-value”-解网

问：

我有一个如下所示的json文件，我需要读取它并生成一个包含人员属性的表。

{
  "person":[
      [
      "name",
      "Guy"
      ],
      [
      "age",
      "25"
      ],
      [
       "height",
       "2.00"
      ]
  ]
}

名字	年龄	高度
家伙	25	2.00

读取此 json 并输出表的最简单方法和性能方法是什么？

我正在考虑将列表转换为键值对，但由于我正在处理大量数据，因此信息量不足。

由于数据帧中的其他数据，我无法分解它。

数组 JSON pyspark 嵌套动态数组

import pyspark.sql.functions as f

# get the fields that are going to show up for person
# './test_json.json' is the path for the json file.

fields = (
    spark.read.option('multiLine', True).json('./test_json.json')
    .select(f.expr('transform(person, element -> element[0])').alias('fields'))
    .take(1)[0]['fields']
)
print(fields)

df = (
    spark.read.option('multiLine', True).json('./test_json.json')
    .withColumn('json_string', f.concat(
            f.lit('{'),
            f.concat_ws(',', f.expr("""transform(person, element -> concat_ws(":", concat("'", element[0], "'"), concat("'", element[1], "'")))""")),
            f.lit('}')
        )
    )
    .withColumn('json_content', f.from_json(f.col('json_string'), StructType([StructField(element, StringType(), True) for element in fields])))
    .select('json_content.*')
)
df.show(truncate=False)

0赞 user238607 11/1/2023 #2

您可以在分解列后使用透视，如下所示。person

from pyspark import SQLContext
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

sc = SparkContext('local')
sqlContext = SQLContext(sc)


inputFile = "../data/jsonFlattenNested.json"

initialDF = sqlContext.read.json(inputFile)
initialDF.show(n=100, truncate=False)

initialDF = initialDF.withColumn("id", monotonically_increasing_id())
df_exploded = initialDF.withColumn("person_data", explode("person"))

df_exploded.show(n=100, truncate=False)
df_kv = df_exploded.select("id",
    col("person_data").getItem(0).alias("attribute"),
    col("person_data").getItem(1).alias("value")
)


df_kv.show(n=100, truncate=False)

df_final = df_kv.groupBy("id").pivot("attribute").agg(first("value"))
df_final.show(n=100, truncate=False)

输出：

+----------------------------------------+
|person                                  |
+----------------------------------------+
|[[name, Guy], [age, 25], [height, 2.00]]|
|[[name, Guy], [age, 25], [height, 2.00]]|
|[[name, Guy], [age, 25], [height, 2.00]]|
+----------------------------------------+

+----------------------------------------+---+--------------+
|person                                  |id |person_data   |
+----------------------------------------+---+--------------+
|[[name, Guy], [age, 25], [height, 2.00]]|0  |[name, Guy]   |
|[[name, Guy], [age, 25], [height, 2.00]]|0  |[age, 25]     |
|[[name, Guy], [age, 25], [height, 2.00]]|0  |[height, 2.00]|
|[[name, Guy], [age, 25], [height, 2.00]]|1  |[name, Guy]   |
|[[name, Guy], [age, 25], [height, 2.00]]|1  |[age, 25]     |
|[[name, Guy], [age, 25], [height, 2.00]]|1  |[height, 2.00]|
|[[name, Guy], [age, 25], [height, 2.00]]|2  |[name, Guy]   |
|[[name, Guy], [age, 25], [height, 2.00]]|2  |[age, 25]     |
|[[name, Guy], [age, 25], [height, 2.00]]|2  |[height, 2.00]|
+----------------------------------------+---+--------------+

+---+---------+-----+
|id |attribute|value|
+---+---------+-----+
|0  |name     |Guy  |
|0  |age      |25   |
|0  |height   |2.00 |
|1  |name     |Guy  |
|1  |age      |25   |
|1  |height   |2.00 |
|2  |name     |Guy  |
|2  |age      |25   |
|2  |height   |2.00 |
+---+---------+-----+

+---+---+------+----+
|id |age|height|name|
+---+---+------+----+
|0  |25 |2.00  |Guy |
|1  |25 |2.00  |Guy |
|2  |25 |2.00  |Guy |
+---+---+------+----+

jsonFlattenNested.json使用的输入文件包含以下内容。

{  "person":[      [      "name",      "Guy"      ],      [      "age",      "25"      ],      [       "height",       "2.00"      ]  ]}
{  "person":[      [      "name",      "Guy"      ],      [      "age",      "25"      ],      [       "height",       "2.00"      ]  ]}
{  "person":[      [      "name",      "Guy"      ],      [      "age",      "25"      ],      [       "height",       "2.00"      ]  ]}

Pyspark - 如何读取带有嵌套数组的 json 作为“column-row”或“key-value”

Pyspark - How to read json with nested arrays as "column-row" or "key-value"

评论

评论