How to parse relaxed JSON in pyspark with a "dynamic" schema

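Q:

I have a dataset in which one column contains a string that looks like "relaxed" JSON (the keys are not wrapped in double quotes).

#1 I am looking for a way to parse it in pyspark. I tried from_json with a schema, but because the string is not valid JSON this does not work (all nested columns come back as NULL).

#2 The keys in the JSON column can change depending on which questions the user answered. In the example below they are question5 and question7, but in practice these keys vary, and the structure of the values varies with them. How can this "dynamic" format be handled? Could someone share an example?

My dataset looks like this: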

customer_id	json_data
1	{ metadata: { version: 1. }, attributes: { question5: [ { provenance: { type: "confirmed", value: { origin: "ABC", timestamp: "2010-07-12T23:00:51Z" } }, value: { type: "struct", value: { response: { type: "list", value: [ { type: "stringValue", value: "import value 1" } ] }, category: { type: "stringValue", value: "job" } } } }, { provenance: { type: "confirmed", value: { origin: "ABC", timestamp: "2010-07-12T23:00:51Z" } }, value: { type: "struct", value: { response: { type: "list", value: [ { type: "stringValue", value: "address 1" }, { type: "stringValue", value: "address 2" } ] }, category: { type: "stringValue", value: "address" } } } } ], question7: [ { provenance: { type: "confirmed", value: { origin: "XYZ", timestamp: "2010-07-12T23:00:51Z" } }, value: { type: "stringValue", value: "yes" } } ] } }

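I was able to parse the dataset in Python using a series of json_normalize calls inside for loops. However, the tool I am using requires pyspark, hence the questions above. Thank you very much for your time and help!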
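One possible direction, shown only as a minimal plain-Python sketch and not as a tested pyspark solution: quote the bare keys so the string becomes strict JSON, then walk `attributes` without knowing the question keys in advance. The helper names `relaxed_to_strict` and `flatten_answers` are hypothetical, and so is the choice of output columns. The sketch assumes keys are simple identifiers, string values are already double-quoted, and a trailing-dot number such as `1.` should be read as `1.0`.

```python
import json
import re

# A double-quoted string literal, including escaped characters.
_STRING = re.compile(r'("(?:[^"\\]|\\.)*")')
# A bare identifier used as an object key: after `{` or `,` and before `:`.
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)')
# A number with a trailing dot such as `1.`, which strict JSON rejects.
_TRAILING_DOT = re.compile(r'(\d+)\.(?=\s*[,}\]])')


def relaxed_to_strict(text):
    """Quote bare keys and pad trailing-dot numbers, leaving string values untouched."""
    parts = _STRING.split(text)
    # Even indices lie outside string literals; odd indices are the literals.
    for i in range(0, len(parts), 2):
        part = _BARE_KEY.sub(r'\1"\2"\3', parts[i])
        parts[i] = _TRAILING_DOT.sub(r'\1.0', part)
    return "".join(parts)


def flatten_answers(doc):
    """Return one row per answer, whatever the question keys happen to be."""
    rows = []
    for question, entries in doc.get("attributes", {}).items():
        for entry in entries:
            provenance = entry["provenance"]["value"]
            rows.append({
                "question": question,
                "origin": provenance["origin"],
                "timestamp": provenance["timestamp"],
                "value_type": entry["value"]["type"],
                # The shape of the value differs per question, so keep it as JSON text.
                "value_json": json.dumps(entry["value"]["value"]),
            })
    return rows


# Shortened version of the sample row above (question7 only).
sample = (
    '{ metadata: { version: 1. }, attributes: { question7: [ '
    '{ provenance: { type: "confirmed", value: { origin: "XYZ", '
    'timestamp: "2010-07-12T23:00:51Z" } }, '
    'value: { type: "stringValue", value: "yes" } } ] } }'
)

doc = json.loads(relaxed_to_strict(sample))
rows = flatten_answers(doc)
print(rows)
```

In pyspark, `relaxed_to_strict` could presumably be wrapped with `pyspark.sql.functions.udf` and its output passed to `from_json`, with `attributes` declared as a `MapType` keyed by string so that the question names do not have to appear in the schema. That part is a suggestion and has not been run against the dataset here.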

python json pyspark fromjson convertfrom-json
