Asked by: dja  Asked: 12/4/2022  Last edited by: Jason Allerdja  Updated: 12/11/2022  Views: 72
Convert HTML to JSON using rdd.map
Q:
I have HTML files that I want to parse in PySpark.
Example:
<MainStruct Rank="1">
  <Struct Name="A">
    <Struct Name="AA">
      <Struct Name="AAA">
        <Field Name="F1">Data</Field>
      </Struct>
      <Struct Name="ListPart">
        <List Name="ListName">
          <Struct Name="S1">
            <Field Name="F1">AAA</Field>
            <Field Name="F2">BBB</Field>
            <Field Name="F3">CCC</Field>
          </Struct>
          <Struct Name="S1">
            <Field Name="F1">XXX</Field>
            <Field Name="F2">GGG</Field>
            <Field Name="F3">BBB</Field>
          </Struct>
        </List>
      </Struct>
    </Struct>
  </Struct>
</MainStruct>
rdd_html = spark.sparkContext.wholeTextFiles(path_to_XML, minPartitions=1000, use_unicode=True)
df_html = spark.createDataFrame(rdd_html,['filename', 'content'])
rdd_map = df_html.rdd.map(lambda x: xmltodict(x['content'],'mainstruct'))
df_map = spark.createDataFrame(rdd_map)
df_map.display()
But in my notebook output I have a problem with the list elements. They are not parsed into nested objects.
>object
>AA:
>ListPart:
ListName: "[{S1={F1=AAA, F2=BBB, F3=CCC}}, {S1={F1=XXX, F2=GGG, F3=BBB}}]"
>AAA:
F1: "Data"
The list element is represented as a single-line string.
My function to parse it:
import json
import re
import unicodedata

from bs4 import BeautifulSoup


def xmltodict(content, first_tag=''):
    # content: raw text of one XML/HTML file
    content = re.sub(r'[\r\n]', '', content)
    content = re.sub(r'>\s+<', '><', content)
    data = unicodedata.normalize('NFKD', content)
    # The lxml HTML parser lowercases tag and attribute names
    # and wraps the input in <html><body>
    soup = BeautifulSoup(data, 'lxml')
    body = soup.find('body')
    if first_tag.strip() != '':
        struct = body.find(first_tag)
    else:
        struct = body
    return parser(struct)


def parser(struct):
    # Direct child tags only; deeper levels are handled by recursion
    struct_all = struct.find_all(True, recursive=False)
    struct_dict = {}
    for strc in struct_all:
        tag = strc.name
        tag_name_prop = strc.attrs['name']
        if tag == 'struct':
            struct_dict[tag_name_prop] = parser(strc)
        elif tag == 'field':
            struct_dict[tag_name_prop] = strc.text
        elif tag == 'list':
            l_elem = []
            for child in strc.contents:
                # Re-parse each list item so parser() sees it as the only child of <body>
                soup_child = BeautifulSoup(str(child), 'lxml').find('body')
                l_elem.append(parser(soup_child))
            struct_dict[tag_name_prop] = l_elem
    # Debug dump; every recursive call overwrites the file,
    # so the outermost call's result is what remains
    with open('result.txt', 'w') as file:
        file.write(json.dumps(struct_dict))
    return struct_dict
The result in the txt file is what I want to receive:
{
  "A": {
    "AA": {
      "AAA": {"F1": "Data"},
      "ListPart": {
        "ListName": [
          {"S1": {"F1": "AAA", "F2": "BBB", "F3": "CCC"}},
          {"S1": {"F1": "XXX", "F2": "GGG", "F3": "BBB"}}
        ]
      }
    }
  }
}
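The nested shape above can be checked without Spark. The following is only a minimal sketch: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree, and it assumes the input is well-formed XML (root closed with a matching MainStruct tag), that tag and attribute names keep their original case, and that every list item is a Struct. The names parse_element and SAMPLE are illustrative and not part of the question's code.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, with the root tag closed consistently (assumption).
SAMPLE = """
<MainStruct Rank="1">
  <Struct Name="A">
    <Struct Name="AA">
      <Struct Name="AAA">
        <Field Name="F1">Data</Field>
      </Struct>
      <Struct Name="ListPart">
        <List Name="ListName">
          <Struct Name="S1">
            <Field Name="F1">AAA</Field>
            <Field Name="F2">BBB</Field>
            <Field Name="F3">CCC</Field>
          </Struct>
          <Struct Name="S1">
            <Field Name="F1">XXX</Field>
            <Field Name="F2">GGG</Field>
            <Field Name="F3">BBB</Field>
          </Struct>
        </List>
      </Struct>
    </Struct>
  </Struct>
</MainStruct>
"""


def parse_element(elem):
    """Convert the child elements of elem into a dict keyed by their Name attribute."""
    result = {}
    for child in elem:
        name = child.attrib["Name"]
        if child.tag == "Struct":
            result[name] = parse_element(child)
        elif child.tag == "Field":
            result[name] = child.text
        elif child.tag == "List":
            # One single-key dict per list item, e.g. {"S1": {...}}; assumes Struct items.
            result[name] = [
                {item.attrib["Name"]: parse_element(item)} for item in child
            ]
    return result


parsed = parse_element(ET.fromstring(SAMPLE.strip()))
```

Here parsed["A"]["AA"]["ListPart"]["ListName"] is a real Python list of dicts, so the parsing step itself does not flatten the list.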
Why is the list in the notebook output represented as a single-line string? And why does it use "=" instead of ":"?
A:
0 votes
dja
12/4/2022
#1
I reduced the problem to this:
from pyspark.sql import Row


def parseList(row):
    # Ignores the input row and returns a fixed nested value
    d = {}
    d['el1'] = 'AAA'
    l = [{'x1': 'XA'}, {'x1': 'XB'}]
    d['el2'] = l
    return Row(res=d)


rdd_html = spark.sparkContext.wholeTextFiles(path_to_file_test, minPartitions=1000, use_unicode=True)
df_html = spark.createDataFrame(rdd_html, ['filename', 'content'])
rdd_map = df_html.rdd.map(parseList)
df_map = spark.createDataFrame(rdd_map)
df_map.display()
As a result I also get
>object
el2: "[{x1=XA}, {x1=XB}]"
el1: "AAA"
rather than this:
>object
>el2
x1:"XA"
x1:"XB"
el1: "AAA"
0 votes
dja
12/11/2022
#2
I finally solved my problem. The cause was that I needed to define a schema and pass it to createDataFrame:
df_map = spark.createDataFrame(rdd_map,schema)
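The schema itself is not shown in the answer, so the following is only a sketch of what it might look like for the simplified example. A likely explanation, stated here as an assumption rather than something the answer confirms: without a schema, Spark infers a dict column as a map with a single value type, so a list sitting next to string values gets coerced to a string, rendered Java-style with "=". The helper names ddl_type and to_dataframe are hypothetical. The sketch relies on createDataFrame accepting a schema given as a data type string in simpleString format, and it assumes every leaf is text, lists are non-empty and homogeneous, and field names need no backtick quoting.

```python
def ddl_type(value):
    """Return a Spark data type string describing one sample nested value."""
    if isinstance(value, dict):
        fields = ",".join(f"{key}:{ddl_type(val)}" for key, val in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Assumes a non-empty list whose items all share the first item's shape.
        return f"array<{ddl_type(value[0])}>"
    return "string"  # assumption: every leaf in the example is text


# One representative row of the simplified example: Row(res=d) seen as a dict.
SAMPLE_ROW = {"res": {"el1": "AAA", "el2": [{"x1": "XA"}, {"x1": "XB"}]}}
SCHEMA_DDL = ddl_type(SAMPLE_ROW)


def to_dataframe(spark, rdd_map, schema_ddl=SCHEMA_DDL):
    """Build the DataFrame with an explicit schema instead of relying on inference."""
    return spark.createDataFrame(rdd_map, schema=schema_ddl)
```

With a SparkSession in scope, to_dataframe(spark, rdd_map) would play the role of the createDataFrame call above. Deriving the string from a sample is a convenience only; a hand-written StructType built from pyspark.sql.types expresses the same thing and is safer when rows differ in shape.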