Convert HTML to JSON using rdd.map

Asked by: dja  Asked: 12/4/2022  Last edited by: Jason Aller  Updated: 12/11/2022  Views: 72

Q:

I have HTML files that I want to parse in PySpark.

Example:

<MainStruct Rank="1">
  <Struct Name="A">
    <Struct Name="AA">
      <Struct Name="AAA">
        <Field Name="F1">Data</Field>
      </Struct>
      <Struct Name="ListPart">
        <List Name="ListName">
          <Struct Name="S1">
            <Field Name="F1">AAA</Field>
            <Field Name="F2">BBB</Field>
            <Field Name="F3">CCC</Field>
          </Struct>
          <Struct Name="S1">
            <Field Name="F1">XXX</Field>
            <Field Name="F2">GGG</Field>
            <Field Name="F3">BBB</Field>
          </Struct>
        </List>
      </Struct>
    </Struct>
  </Struct>
</MainStruct>

rdd_html = spark.sparkContext.wholeTextFiles(path_to_XML, minPartitions=1000, use_unicode=True)
df_html = spark.createDataFrame(rdd_html,['filename', 'content'])
rdd_map = df_html.rdd.map(lambda x: xmltodict(x['content'],'mainstruct'))
df_map = spark.createDataFrame(rdd_map)

df_map.display()

However, in my notebook output there is a problem with the list elements. They are not parsed into a nested structure.

>object
     >AA: 
       >ListPart: 
         ListName: "[{S1={F1=AAA, F2=BBB, F3=CCC}}, {S1={F1=XXX, F2=GGG, F3=BBB}}]"
     >AAA: 
        F1: "Data"

The List elements are rendered as a single-line string.

My functions to parse it:

import json
import re
import unicodedata

from bs4 import BeautifulSoup


def xmltodict(content, first_tag=''):
    # Content from the XML file: drop line breaks and whitespace between tags
    content = re.sub(r'\n', '', content)
    content = re.sub(r'\r', '', content)
    content = re.sub(r'>\s+<', '><', content)

    data = unicodedata.normalize('NFKD', content)
    # The lxml HTML parser lowercases tag and attribute names
    soup = BeautifulSoup(data, 'lxml')

    body = soup.find('body')

    if first_tag.strip() != '':
        struct = body.find(first_tag)
    else:
        struct = body

    return parser(struct)


def parser(struct):
    # Direct child tags only
    struct_all = struct.find_all(True, recursive=False)

    struct_dict = {}
    for strc in struct_all:
        tag = strc.name
        tag_name_prop = strc.attrs['name']

        if tag == 'struct':
            struct_dict[tag_name_prop] = parser(strc)
        elif tag == 'field':
            struct_dict[tag_name_prop] = strc.text
        elif tag == 'list':
            l_elem = []
            for child in strc.contents:
                # Wrap each list item in its own <body> so parser() sees it as a child
                soap_child = BeautifulSoup(str(child), 'lxml').find('body')
                l_elem.append(parser(soap_child))
            struct_dict[tag_name_prop] = l_elem

    # Debug dump; the file is overwritten on every (recursive) call
    with open('result.txt', 'w') as file:
        file.write(json.dumps(struct_dict))

    return struct_dict
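
For comparison, the same Struct/Field/List traversal can be sketched with the standard library only. This is not the code from the question: it swaps BeautifulSoup for `xml.etree.ElementTree`, assumes the input is well-formed XML, and assumes every List item is a Struct. `SAMPLE` and `element_to_dict` are hypothetical names used for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Shortened, well-formed sample in the same shape as the question's input
SAMPLE = """
<MainStruct Rank="1">
  <Struct Name="A">
    <Struct Name="AAA">
      <Field Name="F1">Data</Field>
    </Struct>
    <List Name="ListName">
      <Struct Name="S1"><Field Name="F1">AAA</Field></Struct>
      <Struct Name="S1"><Field Name="F1">XXX</Field></Struct>
    </List>
  </Struct>
</MainStruct>
"""


def element_to_dict(parent):
    """Convert the Struct/Field/List children of `parent` into a nested dict."""
    result = {}
    for child in parent:
        name = child.attrib["Name"]
        if child.tag == "Struct":
            result[name] = element_to_dict(child)
        elif child.tag == "Field":
            result[name] = child.text
        elif child.tag == "List":
            # Assumption: each list item is a Struct; keep one dict per item
            result[name] = [
                {item.attrib["Name"]: element_to_dict(item)} for item in child
            ]
    return result


root = ET.fromstring(SAMPLE.strip())
parsed = element_to_dict(root)
print(json.dumps(parsed, indent=2))
```

Unlike the lxml HTML parser, ElementTree keeps the original case of tag and attribute names, so the lookups use `Struct` and `Name` rather than `struct` and `name`.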

The result in the txt file is what I want to get:

{
    "A": {
        "AA": {
            "AAA": {"F1": "Data"},
            "ListPart": {
                "ListName": [
                    {
                        "S1": {
                            "F1": "AAA",
                            "F2": "BBB",
                            "F3": "CCC"
                        }
                    },
                    {
                        "S1": {
                            "F1": "XXX",
                            "F2": "GGG",
                            "F3": "BBB"
                        }
                    }
                ]
            }
        }
    }
}

Why is the list represented as a single-line string? Why are there "=" signs instead of ":"?

pyspark xml html parsing rdd


A:

0 votes dja 12/4/2022 #1

I reduced the problem to this:

from pyspark.sql import Row


def parseList(row):
    # Ignore the file content and return a fixed dict with a nested list
    d = {}
    d['el1'] = 'AAA'
    l = [{'x1': 'XA'}, {'x1': 'XB'}]
    d['el2'] = l
    return Row(res=d)


rdd_html = spark.sparkContext.wholeTextFiles(path_to_file_test, minPartitions=1000, use_unicode=True)
df_html = spark.createDataFrame(rdd_html, ['filename', 'content'])
rdd_map = df_html.rdd.map(parseList)
df_map = spark.createDataFrame(rdd_map)
df_map.display()

With this I get the same kind of result:

>object
    el2: "[{x1=XA}, {x1=XB}]"
    el1: "AAA"

instead of this:

>object
   >el2 
     x1:"XA"
     x1:"XB"
    el1: "AAA"

0 votes dja 12/11/2022 #2

I finally solved my problem. The cause was that I needed to define a schema and pass it explicitly.

df_map = spark.createDataFrame(rdd_map,schema)