Parsing .edi file with pyspark

Asked by: JaniH  Asked: 10/13/2023  Updated: 10/14/2023  Views: 51

Q:

I am trying to parse an .edi file with PySpark. I load the file into spark_df with the following command:

spark_df = spark.read.csv(adls_path)

I get:

(screenshot of the resulting DataFrame, not reproduced here)

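How can I parse the _c0 column with PySpark so that each element of the two rows ends up on its own row? (The column is a str and the separator is '.)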

Tags: string, dataframe, parsing, pyspark, edi


A:

0 votes  user238607  10/14/2023  #1

You can use the following Python library to parse an EDI file into a string representation.

https://pypi.org/project/pydifact/

pip install pydifact

Below is a simple code example that parses EDI files into their string representation.

The sample EDI files come from: https://github.com/smooks/unedifact-examples/tree/master/splitting-camel/sample-data

import pathlib

from pydifact.segmentcollection import Interchange

# Directory containing the sample .edi files
input_dir = "../data/edi-files"

for edi_file in pathlib.Path(input_dir).glob('*.edi'):
    print("EDI file processing", edi_file)
    interchange = Interchange.from_file(str(edi_file))

    # Print the tag and elements of every segment in every message
    for message in interchange.get_messages():
        for segment in message.segments:
            print('Segment tag: {}, content: {}'.format(
                segment.tag, segment.elements))

    print("File processing", edi_file, "ended")

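You can wrap the code above in a function, register it as a Python UDF, and call it on the _c0 column to get the raw string representation.

Example of calling a simple Python function as a UDF: https://stackoverflow.com/a/34804340/3238085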
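As a rough illustration of the UDF approach, here is a minimal sketch that does not use pydifact at all. It assumes the whole EDIFACT text sits in one string column, that ' is the segment terminator, and that ? is the release (escape) character, which are the usual UN/EDIFACT defaults. The names split_segments and explode_segments are made up for this example. pyspark is imported inside the second function so that the splitter can be tried out without a Spark installation.

```python
from typing import List, Optional


def split_segments(raw: Optional[str], terminator: str = "'",
                   release: str = "?") -> List[str]:
    """Split a raw EDIFACT string into segments.

    A terminator that follows the release character is kept as literal
    text. Surrounding whitespace (e.g. newlines between segments) is
    stripped and empty segments are dropped.
    """
    segments, current, escaped = [], [], False
    for ch in raw or "":
        if escaped:
            # Character after the release character is always literal
            current.append(ch)
            escaped = False
        elif ch == release:
            current.append(ch)
            escaped = True
        elif ch == terminator:
            segment = "".join(current).strip()
            if segment:
                segments.append(segment)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        segments.append(tail)
    return segments


def explode_segments(df, column: str = "_c0"):
    """Return a DataFrame with one row per segment found in `column`."""
    # Lazy import: only needed when a Spark DataFrame is actually passed in
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    split_udf = F.udf(split_segments, ArrayType(StringType()))
    return df.select(F.explode(split_udf(F.col(column))).alias("segment"))
```

In a notebook this would be used as explode_segments(spark_df). Note that spark.read.csv also splits on commas and newlines, so reading the file with spark.read.text(adls_path, wholetext=True) and passing column="value" may preserve the content better; that is a suggestion, not something the original post tested.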

The following Java libraries do the same thing as the Python library above.

First:

https://github.com/smooks/smooks/tree/master
Maven coordinates : org.smooks:smooks-core:2.0.0-RC2
Allows you to convert : EDI TO XML and then from XML to CSV

Second:

https://github.com/BerryWorksSoftware/edireader
Maven Coordinates : com.berryworks:edireader:5.6.4
Allows you to convert EDI TO XML

Third:

https://github.com/BerryWorksSoftware/edi-json/tree/master/repo/com/berryworks/edireader-json-basic/5.6.2
You can download the jar from this location
Allows you to convert EDI TO JSON 

If you decide to use the jars above, the following shows how to call a Java function from those jars in PySpark:

Running custom Java class in PySpark
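For orientation, a minimal sketch of reaching a Java class from the PySpark driver through the Py4J gateway. It relies on spark.sparkContext._jvm, which is a private attribute and may change between versions. The class name com.example.edi.EdiTool and its static method toJson are hypothetical placeholders, not the API of any of the libraries listed above; substitute the real class once the jar is on the classpath (for example via the spark.jars setting).

```python
from functools import reduce


def resolve_jvm_class(spark, class_name: str):
    """Walk the dotted class name on the driver's Py4J JVM view."""
    jvm = spark.sparkContext._jvm  # private attribute, driver side only
    return reduce(getattr, class_name.split("."), jvm)


def call_static(spark, class_name: str, method: str, *args):
    """Call a static Java method by name and return its result."""
    return getattr(resolve_jvm_class(spark, class_name), method)(*args)


# Hypothetical usage, assuming such a class exists in the loaded jar:
# json_text = call_static(spark, "com.example.edi.EdiTool", "toJson", edi_text)
```

Calls made this way run on the driver only, so they are suited to converting a file or a collected string rather than being applied row by row inside a Python UDF.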

Output of the Python script at the top:

EDI file processing ../data/edi-files/DESADV.edi
Segment tag: BGM, content: ['351', '19960445', '4', 'NA']
Segment tag: DTM, content: [['137', '199610180800', '203']]
Segment tag: DTM, content: [['69', '19961020', '102']]
Segment tag: RFF, content: [['ON', '1996100001']]
Segment tag: NAD, content: ['BY', ['7080000043217', '', '9']]
Segment tag: NAD, content: ['SU', ['7080000083121', '', '9']]
Segment tag: RFF, content: [['VA', 'FORETAKSREGISTERET NO987654321MVA']]
Segment tag: CTA, content: ['AD', ['', 'Hans Hansen']]
Segment tag: NAD, content: ['DP', ['7080000083122', '', '9']]
Segment tag: TOD, content: ['4', '', 'DD2']
Segment tag: CPS, content: ['1']
Segment tag: PAC, content: ['1', ['', '50'], '201']
Segment tag: MEA, content: ['PD', ['AAD', '3'], ['KGM', '2']]
Segment tag: HAN, content: [['FTD', '', '9']]
Segment tag: PCI, content: ['30E']
Segment tag: GIN, content: ['SS', '170325200000000185']
Segment tag: LIN, content: ['1', '', ['7037660000197', 'EN']]
Segment tag: PIA, content: ['1', ['12345', 'SA', '', '91']]
Segment tag: IMD, content: ['C', '', 'TU']
Segment tag: IMD, content: ['F', '', ['', '', '', 'HVETEMEL']]
Segment tag: QTY, content: [['12', '14']]
Segment tag: QTY, content: [['59', '6']]
Segment tag: RFF, content: [['ON', '19961198']]
Segment tag: PCI, content: ['30E']
Segment tag: DTM, content: [['137', '199610180800', '203']]
Segment tag: CNT, content: [['2', '1']]
File processing ../data/edi-files/DESADV.edi ended
EDI file processing ../data/edi-files/invoic-d93a.edi
Segment tag: BGM, content: ['380', '891206500']
Segment tag: DTM, content: [['137', '20100926', '102']]
Segment tag: NAD, content: ['II', ['SSESDL', '', '87']]
Segment tag: RFF, content: [['VA', 'SE5562503630']]
Segment tag: RFF, content: [['GN', '00075562503630']]
Segment tag: NAD, content: ['IV', ['33426776', '', '87']]
Segment tag: RFF, content: [['VA', '  SE5565268538']]
Segment tag: NAD, content: ['PE', '', ['SCHENKER AB', '412 97 GÖTEBORG']]
Segment tag: RFF, content: [['BGI', '9423047']]
Segment tag: RFF, content: [['PGI', '9423047']]
Segment tag: CUX, content: [['2', 'SEK', '10'], ['3', 'SEK', '11']]
Segment tag: PAT, content: ['3', '', '66']
Segment tag: DTM, content: [['13', '20101006', '102']]
Segment tag: PAT, content: ['20', '', ['66', '', 'M']]
Segment tag: PCD, content: [['15', '1.8', '13']]
Segment tag: LIN, content: ['1']
Segment tag: MEA, content: ['PD', 'AAD', ['KGM', '177']]
Segment tag: MEA, content: ['PD', 'VOL', ['MTQ', '0.864']]
Segment tag: QTY, content: [['100', '288', 'KGM']]
Segment tag: DTM, content: [['143', '20100927', '102']]
Segment tag: MOA, content: [['203', '1736']]
Segment tag: RFF, content: [['FF', 'SDL3116575']]
Segment tag: RFF, content: [['AAS', 'DDT 38']]
Segment tag: PAC, content: ['1']
Segment tag: LOC, content: ['5', ['20060', '16', '', 'MILANO']]
Segment tag: LOC, content: ['8', ['88152', '16', '', 'STOCKHOLM']]
Segment tag: LOC, content: ['35', ['IT', '162']]
Segment tag: LOC, content: ['28', ['SE', '162']]
Segment tag: NAD, content: ['CN', '', '', 'TT THERMOTECH SCANDINAVIA AB', ['BOX 69', 'NIPAN 59'], '881 22  SOLLEFTEÅ', '', '88122']
Segment tag: NAD, content: ['DP', '', '', 'TT THERMOTECH SCANDINAVIA', 'NIPAN 59', 'SOLLEFTEÅ', '', '88152']
Segment tag: ALC, content: ['C', '', '6', '', ['553', '', '87', 'FRAKT']]
Segment tag: MOA, content: [['8', '1233']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '308.25']]
Segment tag: ALC, content: ['C', '', '6', '', ['586', '', '87', 'VÄGSKATT TYSKLAND']]
Segment tag: MOA, content: [['8', '27']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '6.75']]
Segment tag: ALC, content: ['C', '', '6', '', ['735', '', '87', 'EXPEDITIONSAVGIFT INFÖRSEL']]
Segment tag: MOA, content: [['8', '260']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '65']]
Segment tag: ALC, content: ['C', '', '6', '', ['572', '', '87', 'DRIVMEDELSJUSTERING']]
Segment tag: MOA, content: [['8', '162']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '40.5']]
Segment tag: ALC, content: ['C', '', '6', '', ['573', '', '87', 'VALUTAJUSTERING']]
Segment tag: MOA, content: [['8', '54']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['124', '13.5']]
Segment tag: TDT, content: ['20', '', '3', '', '', '', '', ['', '', '', 'TCO4040']]
Segment tag: UNS, content: ['S']
Segment tag: MOA, content: [['9', '2170']]
Segment tag: MOA, content: [['125', '1736']]
Segment tag: MOA, content: [['176', '434']]
Segment tag: TAX, content: ['7', 'VAT', '', '', ['', '', '', '25'], 'S']
Segment tag: MOA, content: [['125', '1736']]
Segment tag: MOA, content: [['176', '434']]
File processing ../data/edi-files/invoic-d93a.edi ended

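Comments

0 votes  JaniH  10/16/2023
I am using an Azure Synapse notebook, so it would be best to work with Spark DataFrames. Can this be done with Spark?
0 votes  user238607  10/16/2023
Yes, you can wrap any Python function in a UDF and call that UDF on your column.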