Need help to extract following data

Asked by: prudhvi  Asked: 10/27/2023  Last edited by: Tim Roberts, prudhvi  Updated: 10/27/2023  Views: 46

Q:

Input text:

05/08/04 OUTPT LABORATORY 35.00 35.00 35.00 0.00 0.00 0.00
80053
05/10/04 OFFICE MEDICAL 14.00 9.50 0.00 0.00 9.50 0.00 1
84436
05/10/04- HOME MED EQUIP 32.00 32.00 32.00 0.00 0.00 0.00
05/13/04 A4595RR
05/10/04- HOME MED EQUIP 10.00 3.75 0.00 0.00 3.75 0.00 1
05/13/04 L3800RR
05/14/04 PHYSIOTHERAPY 23.00 23.00 7.00 0.00 16.00 0.00
97110
05/14/04 PHYSIOTHERAPY 14.00 9.00 0.00 0.00 9.00 0.00 1
97140

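The expected output format in Excel is as follows:

Service Date        Type of Service   Procedure Number  Amount Billed  Amount Allowed  Amount We Paid  Non Covered  Deductible  Copayment
05/08/04            OUTPT LABORATORY  80053             35             35              35              0            0           0
05/10/04            OFFICE MEDICAL    84436             14             9.5             0               0            9.5         0
05/10/04- 05/13/04  HOME MED EQUIP    A4595RR           32             32              32              0            0           0
05/10/04- 05/13/04  HOME MED EQUIP    L3800RR           10             3.75            0               0            3.75        0
05/14/04            PHYSIOTHERAPY     97110             23             23              7               0            16          0
05/14/04            PHYSIOTHERAPY     97140             14             9               0               0            9           0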

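Problem: I am able to extract the data, but the records do not all follow the same layout. Some records have two dates (a date range split across two lines), while others have only one. Can anyone help me? For example: "05/10/04- HOME MED EQUIP 32.00 32.00 32.00 0.00 0.00 0.00 05/13/04 A4595RR"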

#Python

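I have a text file as input. After reading the file I have to identify the records shown above, which I was able to do. However, the data is inconsistent, and I need to read the input above and load it into an Excel file in the expected output format.

Note: some dates are dynamic and are sometimes not present.

I tried the following, but it only reads the first two rows and is also missing information: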

import pandas as pd
import re

input_file_path = r'C:\Users\test\Downloads\PracticalAssessmentFiles\Input.txt'

output_file_path = r'C:\Users\test\Downloads\PracticalAssessmentFiles\output.xlsx'

with open(input_file_path, 'r') as file:
    input_string = file.read()

pattern = r'(\d{2}/\d{2}/\d{2,4}(?:\s*-\s*\d{2}/\d{2}/\d{2,4})?)\s+([\w\s]+)\s+([\dA-Z]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)'
matches = re.findall(pattern, input_string)

data = []
for match in matches:
    service_date, service_type, procedure_number, billed, allowed, paid, non_covered, deductible, copayment = match
    data.append({
        "Service Date": service_date,
        "Type of Service": service_type,
        "Procedure Number": procedure_number,
        "Amount Billed": billed,
        "Amount Allowed": allowed,
        "Amount We Paid": paid,
        "Non Covered": non_covered,
        "Deductible": deductible,
        "Copayment": copayment
    })

df = pd.DataFrame(data)

df.to_excel(output_file_path, index=False)

print(f"Data has been processed and saved to {output_file_path}")
python-3.x regex

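Comments

0 votes pho 10/27/2023
Please consider editing your title so that it summarizes the actual problem you are facing (see How to Ask). Also, please format your question to make it more readable.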

A:

0 votes Tim Roberts 10/27/2023 #1

This does what you asked. Perhaps it can be a good basis for you to expand on.

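import re

# First line of each record: date (possibly ending in '-'), service type, amounts.
oddlines = re.compile(r"([0-9/-]*) ([A-Z ]*) ([0-9. ]*)")

with open('x.txt') as f:
    for num, line in enumerate(f):
        if num % 2 == 0:
            # Even index (0-based): the line with the date, title and amounts.
            parts = oddlines.match(line)
            dt, title, nums = parts.groups()
            nums = nums.split()
        else:
            # Odd index: the procedure code, preceded by the end date for a range.
            if dt[-1] == '-':
                dt1, proc = line.split()
                dt += dt1
            else:
                proc = line.strip()
            # Only the first five amounts are kept here.
            row = [dt, title, proc] + nums[:5]
            print(','.join(row))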

Output:

05/08/04,OUTPT LABORATORY,80053,35.00,35.00,35.00,0.00,0.00
05/10/04,OFFICE MEDICAL,84436,14.00,9.50,0.00,0.00,9.50
05/10/04-05/13/04,HOME MED EQUIP,A4595RR,32.00,32.00,32.00,0.00,0.00
05/10/04-05/13/04,HOME MED EQUIP,L3800RR,10.00,3.75,0.00,0.00,3.75
05/14/04,PHYSIOTHERAPY,97110,23.00,23.00,7.00,0.00,16.00
05/14/04,PHYSIOTHERAPY,97140,14.00,9.00,0.00,0.00,9.00
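
Note that the output above carries five amounts per row, whereas the expected spreadsheet in the question has six amount columns. The following is a minimal sketch of one way to keep all six and write them to Excel. It assumes the input strictly alternates between an amounts line and a procedure line, as in the sample. The names `COLUMNS`, `FIRST_LINE`, `parse_records` and `write_excel` are hypothetical helpers introduced for illustration, not part of the original answer.

```python
import re

# Column names taken from the question's expected output.
COLUMNS = [
    "Service Date", "Type of Service", "Procedure Number",
    "Amount Billed", "Amount Allowed", "Amount We Paid",
    "Non Covered", "Deductible", "Copayment",
]

# First line of a record: start date (optionally ending in '-'),
# the service type in capitals, then six amounts. Anything after the
# sixth amount (the stray "1" on some lines) is ignored.
FIRST_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{2,4}-?)\s+([A-Z][A-Z ]*?)\s+"
    r"((?:\d+\.\d{2}\s+){5}\d+\.\d{2})"
)


def parse_records(text):
    """Pair each amounts line with the procedure line that follows it."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows = []
    # Assumption: non-blank lines come in pairs (amounts line, procedure line).
    for first, second in zip(lines[0::2], lines[1::2]):
        m = FIRST_LINE.match(first)
        if m is None:
            continue  # assumption: skip pairs that do not look like a record
        date, service, amounts = m.groups()
        parts = second.split()
        if date.endswith("-") and len(parts) == 2:
            # Date range: the end date sits in front of the procedure code.
            date = f"{date} {parts[0]}"
        proc = parts[-1]
        rows.append([date, service, proc] + [float(a) for a in amounts.split()])
    return rows


def write_excel(rows, path):
    """Write parsed rows to an Excel file."""
    # pandas (plus an Excel engine such as openpyxl) is needed only here.
    import pandas as pd
    pd.DataFrame(rows, columns=COLUMNS).to_excel(path, index=False)
```

Calling `write_excel(parse_records(open('x.txt').read()), 'output.xlsx')` would then produce one spreadsheet row per record. Converting the amounts to `float` lets Excel treat them as numbers rather than text, which is a design choice and not something the question requires.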