Asked by: prudhvi Asked: 10/27/2023 Last edited by: Tim Roberts, prudhvi Updated: 10/27/2023 Views: 46
Need help extracting the following data
Q:
Input text:
05/08/04 OUTPT LABORATORY 35.00 35.00 35.00 0.00 0.00 0.00
80053
05/10/04 OFFICE MEDICAL 14.00 9.50 0.00 0.00 9.50 0.00 1
84436
05/10/04- HOME MED EQUIP 32.00 32.00 32.00 0.00 0.00 0.00
05/13/04 A4595RR
05/10/04- HOME MED EQUIP 10.00 3.75 0.00 0.00 3.75 0.00 1
05/13/04 L3800RR
05/14/04 PHYSIOTHERAPY 23.00 23.00 7.00 0.00 16.00 0.00
97110
05/14/04 PHYSIOTHERAPY 14.00 9.00 0.00 0.00 9.00 0.00 1
97140
Expected output format in Excel:
Service Date | Type of Service | Procedure Number | Amount Billed | Amount Allowed | Amount We Paid | Non Covered | Deductible | Copayment |
---|---|---|---|---|---|---|---|---|
05/08/04 | OUTPT LABORATORY | 80053 | 35 | 35 | 35 | 0 | 0 | 0 |
05/10/04 | OFFICE MEDICAL | 84436 | 14 | 9.5 | 0 | 0 | 9.5 | 0 |
05/10/04- 05/13/04 | HOME MED EQUIP | A4595RR | 32 | 32 | 32 | 0 | 0 | 0 |
05/10/04- 05/13/04 | HOME MED EQUIP | L3800RR | 10 | 3.75 | 0 | 0 | 3.75 | 0 |
05/14/04 | PHYSIOTHERAPY | 97110 | 23 | 23 | 7 | 0 | 16 | 0 |
05/14/04 | PHYSIOTHERAPY | 97140 | 14 | 9 | 0 | 0 | 9 | 0 |
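Problem: I am able to extract the data, but the records do not all follow the same layout. Some records have two dates (a start date and an end date), while others have only one. Can anyone help me? For example: "05/10/04- HOME MED EQUIP 32.00 32.00 32.00 0.00 0.00 0.00 05/13/04 A4595RR"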
#Python
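I have a text file as input. After reading the file, I need to identify the records shown above, which I am able to do. However, my data is inconsistent, and I still need to read the input above and load it into an Excel file that matches the expected output.
Note: some of the dates are dynamic and are sometimes not available.
I tried this, but it only reads the first two lines, and some information is missing: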
import pandas as pd
import re

input_file_path = r'C:\Users\test\Downloads\PracticalAssessmentFiles\Input.txt'
output_file_path = r'C:\Users\test\Downloads\PracticalAssessmentFiles\output.xlsx'

with open(input_file_path, 'r') as file:
    input_string = file.read()

pattern = r'(\d{2}/\d{2}/\d{2,4}(?:\s*-\s*\d{2}/\d{2}/\d{2,4})?)\s+([\w\s]+)\s+([\dA-Z]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)'
matches = re.findall(pattern, input_string)

data = []
for match in matches:
    service_date, service_type, procedure_number, billed, allowed, paid, non_covered, deductible, copayment = match
    data.append({
        "Service Date": service_date,
        "Type of Service": service_type,
        "Procedure Number": procedure_number,
        "Amount Billed": billed,
        "Amount Allowed": allowed,
        "Amount We Paid": paid,
        "Non Covered": non_covered,
        "Deductible": deductible,
        "Copayment": copayment
    })

df = pd.DataFrame(data)
df.to_excel(output_file_path, index=False)
print(f"Data has been processed and saved to {output_file_path}")
A:
0 votes
Tim Roberts
10/27/2023
#1
This does what you asked. Perhaps it can serve as a good base for you to extend.
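import re

# Matches the first line of each record: date, service type, then the amounts.
oddlines = re.compile(r"([0-9/-]*) ([A-Z ]*) ([0-9. ]*)")

for num,line in enumerate(open('x.txt')):
    if num % 2 == 0:
        # First line of a record (0-based even index).
        parts = oddlines.match(line)
        dt, title, nums = parts.groups()
        nums = nums.split()
    else:
        # Second line: either "end-date code" or just the procedure code.
        if dt[-1] == '-':
            dt1,proc = line.split()
            dt += dt1
        else:
            proc = line.strip()
        # nums[:5] keeps the first five amounts, so the sixth (copayment)
        # is not printed; nums[:6] would keep all six.
        row = [dt, title, proc] + nums[:5]
        print(','.join(row))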
Output:
05/08/04,OUTPT LABORATORY,80053,35.00,35.00,35.00,0.00,0.00
05/10/04,OFFICE MEDICAL,84436,14.00,9.50,0.00,0.00,9.50
05/10/04-05/13/04,HOME MED EQUIP,A4595RR,32.00,32.00,32.00,0.00,0.00
05/10/04-05/13/04,HOME MED EQUIP,L3800RR,10.00,3.75,0.00,0.00,3.75
05/14/04,PHYSIOTHERAPY,97110,23.00,23.00,7.00,0.00,16.00
05/14/04,PHYSIOTHERAPY,97140,14.00,9.00,0.00,0.00,9.00
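Since the question asks for all six amount columns under named headers, the same two-lines-per-record idea can be sketched as a function that returns a list of dictionaries. This is only an illustration, not part of the original answer: the names `RECORD`, `AMOUNT_COLUMNS` and `parse_claims` are hypothetical, and it assumes the input strictly alternates between a record line and a code line, as in the sample.

```python
import re

# Hypothetical pattern for the first line of a record (an assumption based
# on the sample input): date with optional trailing '-', service type,
# then exactly six amounts with two decimals. A trailing flag such as
# " 1" is left unmatched and therefore ignored.
RECORD = re.compile(
    r"^(\d{2}/\d{2}/\d{2}-?)\s+"
    r"([A-Z ]+?)\s+"
    r"((?:\d+\.\d{2}\s+){5}\d+\.\d{2})"
)

AMOUNT_COLUMNS = ["Amount Billed", "Amount Allowed", "Amount We Paid",
                  "Non Covered", "Deductible", "Copayment"]


def parse_claims(text):
    """Pair each record line with the code line that follows it."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows = []
    for first, second in zip(lines[0::2], lines[1::2]):
        m = RECORD.match(first)
        if m is None:
            continue  # skip pairs whose first line is not a record
        date, service, amounts = m.groups()
        if date.endswith("-"):
            # Date range: the second line holds "end-date procedure-code".
            end_date, proc = second.split()
            date = f"{date} {end_date}"
        else:
            proc = second
        row = {"Service Date": date,
               "Type of Service": service,
               "Procedure Number": proc}
        row.update(zip(AMOUNT_COLUMNS, map(float, amounts.split())))
        rows.append(row)
    return rows
```

The returned list can be passed to `pd.DataFrame(rows).to_excel(output_file_path, index=False)`, as in the question's own code, to produce the spreadsheet. Converting the amounts to `float` is a design choice so that Excel treats them as numbers; keep them as strings if the exact text matters.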