Asked by: PSt Asked: 11/16/2023 Updated: 11/17/2023 Views: 37
Reading information and storing it in specific columns
Q:
I have a txt file in a fixed-width (length-delimited) layout that looks like this:
| ID 12345678 John Doe |
| Type B123AAAA Entry1 |
| Category One User C_1234231122 |
| Date 01.02.2023 Time 06:17:01 |
| text 01.02.2023 31.12.2025 some text |
| ID 98732411 Johanna Doe |
| Type C123CCCC Entry2 |
| Category Two User C_2222222122 |
| Date 01.04.2023 Time 09:17:01 |
| text 01.02.2023 31.12.2027 some text2 |
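(In fact, I have many files, and they are quite large, so there is a lot of data. One file is enough to illustrate the problem, though: I will simply loop over the files, process each one with the same code, and store each in a DataFrame.)
I want to get the information into a pandas DataFrame. The result should look like this: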
ID        Name         Type      Text1   Category  User          Date        Time      Additionalline  From        To          Text2
12345678  John Doe     B123AAAA  Entry1  One       C_1234231122  01.02.2023  06:17:01  text            01.02.2023  31.12.2025  some text
98732411  Johanna Doe  C123CCCC  Entry2  Two       C_2222222122  01.04.2023  09:17:01  text            01.02.2023  31.12.2027  some text2
My code so far is:
import os
import pandas as pd
import csv

df = pd.DataFrame(columns=["ID", "Name", "Type"])
appended_data = []
with open(r'pathtofile\test.txt', "r", encoding="utf-8") as f:
    for line in f:
        # The slice offsets below refer to the fixed-width layout of the real file.
        if line.startswith("| ID"):
            valueID = line[12:28].replace(" ", "")
            valueName = line[28:80].rstrip()
        if line.startswith("| Type"):
            valueType = line[12:28].rstrip()
            # The "Type" line follows the "ID" line, so all three values are set here.
            df_new = pd.DataFrame({"ID": valueID, "Name": valueName, "Type": valueType}, index=[0])
            appended_data.append(df_new)
df_compl = pd.concat(appended_data)
print(df_compl)
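I would now continue with the other columns (Text1, Category, User, and so on), so this is just an example for the first three. Before I go further down this path, though: I suspect my approach is not the best one. Is there a better or more efficient way to do this?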
A:
1 upvote
Andrej Kesely
11/17/2023
#1
Try:
import re

import pandas as pd

text = """\
| ID 12345678 John Doe |
| Type B123AAAA Entry1 |
| Category One User C_1234231122 |
| Date 01.02.2023 Time 06:17:01 |
| text 01.02.2023 31.12.2025 some text |
| ID 98732411 Johanna Doe |
| Type C123CCCC Entry2 |
| Category Two User C_2222222122 |
| Date 01.04.2023 Time 09:17:01 |
| text 01.02.2023 31.12.2027 some text2 |"""

# One match per record: from an "ID" line up to the next "ID" line or the end of the text.
groups = r"^\s*\|\s*(ID.*?)(?=\s*\|\s*ID|\Z)"

all_lines = []
for g in re.findall(groups, text, flags=re.S | re.M):
    # Pull each field out of the record with its own small pattern.
    id_, name = re.search(r"ID\s*(\S+)\s*(.*?)\s*\|", g, flags=re.M).groups()
    type_, text1 = re.search(r"Type\s*(\S+)\s*(\S+)", g).groups()
    category, user = re.search(r"Category\s*(\S+)\s*User\s*(\S+)", g).groups()
    date, time = re.search(r"Date\s*(\S+)\s*Time\s*(\S+)", g).groups()
    additional_line = re.search(r"text\s*(.*?)\s*\|", g).group(1)
    all_lines.append(
        [id_, name, type_, text1, category, user, date, time, additional_line]
    )

# Build the DataFrame once, from all collected rows.
df = pd.DataFrame(
    all_lines,
    columns="id name type text1 category user date time additional_line".split(),
)
print(df)
Prints:
id name type text1 category user date time additional_line
0 12345678 John Doe B123AAAA Entry1 One C_1234231122 01.02.2023 06:17:01 01.02.2023 31.12.2025 some text
1 98732411 Johanna Doe C123CCCC Entry2 Two C_2222222122 01.04.2023 09:17:01 01.02.2023 31.12.2027 some text2
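The output above keeps the last line of each record in a single additional_line column, while the question asks for separate Additionalline, From, To and Text2 columns. As a rough sketch of a line-by-line alternative that produces that layout: the hypothetical helpers parse_records and load_files below are not from the question or the answer, and they assume that fields can be separated on whitespace (Category and the three dates/IDs are single tokens, and any line that is not ID, Type, Category or Date is the "additional" line). If the real fixed-width fields can contain embedded blanks, slicing by column offsets would be needed instead.

```python
import pandas as pd

SAMPLE = """\
| ID 12345678 John Doe |
| Type B123AAAA Entry1 |
| Category One User C_1234231122 |
| Date 01.02.2023 Time 06:17:01 |
| text 01.02.2023 31.12.2025 some text |
| ID 98732411 Johanna Doe |
| Type C123CCCC Entry2 |
| Category Two User C_2222222122 |
| Date 01.04.2023 Time 09:17:01 |
| text 01.02.2023 31.12.2027 some text2 |
"""


def parse_records(lines):
    """Yield one dict per record; a new record starts at each 'ID' line."""
    record = None
    for raw in lines:
        # Drop the surrounding pipes, then tokenize on whitespace (assumption).
        tokens = raw.strip().strip("|").split()
        if not tokens:
            continue
        key = tokens[0]
        if key == "ID":
            if record is not None:
                yield record
            record = {"ID": tokens[1], "Name": " ".join(tokens[2:])}
        elif record is None:
            continue  # ignore anything before the first ID line
        elif key == "Type":
            record["Type"] = tokens[1]
            record["Text1"] = " ".join(tokens[2:])
        elif key == "Category":
            record["Category"], record["User"] = tokens[1], tokens[3]
        elif key == "Date":
            record["Date"], record["Time"] = tokens[1], tokens[3]
        else:
            # Assumed layout: <label> <from date> <to date> <free text>
            record["Additionalline"] = key
            record["From"], record["To"] = tokens[1], tokens[2]
            record["Text2"] = " ".join(tokens[3:])
    if record is not None:
        yield record


def load_files(paths):
    """Parse several files and build a single DataFrame at the end."""
    rows = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            rows.extend(parse_records(f))
    return pd.DataFrame(rows)


df = pd.DataFrame(list(parse_records(SAMPLE.splitlines())))
print(df)
```

Collecting plain dicts and calling pd.DataFrame once avoids creating a one-row DataFrame per record and concatenating them, which tends to matter for many large files. The sketch does no validation, so a malformed line would raise an IndexError rather than be skipped.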