提问人:Java 提问时间:10/28/2023 最后编辑:Java 更新时间:10/28/2023 访问量:81
如何使用第一列中的值从 csv 文件中提取某些数据块
How to extract certain blocks of data from a csv file using values in the first column
问:
我试图从第一列的关键字中从第三行和第二列开始,仅提取两个部分(“入院”和“再入院”)中的数据。
请参阅底部的 csv 文件中的数据集示例。
Admissions
Not Started:12 Sent:3 Completed:3
Division Community ResidentName Date DocumentStatus Last Update
Test Station Jane Doe 9/12/2023 Sent 9/12/2023
Test Station 2 John Doe 9/12/2023 Not Started
Alibaba Fizgerald Super Man 9/12/2023 Not Started
Iceland Kingdom Super Woman 9/12/2023 Not Started
,,,,,
Readmissions
Not Started:1 Sent:0 Completed:1
Division Community Resident Name Date DocumentStatus Last Update
Station Kingdom Pretty Woman 9/12/2023 Not Started
My Goodness Ugly Man 7/21/2023 Completed 7/26/2023
,,,,
Discharge
Division Community Resident Name Date
Station Kingdom1 Pretty Woman2 8/22/2023
My Goodness1 Ugly Man1 4/8/2023
Landmark2 Nice Guys 9/12/2023
Iceland Kingdom2 Mr. Heroshi2 7/14/2023
More Kingdom 2 King Kong 8/31/2023
所以,逻辑是:
找到包含 column1 = 'Admissions' 或 'Readmissions' 的数据的行。
向下走 3 行,向右走 1 列。 获取 5 列的所有数据,直到它到达没有数据的行。
因此,我想将蓝框部分作为每个部分的输出:
我从上一篇文章中得到了这个代码,但几周来我都无法解决它/获得解决方案,所以我发布了这个新代码。
import re
import pandas as pd
from io import StringIO
def read_block(names, igidx=True):
with open("Test1.csv") as f:
pat = r"(\w+),+$\n+(.+?)(?=\n\w+,+\n$|\Z)"
return pd.concat([
pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
.iloc[:, 1:].dropna(how="all") for m in re.finditer(
pat, f.read(), flags=re.M|re.S) if m.group(1) in names # optional
], keys=names, ignore_index=igidx)
df = read_block(names=["Admissions"])
print(df)
底部是电流输出,但它连接了所有三个部分。
我想在第一列中使用关键字(“Admissions”、“Readmissions”)作为代码中的变量进行单独的输出。
Sent: 3 Completed: 3 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 Community Resident Name Date Document Status Last Update
1 Test Station Jane Doe 9/12/2023 Sent 9/12/2023
2 Test Station 2 John Doe 9/12/2023 Not Started NaN
3 Alibaba Fizgerald Super Man 9/12/2023 Not Started NaN
4 Iceland Kingdom Super Woman 9/12/2023 Not Started NaN
5 Sent: 0 Completed: 1 NaN NaN NaN
6 Community Resident Name Date Document Status Last Update
7 Station Kingdom Pretty Woman 9/12/2023 Not Started NaN
8 My Goodness Ugly Man 7/21/2023 Completed 7/26/2023
9 Community Resident Name Date NaN NaN
10 Station Kingdom1 Pretty Woman2 8/22/2023 NaN NaN
11 My Goodness1 Ugly Man1 4/8/2023 NaN NaN
12 Landmark2 Nice Guys 9/12/2023 NaN NaN
13 Iceland Kingdom2 Mr. Heroshi2 7/14/2023 NaN NaN
14 More Kingdom 2 King Kong 8/31/2023 NaN NaN
如何修改当前代码以使其正常工作?
答:
1赞
Timeless
10/28/2023
#1
你之前的问题是一个移动的目标,我不排除这个问题也可能是。旧的正则表达式模式在新的 csv 上失败,仅仅是因为数据集的格式现在与这个格式不同(最初为该模式创建)。
现在,您可以尝试以下操作:
pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})" # to update in `read_block`
df = read_block(names=["Admissions", "Readmissions"], igidx=False)
演示 : [regex101]
向下滚动,因为我使用了给定屏幕截图/ csv的精确副本
输出:
print(df)
Community ResidentName Date DocumentStatus Last Update Resident Name
Admissions 0 Test Station Jane Doe 9/12/2023 Sent 9/12/2023 NaN
1 Test Station 2 John Doe 9/12/2023 Not Started NaN NaN
2 Alibaba Fizgerald Super Man 9/12/2023 Not Started NaN NaN NaN
3 Iceland Kingdom Super Woman 9/12/2023 Not Started NaN NaN
Readmissions 0 Station Kingdom NaN 9/12/2023 Not Started NaN Pretty Woman
1 My Goodness NaN 7/21/2023 Completed 7/26/2023 Ugly Man
我想在第一列中使用关键字(“Admissions”、“Readmissions”)作为代码中的变量进行单独的输出。
然后使用 .df.droplevel(1).reset_index(names="block")
评论
0赞
Java
10/28/2023
如何/在哪里应用 df.droplevel(1).reset_index(names=“block”) ?
1赞
Timeless
10/28/2023
你可以像这样应用它:。df = read_block(names=["Admissions", "Readmissions"], igidx=False).droplevel(1).reset_index(names="block")
1赞
Java
10/28/2023
谢谢。这是一件很棒的事情。我应该学习如何在 Python 表达式中做得更好。
0赞
Java
10/28/2023
问题。“阻止”的目的是什么?如果我输入“Admissions”或“Readmissions”,似乎我会得到相同的输出。
1赞
Timeless
10/28/2023
如果你的意思是,这个只会重命名第一列。如果没有它,您将获得一个名为 的列。检查熊猫。DataFrame.reset_index
。names=block
index
评论