如何使用第一列中的值从 csv 文件中提取某些数据块

How to extract certain blocks of data from a csv file using values in the first column

提问人:Java 提问时间:10/28/2023 最后编辑:Java 更新时间:10/28/2023 访问量:81

问:

我试图从第一列的关键字中从第三行和第二列开始,仅提取两个部分(“入院”和“再入院”)中的数据。

请参阅底部的 csv 文件中的数据集示例。

Admissions
Not Started:12  Sent:3            Completed:3

Division        Community         ResidentName   Date        DocumentStatus  Last Update
                Test Station      Jane Doe       9/12/2023   Sent            9/12/2023
                Test Station 2    John Doe       9/12/2023   Not Started       
                Alibaba Fizgerald Super Man      9/12/2023   Not Started  
                Iceland Kingdom   Super Woman    9/12/2023   Not Started 

,,,,,

Readmissions  
Not Started:1   Sent:0            Completed:1 

Division        Community         Resident Name   Date         DocumentStatus  Last Update
                Station Kingdom   Pretty Woman    9/12/2023    Not Started  
                My Goodness       Ugly Man        7/21/2023    Completed       7/26/2023 

,,,,

Discharge 

Division        Community         Resident Name   Date            
                Station Kingdom1  Pretty Woman2   8/22/2023      
                My Goodness1      Ugly Man1       4/8/2023     
                Landmark2         Nice Guys       9/12/2023     
                Iceland Kingdom2  Mr. Heroshi2    7/14/2023    
                More Kingdom 2    King Kong       8/31/2023

所以,逻辑是:

找到包含 column1 = 'Admissions' 或 'Readmissions' 的数据的行。

向下走 3 行,向右走 1 列。 获取 5 列的所有数据,直到它到达没有数据的行。

因此,我想将蓝框部分作为每个部分的输出:

这是一个插图:enter image description here

我从上一篇文章中得到了这个代码,但几周来我都无法解决它/获得解决方案,所以我发布了这个新代码。

import re 
import pandas as pd
from io import StringIO

def read_block(names, igidx=True):
    with open("Test1.csv") as f:
        pat = r"(\w+),+$\n+(.+?)(?=\n\w+,+\n$|\Z)"
        return pd.concat([
            pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
                .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                    pat, f.read(), flags=re.M|re.S) if m.group(1) in names # optional
        ], keys=names, ignore_index=igidx)

df = read_block(names=["Admissions"])    
print(df)  

底部是电流输出,但它连接了所有三个部分。

我想在第一列中使用关键字(“Admissions”、“Readmissions”)作为代码中的变量进行单独的输出。

          Sent: 3    Completed: 3 Unnamed: 3       Unnamed: 4   Unnamed: 5
0           Community   Resident Name       Date  Document Status  Last Update
1        Test Station       Jane Doe   9/12/2023             Sent    9/12/2023
2      Test Station 2        John Doe  9/12/2023      Not Started          NaN
3   Alibaba Fizgerald       Super Man  9/12/2023      Not Started          NaN
4     Iceland Kingdom     Super Woman  9/12/2023      Not Started          NaN
5             Sent: 0    Completed: 1        NaN              NaN          NaN
6           Community   Resident Name       Date  Document Status  Last Update
7     Station Kingdom   Pretty Woman   9/12/2023      Not Started          NaN
8         My Goodness        Ugly Man  7/21/2023        Completed    7/26/2023
9           Community   Resident Name       Date              NaN          NaN
10  Station Kingdom1   Pretty Woman2   8/22/2023              NaN          NaN
11      My Goodness1        Ugly Man1   4/8/2023              NaN          NaN
12          Landmark2       Nice Guys  9/12/2023              NaN          NaN
13   Iceland Kingdom2    Mr. Heroshi2  7/14/2023              NaN          NaN
14     More Kingdom 2      King Kong   8/31/2023              NaN          NaN

如何修改当前代码以使其正常工作?

Python pandas 正则表达式

评论


答:

1赞 Timeless 10/28/2023 #1

之前的问题是一个移动的目标,我不排除这个问题也可能是。旧的正则表达式模式在新的 csv 上失败,仅仅是因为数据集的格式现在与这个格式不同(最初为该模式创建)。

现在,您可以尝试以下操作:

 pat = r"(\w+),+$\n[^,]+.+?\n,+\n(.+?)(?=\n,{2,})" # to update in `read_block`

 df = read_block(names=["Admissions", "Readmissions"], igidx=False)

演示 : [regex101]

向下滚动,因为我使用了给定屏幕截图/ csv的精确副本

输出:

print(df)

                                  Community ResidentName         Date DocumentStatus Last Update Resident Name
Admissions   0                 Test Station     Jane Doe    9/12/2023           Sent   9/12/2023           NaN
             1               Test Station 2     John Doe    9/12/2023    Not Started         NaN           NaN
             2  Alibaba Fizgerald Super Man    9/12/2023  Not Started            NaN         NaN           NaN
             3              Iceland Kingdom  Super Woman    9/12/2023   Not Started          NaN           NaN
Readmissions 0              Station Kingdom          NaN    9/12/2023    Not Started         NaN  Pretty Woman
             1                  My Goodness          NaN    7/21/2023      Completed  7/26/2023       Ugly Man
    

我想在第一列中使用关键字(“Admissions”、“Readmissions”)作为代码中的变量进行单独的输出。

然后使用 .df.droplevel(1).reset_index(names="block")

评论

0赞 Java 10/28/2023
如何/在哪里应用 df.droplevel(1).reset_index(names=“block”) ?
1赞 Timeless 10/28/2023
你可以像这样应用它:。df = read_block(names=["Admissions", "Readmissions"], igidx=False).droplevel(1).reset_index(names="block")
1赞 Java 10/28/2023
谢谢。这是一件很棒的事情。我应该学习如何在 Python 表达式中做得更好。
0赞 Java 10/28/2023
问题。“阻止”的目的是什么?如果我输入“Admissions”或“Readmissions”,似乎我会得到相同的输出。
1赞 Timeless 10/28/2023
如果你的意思是,这个只会重命名第一列。如果没有它,您将获得一个名为 的列。检查熊猫。DataFrame.reset_indexnames=blockindex