如何使用第一列中的值从 csv 文件中提取某些数据块-解网

问：

我试图从第一列的关键字中从第三行和第二列开始，仅提取两个部分（“入院”和“再入院”）中的数据。

请参阅底部的 csv 文件中的数据集示例。

Admissions
Not Started:12  Sent:3            Completed:3

Division        Community         ResidentName   Date        DocumentStatus  Last Update
                Test Station      Jane Doe       9/12/2023   Sent            9/12/2023
                Test Station 2    John Doe       9/12/2023   Not Started       
                Alibaba Fizgerald Super Man      9/12/2023   Not Started  
                Iceland Kingdom   Super Woman    9/12/2023   Not Started

,,,,,

Readmissions  
Not Started:1   Sent:0            Completed:1 

Division        Community         Resident Name   Date         DocumentStatus  Last Update
                Station Kingdom   Pretty Woman    9/12/2023    Not Started  
                My Goodness       Ugly Man        7/21/2023    Completed       7/26/2023

,,,,

Discharge 

Division        Community         Resident Name   Date            
                Station Kingdom1  Pretty Woman2   8/22/2023      
                My Goodness1      Ugly Man1       4/8/2023     
                Landmark2         Nice Guys       9/12/2023     
                Iceland Kingdom2  Mr. Heroshi2    7/14/2023    
                More Kingdom 2    King Kong       8/31/2023

所以，逻辑是：

找到包含 column1 = 'Admissions' 或 'Readmissions' 的数据的行。

向下走 3 行，向右走 1 列。获取 5 列的所有数据，直到它到达没有数据的行。

因此，我想将蓝框部分作为每个部分的输出：

这是一个插图：

我从上一篇文章中得到了这个代码，但几周来我都无法解决它/获得解决方案，所以我发布了这个新代码。

import re 
import pandas as pd
from io import StringIO

def read_block(names, igidx=True):
    with open("Test1.csv") as f:
        pat = r"(\w+),+$\n+(.+?)(?=\n\w+,+\n$|\Z)"
        return pd.concat([
            pd.read_csv(StringIO(m.group(2)), skipinitialspace=True)
                .iloc[:, 1:].dropna(how="all") for m in re.finditer(
                    pat, f.read(), flags=re.M|re.S) if m.group(1) in names # optional
        ], keys=names, ignore_index=igidx)

df = read_block(names=["Admissions"])    
print(df)

底部是电流输出，但它连接了所有三个部分。

我想在第一列中使用关键字（“Admissions”、“Readmissions”）作为代码中的变量进行单独的输出。

          Sent: 3    Completed: 3 Unnamed: 3       Unnamed: 4   Unnamed: 5
0           Community   Resident Name       Date  Document Status  Last Update
1        Test Station       Jane Doe   9/12/2023             Sent    9/12/2023
2      Test Station 2        John Doe  9/12/2023      Not Started          NaN
3   Alibaba Fizgerald       Super Man  9/12/2023      Not Started          NaN
4     Iceland Kingdom     Super Woman  9/12/2023      Not Started          NaN
5             Sent: 0    Completed: 1        NaN              NaN          NaN
6           Community   Resident Name       Date  Document Status  Last Update
7     Station Kingdom   Pretty Woman   9/12/2023      Not Started          NaN
8         My Goodness        Ugly Man  7/21/2023        Completed    7/26/2023
9           Community   Resident Name       Date              NaN          NaN
10  Station Kingdom1   Pretty Woman2   8/22/2023              NaN          NaN
11      My Goodness1        Ugly Man1   4/8/2023              NaN          NaN
12          Landmark2       Nice Guys  9/12/2023              NaN          NaN
13   Iceland Kingdom2    Mr. Heroshi2  7/14/2023              NaN          NaN
14     More Kingdom 2      King Kong   8/31/2023              NaN          NaN

如何修改当前代码以使其正常工作？

Python pandas 正则表达式

print(df)

                                  Community ResidentName         Date DocumentStatus Last Update Resident Name
Admissions   0                 Test Station     Jane Doe    9/12/2023           Sent   9/12/2023           NaN
             1               Test Station 2     John Doe    9/12/2023    Not Started         NaN           NaN
             2  Alibaba Fizgerald Super Man    9/12/2023  Not Started            NaN         NaN           NaN
             3              Iceland Kingdom  Super Woman    9/12/2023   Not Started          NaN           NaN
Readmissions 0              Station Kingdom          NaN    9/12/2023    Not Started         NaN  Pretty Woman
             1                  My Goodness          NaN    7/21/2023      Completed  7/26/2023       Ugly Man

我想在第一列中使用关键字（“Admissions”、“Readmissions”）作为代码中的变量进行单独的输出。

然后使用 .df.droplevel(1).reset_index(names="block")

如何使用第一列中的值从 csv 文件中提取某些数据块

How to extract certain blocks of data from a csv file using values in the first column

评论

评论