Extracting a table from a CSV file using Python

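Asked by: Mark M. Asked: 11/16/2023 Last edited by: BigBen, Mark M. Updated: 11/17/2023 Views: 63

Q:

I am trying to use Python to extract specific tables from a report and compile the data into a single table. However, once it finds the required series of column names, it seems to copy everything below them, including all the unnecessary data that is not machine-readable. The output should be in table format. I am running this in Jupyter Notebook.

Code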

import csv
import pandas as pd

# Function to extract relevant data from the CSV file
def extract_data_from_csv(file_path):
    data_start = False
    headers = []
    rows = []

    with open(file_path, newline='') as csvfile:
        csv_reader = csv.reader(csvfile)
        
        for row in csv_reader:
            if not data_start:
                if row and row[0] == "SUBSCRIBER NAME":
                    headers = row
                    data_start = True
            elif data_start and row:
                if row[0].strip().upper() == "PRODUCT TOTAL":
                    break
                rows.append(row)

    return headers, rows

# Specify the path to your CSV file
csv_file_path = 'Testing/sample report.csv'

# Extract data
headers, rows = extract_data_from_csv(csv_file_path)

# Create a DataFrame using pandas
df = pd.DataFrame(rows, columns=headers)

# Display the DataFrame as a table
display(df)

# Save the DataFrame to a new CSV file
output_csv_file_path = 'mark_table_1.csv'
df.to_csv(output_csv_file_path, index=False)

For the file, see the Google link: link

python pandas csv jupyter-notebook data-science

A:

1 upvote Isaac Rene 11/16/2023 #1

I haven't tested it, but you could try changing extract_data_from_csv as follows:


# Function to extract relevant data from the CSV file
def extract_data_from_csv(file_path):
    data_start = False
    headers = []
    rows = []

    with open(file_path, newline='') as csvfile:
        csv_reader = csv.reader(csvfile)
        
        for row in csv_reader:
            if not data_start:
                if row and row[0] == "SUBSCRIBER NAME":
                    headers = row
                    data_start = True
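            elif data_start and row:
                # Guard against rows with fewer than six fields before indexing
                if len(row) > 5 and row[5].strip().upper() == "PRODUCT TOTAL": #changed row index to 5 
                    data_start = False #in place of break
                    continue #skip the total row itself
                rows.append(row)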

    return headers, rows


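Two changes were made:

  1. The index was changed from row[0].strip().upper() == "PRODUCT TOTAL" to row[5].strip().upper() == "PRODUCT TOTAL".
  2. break was replaced with data_start = False, which temporarily stops rows from being written until a new "SUBSCRIBER NAME" table appears. Note that headers will be reassigned in the if statement each time, but that should not be a problem because the header row is always the same. You could also modify the function further to avoid this.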
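
The second point says the function could be modified further so that headers is not reassigned for every table. Below is a minimal sketch of one way to do that, not tested against the real report. It keeps only the first header row, drops the total row instead of appending it, and looks for the end marker in any column rather than at a fixed index (an assumption, since the actual column is not shown here). The name extract_all_tables, its parameters, and the sample data are hypothetical; the function takes any iterable of text lines so it can be tried without the file.

```python
import csv
import io


def extract_all_tables(lines, header_marker="SUBSCRIBER NAME",
                       end_marker="PRODUCT TOTAL"):
    """Collect data rows from every table found in an iterable of CSV lines."""
    headers = None
    rows = []
    capturing = False

    for row in csv.reader(lines):
        if not row:
            continue  # skip completely empty lines
        if not capturing:
            if row[0] == header_marker:
                if headers is None:  # keep only the first header row
                    headers = row
                capturing = True
        elif any(cell.strip().upper() == end_marker for cell in row):
            # Assumption: the marker may sit in any column of the total row.
            # Stop capturing until the next header row and drop this row.
            capturing = False
        else:
            rows.append(row)

    return headers, rows


# Made-up sample with two tables separated by non-table text
sample = io.StringIO(
    "Report title\n"
    "SUBSCRIBER NAME,ACCOUNT,AMOUNT\n"
    "Alice,001,10.00\n"
    ",,PRODUCT TOTAL\n"
    "some footer text\n"
    "SUBSCRIBER NAME,ACCOUNT,AMOUNT\n"
    "Bob,002,20.00\n"
    ",,PRODUCT TOTAL\n"
)
headers, rows = extract_all_tables(sample)
print(headers)
print(rows)
```

With a real file, the same function could be called as extract_all_tables(csvfile) inside a with open(file_path, newline='') block.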
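
1 upvote Jon Clements 11/17/2023 #2

You might want to consider a custom generator function that iterates over the file, starts capturing when a certain condition is met, then consumes and yields the following rows until another condition is met (or the file ends), and then breaks out to keep looking for the first condition again. That is not much different from what you are already doing, but for example: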

import csv
import pandas as pd

def extract(filename):
    with open(filename, newline='') as fin:  # newline='' is what the csv module recommends
        csvin = csv.reader(fin)
        for row in csvin:
            if row and row[0] == 'SUBSCRIBER NAME':
                for subrow in csvin:
                    if subrow and subrow[0]: # change me perhaps?
                        yield dict(zip(row, subrow))
                    else:
                        break

Then get your DataFrame with:

df = pd.DataFrame(extract('your_file_name'))
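
To see how the generator behaves without the report at hand, here is a small sketch that applies the same nested-loop idea to in-memory lines. The name extract_rows and the sample data are made up for illustration. The point to notice is that both loops share one csv.reader iterator, so the inner loop consumes the rows of a table and the outer loop resumes right after it.

```python
import csv
import io


def extract_rows(lines):
    """Yield one dict per data row, for each table found in the lines."""
    csvin = csv.reader(lines)
    for row in csvin:
        if row and row[0] == "SUBSCRIBER NAME":
            # The inner loop advances the same iterator as the outer loop
            for subrow in csvin:
                if subrow and subrow[0]:
                    yield dict(zip(row, subrow))
                else:
                    break  # blank row or empty first cell ends this table


# Made-up sample: two small tables separated by a blank line
sample = io.StringIO(
    "Report title\n"
    "SUBSCRIBER NAME,AMOUNT\n"
    "Alice,10.00\n"
    "\n"
    "SUBSCRIBER NAME,AMOUNT\n"
    "Bob,20.00\n"
)
records = list(extract_rows(sample))
print(records)
```

A list of dicts like records can be passed to pd.DataFrame in the same way as the generator in the line above.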

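Comments

0 upvotes Mark M. 11/17/2023
Hi, thanks for the great help. I figured it out! I wanted to paste the whole code, but it seems to be too long. Haha. Thanks again.

1 upvote Mark M. 11/17/2023 #3

I figured it out.

Sharing the whole code for future reference.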

import csv
import pandas as pd

# Function to extract relevant data from the CSV file
def extract_data_from_csv(file_path):
    headers = None
    rows = []

    with open(file_path, newline='') as fin:
        csvin = csv.reader(fin)
        for row in csvin:
            if row and row[0] == 'SUBSCRIBER NAME':
                headers = row
                for subrow in csvin:
                    if subrow and subrow[0].strip().upper() == 'PRODUCT TOTAL':
                        break
                    elif subrow and subrow[0]:
                        rows.append(subrow)
                    else:
                        break

    return headers, rows

# Specify the path to your CSV file
csv_file_path = 'Testing/0976_Bills_Payment_Abstract_of_Collection_11142023_115828.csv'

# Extract data using the new extract function
headers, rows = extract_data_from_csv(csv_file_path)

# Create a DataFrame using pandas
df = pd.DataFrame(rows, columns=headers)

# Display the DataFrame as a table
display(df)

# Save the DataFrame to a new CSV file
output_csv_file_path = 'mark_table_2.csv'
df.to_csv(output_csv_file_path, index=False)