Tabula not reading my PDF: all data comes in as blank

Asked by: Analyst4 Asked: 11/18/2023 Last edited by: WoodfordAnalyst4 Updated: 11/18/2023 Views: 18

Q:

I'm trying to take this PDF: https://www.occ.gov/topics/charters-and-licensing/weekly-bulletin/2023/wb-11052023-11112023.pdf and export it as a CSV with the columns "ACTION", "DATE", "BANK NAME", "LOCATION", "CITY", "STATE".

My code is as follows:

import tabula
import pandas as pd


pdf_path = '*pdf file path*'

# Read PDF into a list of DataFrame
dfs = tabula.read_pdf(pdf_path, pages='2', multiple_tables=True)

# Concatenate DataFrames into a single DataFrame
df = pd.concat(dfs)

# Specify the columns to keep
columns_to_keep = ["ACTION", "DATE", "TYPE",  "BANK NAME", "LOCATION", "CITY", "STATE"]

# Select only the relevant columns
df = df[columns_to_keep]

# Drop rows with all NaN values
#df = df.dropna(how='all')

# Write the DataFrame to a CSV file
df.to_csv("output.csv", index=False)

print("CSV file generated successfully.")

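This produces the right headers in my CSV, but the data rows are empty. Does anyone have experience with this? For now I'm testing with page 2 only, but ideally I need the whole PDF.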

I tried the tabula functions, but the output comes back empty.
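To narrow down where the data goes missing, it may help to inspect each table before concatenating. The sketch below is only a diagnostic aid, not a confirmed fix. The helper names summarize_tables, keep_known_columns and inspect_pdf are hypothetical, and it assumes tabula.read_pdf returns a list of DataFrames, as it does with multiple_tables=True in the code above. It reports the shape, headers and non-null cell count of each table, then keeps only the requested columns that actually exist.

```python
import pandas as pd


def summarize_tables(dfs):
    """Return one row per extracted table: shape, headers, and non-null cell count."""
    rows = []
    for i, table in enumerate(dfs):
        rows.append({
            "table": i,
            "n_rows": table.shape[0],
            "n_cols": table.shape[1],
            "columns": list(table.columns),
            "non_null_cells": int(table.notna().sum().sum()),
        })
    return pd.DataFrame(rows)


def keep_known_columns(dfs, wanted):
    """Concatenate tables, keep the wanted columns that exist, drop all-empty rows.

    Returns the filtered DataFrame and the list of wanted columns that were missing.
    """
    if not dfs:
        # pd.concat raises on an empty list, so handle "nothing extracted" explicitly.
        return pd.DataFrame(), list(wanted)
    df = pd.concat(dfs, ignore_index=True)
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return df[present].dropna(how="all"), missing


def inspect_pdf(pdf_path, wanted):
    """Hypothetical driver: extract every page with tabula-py and print the diagnostics."""
    import tabula  # imported lazily; tabula-py needs a Java runtime

    dfs = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    print(summarize_tables(dfs))
    df, missing = keep_known_columns(dfs, wanted)
    print("Missing columns:", missing)
    return df
```

If non_null_cells is 0 for every table, the extraction itself is returning empty cells. If some tables have data under different headers, the concatenation and column selection are the more likely place where rows end up blank.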

python pdf tabula

A: No answers yet