Tabula not extracting every row of data in a pdf

Asked by Mark k on 9/19/2023 · Modified 9/19/2023 · Viewed 19 times

Q:

I currently have a problem I'm struggling with a bit. I'm trying to use the tabula library to extract every row from a PDF. The issue is that the script usually extracts the data correctly, but for some reason it is not pulling every row:

Number of pages read: 20
Processing page 1 with 25 rows
Processing page 2 with 18 rows
Processing page 3 with 24 rows
Processing page 4 with 22 rows
Processing page 5 with 18 rows
Processing page 6 with 4 rows
Processing page 7 with 20 rows
Processing page 8 with 13 rows
Processing page 9 with 3 rows
Processing page 10 with 7 rows
Processing page 11 with 17 rows
Processing page 12 with 23 rows
Processing page 13 with 23 rows
Processing page 14 with 29 rows
Processing page 15 with 20 rows
Processing page 16 with 17 rows
Processing page 17 with 22 rows
Processing page 18 with 25 rows
Processing page 19 with 21 rows
Processing page 20 with 6 rows
Total rows after cleaning: 246

So pages 6, 9, 10 and 20 are not being fully extracted. The PDF itself is best described as containing tables with no ruling lines or borders, which made pdfplumber a bit tricky when using the text-only vertical and horizontal strategies, so I went down the route I did.
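For context, the text-only strategy in pdfplumber referred to above looks roughly like the sketch below; vertical_strategy and horizontal_strategy are standard pdfplumber table settings, and the file path is just a placeholder for the downloaded PDF:

import pdfplumber

pdf_path = "latest_pdf.pdf"  # placeholder path to the downloaded registry PDF

# Text-based table detection, since the PDF has no ruling lines to latch onto
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        table = page.extract_table(table_settings={
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
        })
        if table:
            print(f"Page {page.page_number}: {len(table)} rows")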

The script I wrote looks like this:

import pandas as pd
import tabula
import getpass
import requests
from lxml import html
from datetime import datetime

username = getpass.getuser()

# Fetch the PDF link
url = "https://dac.gouvernement.lu/fr/administration/departements/navigabilite/immatriculation-aeronefs/releve-immatriculations.html"
response = requests.get(url)
tree = html.fromstring(response.content)
pdf_link = tree.xpath('//*[@id="main"]/article/div/div/p[5]/a/@href')[0]

# If the link is relative, convert it to an absolute URL
if not pdf_link.startswith("http"):
    base_url = "https://dac.gouvernement.lu"
    pdf_link = f"{base_url}{pdf_link}"

# Download the PDF
response = requests.get(pdf_link)
pdf_path = f"C:/Users/{username}/Downloads/latest_pdf.pdf"

# Save the downloaded PDF
with open(pdf_path, "wb") as f:
    f.write(response.content)

# Read tables into a list of DataFrames, without headers
dfs = tabula.read_pdf(pdf_path, pages='all', pandas_options={'header': None})

print(f"Number of pages read: {len(dfs)}")

# Create an empty list to store cleaned rows
cleaned_rows = []

# Loop through each DataFrame in the list
for i, df in enumerate(dfs):
    print(f"Processing page {i+1} with {len(df)} rows")
    
    current_row = []

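    # Rows whose first cell starts with "LX-" begin a new aircraft record;
    # any other row is treated as a continuation line and merged into current_row.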
    for index, row in df.iterrows():
        if pd.notna(row[0]) and str(row[0]).startswith("LX-"):
            if current_row:
                cleaned_rows.append(current_row)
            current_row = list(row)
        else:
            for idx, element in enumerate(row):
                if pd.notna(element):
                    current_row[idx] = str(current_row[idx]) + " " + str(element) if pd.notna(current_row[idx]) else element

    if current_row:
        cleaned_rows.append(current_row)

print(f"Total rows after cleaning: {len(cleaned_rows)}")

# Convert the list of cleaned rows into a DataFrame
cleaned_df = pd.DataFrame(cleaned_rows)

# Assign column names
cleaned_df.columns = ['Registration_Mark', 'Manufacturer', 'Aircraft_type', 'MSN', 'Registered_Owner', 'Registered_Operator']

cleaned_df['published_date'] = datetime.now().date().strftime('%Y-%m-%d')
cleaned_df['index'] = cleaned_df.index

I'm hoping someone can point me in the right direction on how I can improve the accuracy and get the script to extract everything.
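For reference, the same read_pdf call can also be forced into text-based (stream) extraction instead of letting tabula guess per page; stream and guess are standard tabula-py options, and whether this actually recovers the missing rows here is only an assumption to verify:

import tabula

pdf_path = "latest_pdf.pdf"  # placeholder path to the downloaded PDF

# Force text-based stream extraction and disable per-page guessing
dfs = tabula.read_pdf(
    pdf_path,
    pages='all',
    stream=True,
    guess=False,
    pandas_options={'header': None},
)
print(f"Number of tables read: {len(dfs)}")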

Tags: python · pandas · tabula

A: No answers yet