提问人:Mikołaj Sobczak 提问时间:10/29/2023 最后编辑:Mikołaj Sobczak 更新时间:10/29/2023 访问量:23
仅提取表的某些列的最佳方法是什么
What is the best approach to extract only some columns of the tables
问:
我正在编写一个程序。用户可以上传PDF文件。每个文件都有一个表。每个表都有列,例如(产品名称、单位、价格),有些文件可能不止这三列。我需要将每个pdf文件的三列数据保存到我的数据库中。 我的问题是我不知道哪种方法最适合我的项目。我用谷歌搜索了很多,但没有找到我问题的答案。
所以我正在处理的 3 个 PDF。
Pdfplumber 方法向我展示了,表格是嵌套的。
pdf1 = [[['WARZYWA', None, None], ['Nazwa produktu:', 'j.m.', 'Cena Netto:'], ['BAKŁAŻAN', 'kg', '8,51'], ['BROKUŁ 500G', 'szt', '5,29'], ['BRUKSELKA', 'kg', '0,00'], ['BURAK', 'kg', '2,19'], ['BURAK ŻÓŁTY', 'kg', '11,50'], ['BOCZNIAK', 'kg', '19,55'], ['BOTWINA PĘCZEK', 'szt', '3,80'], ['CEBULA', 'kg', '1,61'], ['CEBULA CHALLOT', 'kg', '11,50'], ['CEBULA CZERWONA', 'kg', '2,99'], ['CUKINIA KAL. 14', 'kg', '9,20'], ['CHRZAN KORZEŃ', 'kg', '16,10'], ['CYKORIA CZERWONA 250G', 'szt', '0,00'], ['CYKORIA ZIELONA 500G', 'szt', '6,56'], ['CZOSNEK', 'kg', '23,00'], ['CZOSNEK', 'szt', '1,84'], ['CZOSNEK CZARNY 100G', 'szt', '9,20'], ['DYNIA', 'kg', '0,00'], ['DYNIA HOKKAIDO', 'kg', '0,00'], ['DYNIA MAKARONOWA', 'kg', '0,00'], ['DYNIA PIŻMOWA', 'kg', '5,75'], ['FASOLKA SZPARAGOWA ZIELONA', 'kg', '19,55'], ['FASOLKA SZPARAGOWA ŻÓŁTA', 'kg', '0,00'], ['GROSZEK CUKROWY 250G', 'szt', '12,65'], ['GRZYBY SHIMEJI 150G', 'szt', '7,25'], ['IMBIR', 'kg', '12,65'], ['KABACZEK', 'kg', '0,00'], ['KALAREPA', 'szt', '2,59'], ['KALAFIOR KAL. 6', 'szt', '7,82'], ['KAPUSTA BIAŁA', 'kg', '2,76'], ['KAPUSTA CZERWONA', 'kg', '3,22'], ['KAPUSTA MŁODA', 'szt', '5,75'], ['KAPUSTA PAK CHOI 2SZT', 'szt', '7,48'], ['KAPUSTA PEKIŃSKA MŁODA', 'kg', '5,75'], ['KARCZOCHY', 'szt', '7,13'], ['KOPER', 'kg', '26,45'], ['KOPER WŁOSKI', 'kg', '14,49']]]
pdf2 = [['PRODUKT', 'Kraj poch.', 'JEDNOSTKA', 'CENA NETTO', 'VAT %', 'CENA BRUTTO', None], ['NOWOŚCI', None, None, None, None, None, None], ['CUKINIA ŻÓŁTA IMPORT', 'Holandia', 'kg', '8,50 zł', '0', '8,50 zł', ''], [None, None, None, None, None, None, ''], ['DONICZKA TYMIANEK CYTRYNOWY', 'Polska', 'szt.', '9,60 zł', '8', '10,37 zł', ''], [None, None, None, None, None, None, ''], ['DONICZKA WERBENA CYTRYNOWA', 'Polska', 'szt.', '11,90 zł', '0', '11,90 zł', ''], [None, None, None, None, None, None, ''], ['MIKRO PĘDY CZOSNKU 50SZT.', 'Polska', 'op.', '8,90 zł', '0', '8,90 zł', ''], [None, None, None, None, None, None, ''], ['PUREE Z MIRABELKI 1 KG', 'Francja', 'op.', '49,72 zł', '0', '49,72 zł', ''], [None, None, None, None, None, None, ''], ['SZPARAGI BIAŁE 500G PL', 'Polska', 'szt.', '17,65 zł', '0', '17,65 zł', ''], [None, None, None, None, None, None, ''], ['SZPARAGI FIOLETOWE KG', 'Holandia', 'kg', '24,06 zł', '0', '24,06 zł', ''], [None, None, None, None, None, None, ''], ['SZPARAGI ZIELONE 500G PL', 'Polska', 'szt.', '18,44 zł', '0', '18,44 zł', ''], [None, None, None, None, None, None, ''], ['WARZYWA', None, None, None, None, None, None], ['BAKŁAŻAN', 'Holandia', 'kg', '10,27 zł', '0', '10,27 zł', ''], [None, None, None, None, None, None, ''], ['BOTWINA PĘCZEK', 'Polska', 'kg.', '15,24 zł', '0', '15,24 zł', ''], [None, None, None, None, None, None, ''], ['BROKUŁ 500G', 'Hiszpania', 'szt.', '8,02 zł', '0', '8,02 zł', ''], [None, None, None, None, None, None, ''], ['BROKUŁ KG', 'Hiszpania', 'kg', '16,04 zł', '0', '16,04 zł', ''], [None, None, None, None, None, None, ''], ['BRUKIEW W', 'ielka Brytani', 'a kg', '4,01 zł', '0', '4,01 zł', ''], [None, None, None, None, None, None, ''], ['BURAK CHIOGGIA', 'Holandia', 'kg.', '14,44 zł', '0', '14,44 zł', ''], [None, None, None, None, None, None, ''], ['BURAK PL', 'Polska', 'kg', '3,69 zł', '0', '3,69 zł', ''], [None, None, None, None, None, None, ''], ['BURAK ŻÓŁTY', 'Holandia', 'kg.', '14,44 zł', '0', '14,44 zł', ''], [None, None, None, None, None, None, ''], ['BURAKI CAŁE GOTOWANE 500G', 'Francja', 'op.', '4,34 zł', '0', '4,34 zł', ''], [None, None, None, None, None, None, ''], ['CEBULA 75-105 IMP', 'Holandia', 'kg', '2,65 zł', '0', '2,65 zł', ''], [None, None, None, None, None, None, ''], ['CEBULA CZERWONA IMP', 'Holandia', 'kg', '4,34 zł', '0', '4,34 zł', ''], [None, None, None, None, None, None, ''], ['CEBULA SHALLOT', 'Francja', 'kg', '16,85 zł', '0', '16,85 zł', ''], [None, None, None, None, None, None, ''], ['CEBULKA PERŁOWA SREBRNA', 'Holandia', 'kg', '19,25 zł', '0', '19,25 zł', ''], [None, None, None, None, None, None, ''], ['CHRZAN IMP', 'Niemcy', 'kg.', '24,06 zł', '0', '24,06 zł', ''], [None, None, None, None, None, None, ''], ['CUKINIA ZIELONA IMP', 'Hiszpania', 'kg', '8,82 zł', '0', '8,82 zł', ''], [None, None, None, None, None, None, ''], ['CYKORIA CZERWONA', 'Holandia', 'kg', '48,12 zł', '0', '48,12 zł', ''], [None, None, None, None, None, None, ''], ['CYKORIA ZIELONA 500G', 'Holandia', 'op.', '7,38 zł', '0', '7,38 zł', ''], [None, None, None, None, None, None, ''], ['CZOSNEK SZT', 'Polska', 'szt.', '2,89 zł', '0', '2,89 zł', ''], [None, None, None, None, None, None, ''], ['DYNIA PIŻMOWA', 'Hiszpania', 'kg.', '5,13 zł', '0', '5,13 zł', ''], [None, None, None, None, None, None, ''], ['FASOLKA SZEROKA', 'Maroko', 'kg', '25,66 zł', '0', '25,66 zł', ''], [None, None, None, None, None, None, ''], ['FASOLKA SZPARAGOWA ZIELONA IMP', 'Senegal', 'kg', '25,66 zł', '0', '25,66 zł', ''], [None, None, None, None, None, None, ''], ['GALANGAL', 'Tajlandia', 'kg', '153,97 zł', '0', '153,97 zł', ''], [None, None, None, None, None, None, '']]
pdf3 - IndexError: list index out of range # but I can extract-text from this file.
这是我编写的简单函数,适用于上面的数据
# data is nested list I got from pdfplumber extract_table()
def get_data_by_column_header(data, column_header):
header_row = data[0]
print(header_row)
try:
column_index = header_row.index(column_header)
except ValueError:
return []
column_data = [row[column_index] for row in data[1:]]
return column_data
column_header = 'PRODUKT'
result = get_data_by_column_header(data2, column_header)
print(result)
它有点工作,但我无法弄清楚如何编写一个程序来解析每个 PDF 用户上传的内容。我应该在嵌套列表中签入关键字,然后按索引进行解析吗?但是在 pdf3 中它不起作用,因为 pdfplumber 看不到任何表格。
也尝试过 tabula-py。Tabula 可以将 pdf3 视为一个表格。
但在 lattice=True 中,只显示列标题:
[Empty DataFrame Columns: [Nr produktuNazwaj.m.ProducentCena nettoVAT [%]Cena brutto]
with stream=True,看不到标头,但显示数据:
[ Unnamed: 0 Polewy, syropy Unnamed: 1 ... Unnamed: 3 Unnamed: 4 Unnamed: 5 0 NaN Syropy NaN ... NaN NaN NaN 1 NaN NaN NaN ... NaN NaN NaN 2 I136:13792 Syrop żurawinowy 1L SZT ... 35,61 8.0 38,46 3 I136:12984 Syrop zielona mięta 1L SZT ... 26,72 8.0 28,86 ...
这是适用于白板并提取我想要的列的功能。
def extract_column_from_pdf(pdf_file, column_name):
df = tabula.read_pdf(pdf_file, pages=1, stream=True)[0]
# df = tabula.read_pdf(pdf_file, pages=1, lattice=True)[0]
if column_name in df.columns:
extracted_column = df[column_name]
return extracted_column
else:
return None
column_name = "PRODUKT"
result = extract_column_from_pdf(pdf2, column_name)
if result is not None:
print(result)
else:
print(f"Column not found in the PDF.")
由于这些嵌套列,也适用于一个文件。这是我的主要问题。如何按列名解析抛出的表并从该列中获取所有内容?有没有其他方法可以解决我的专栏问题?也许有一种更简单的方法可以提取我想保存到数据库中的列?
答: 暂无答案
评论