Asked by: snoozy Asked: 3/31/2023 Updated: 4/6/2023 Views: 38
Extracting data from a PDF in a specific format
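Q:
I want to extract data in the form of hashes. The text looks like the sample below. Sometimes there is one arrest type with multiple charges and charge descriptions, and sometimes one arrest type with a single charge and charge description. Sometimes there are multiple arrest types with multiple charges and charge descriptions. From the text below, I want output like this: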
{ Arrest Type=> Out Of County Warrant, charge => 11378 HS , Charge Description => POSS CNTL SUB FOR SALE , Arrest Type=> Out Of County Warrant, charge => 11379(A) HS , Charge Description => TRANSP/ETC CNTL SUB }
{ Arrest Type=> Bench Warrant, charge => 1203.2 PC , Charge Description => PROB VIOL:REARREST/REVOKE }
{ Arrest Type=> On View, charge => 11364 HS , Charge Description => CNTL SUB PARAPHERNALIA ,
Arrest Type=> On View, charge => 488 PC , Charge Description => PETTY THEFT,
Arrest Type=> Out Of County Warrant, charge => 487(C) PC , Charge Description => GRAND THEFT FROM PERSON }
Arrest Type\n Out Of County Warrant\n\n Charge
Description\n\n 11378 HS POSS CNTL SUB FOR SALE\n\n Charge Description\n 11379(A) HS
TRANSP/ETC CNTL SUB\n\n Charge
Description\n\n 978.5 PC
BENCH WARRANT:FTA:FELONY\n\n Name Arrest Type\n\n Bench Warrant\n\n Charge
Description\n 1203.2 PC
PROB VIOL:REARREST/REVOKE\n\n Name Arrest Type\n\n On View\n\n Charge Description\n 11364 HS CNTL SUB PARAPHERNALIA\n\n Charge Charge Description\n\n 488 PC PETTY
THEFT\n\n Arrest Type\n Out Of County Warrant\n\n Charge
Description\n\n 487(C) PC GRAND THEFT FROM PERSON\n\n\n
\n GC 6254(f)1\n
Name
This is the code I tried. How can I get the desired output?
arrest_data = []  # collects one hash per charge

total_page.scan(/Arrest Type\s+(.*?)\s+(Charge\s+(.*?)\s+Charge Description\s+(.*?)\s+|((?:Charge\s+(.*?)\s+Charge Description\s+(.*?)\s+)+)?)(?=Arrest Type|Name\z)/m).each do |match|
  arrest_type = match[0].strip
  charge_data = match[1].split(/\s{2,}/)
  (0...charge_data.length).step(2) do |i|
    new_hash = {
      "Arrest Type" => arrest_type,
      "charge" => charge_data[i],
      "Charge Description" => charge_data[i+1]
    }
    arrest_data << new_hash
  end
end

# print the resulting array of hashes
arrest_data.each do |arrest|
  p arrest
end
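
One possible direction, offered as a rough sketch rather than a confirmed fix: walk the text line by line instead of matching everything with one regex. The sketch assumes the extracted page text contains real newlines, that each charge sits on a single line shaped like "section number, two-letter code, description", and that a line starting with "Name" begins a new person. `SAMPLE_PAGE`, `CHARGE_LINE`, and `parse_arrests` are hypothetical names, and the sample is typed in from the text above.

```ruby
# Hypothetical sample standing in for the text extracted from one PDF page.
SAMPLE_PAGE = <<~TEXT
  Arrest Type
  Out Of County Warrant

  Charge Description

  11378 HS POSS CNTL SUB FOR SALE

  Charge Description
  11379(A) HS TRANSP/ETC CNTL SUB

  Charge Description

  978.5 PC BENCH WARRANT:FTA:FELONY

  Name Arrest Type

  Bench Warrant

  Charge Description
  1203.2 PC PROB VIOL:REARREST/REVOKE

  Name Arrest Type

  On View

  Charge Description
  11364 HS CNTL SUB PARAPHERNALIA

  Charge Charge Description

  488 PC PETTY THEFT

  Arrest Type
  Out Of County Warrant

  Charge Description

  487(C) PC GRAND THEFT FROM PERSON


  GC 6254(f)1
  Name
TEXT

# Assumed shape of a charge line: section number (optionally with a
# parenthesised subsection), a two-letter code abbreviation, then the description.
CHARGE_LINE = /\A(\d[\d.]*(?:\([A-Z0-9]+\))*\s+[A-Z]{2})\s+(.+)\z/

def parse_arrests(text)
  groups = [[]]        # one array of charge hashes per person
  arrest_type = nil    # arrest type currently in effect
  expect_type = false  # true right after an "Arrest Type" header line

  text.each_line do |raw|
    line = raw.strip
    next if line.empty?

    # Assumption: a line beginning with "Name" starts the next person.
    if line.match?(/\AName\b/)
      groups << []
      arrest_type = nil
    end

    if line.end_with?("Arrest Type")   # also matches "Name Arrest Type"
      expect_type = true
    elsif expect_type                  # first non-empty line after the header
      arrest_type = line
      expect_type = false
    elsif arrest_type && (m = CHARGE_LINE.match(line))
      groups.last << {
        "Arrest Type" => arrest_type,
        "charge" => m[1],
        "Charge Description" => m[2]
      }
    end
  end

  groups.reject(&:empty?)
end

parse_arrests(SAMPLE_PAGE).each { |group| p group }
```

Because a Ruby hash cannot repeat a key, each group here is an array of hashes rather than one hash with repeated "Arrest Type" entries. The sketch also keeps the 978.5 PC charge in the first group; the desired output above leaves it out, and since the rule for dropping it is not stated, it would need an extra filter.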
A: No answers yet