Parsing a PDF to get the text between address and date

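Question:

I want to get some information (the Rel type) from a parsed PDF. The problem is that the text is sometimes stuck between the address and the date. I have attached an image of the PDF page. When I parse the page, I get the following text.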

 Court Order -  01/30/2023    400 Block HANLON WAY\n                                                                                             Own\n                                                                                             Recognizance\n\n

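How can I get rel_type = Court Order - Own Recognizance? I have already tried this regular expression, but it only gives me the complete rel type when the value is not split across several lines: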

rel_typesmacthes = total_page.scan(/([MF])\s+(\d{2})\s+(\d{3})\s+([A-Z]{3}|t Specif t Spee)?\s+([A-Z]{3}|t Specif| t Spee)?\s+(.+)?(\d{1,2}\/\d{1,2}\/\d{2}\s+\d{1,2}:\d{2}\s+(?:am|pm))\s+(.+)?\s+(?:(\d{1,2}\/\d{1,2}\/2023))?/)

This is the page. Its inspect output looks like:

                Name                                       BookDate Time                DateOfBirth              Booking #            Bail Amount

                    MARK,DEONTAE                                   1/28/23 9:40 am               02/25/1998              CC23NM711        $0 Race          Gen    Height   Weight    Hair  Eyes     Job Description    Arrest Date Time   Rel type       Release Date Arrest Location

BLACK           M      72       159     BLK    BRO                     1/28/23 8:26 am                                 800 Block VAQUEROS AVE RODEO   Arrest Type

     On View

        Charge                                     Charge Description
              148(A)(1) PC                         OBSTRUCT/ETC PUB OFCR/ETC

        Charge                                     Charge Description

              273.6(A) PC                          VIO ORD:PREVNT DOMES VIOL   Arrest Type

     Parole Hold

        Charge                                     Charge Description

              3000.08 PC                           VIOLATION OF PAROLE
        Charge                                     Charge Description

              AB109                                AB109 REALIGNMENT

                        Name                                       BookDate Time                DateOfBirth              Booking #        Bail Amount
                MENDOZA-FREGOZA,LIDIO                              1/28/23 3:25 pm               10/20/1989              CC23NM719        $0

Race          Gen    Height   Weight    Hair  Eyes     Job Description Arrest Date Time   Rel type       Release Date Arrest Location HISPANIC        M      67       175     BLK    BRO                     1/28/23 2:42 pm    Court Order -  01/30/2023    400 Block HANLON WAY
                                                                                             Own
                                                                                             Recognizance

  Arrest Type
     Bench Warrant

        Charge                                     Charge Description

              10851(A) VC                          VEHICLE THEFT

        Charge                                     Charge Description
              466 PC                               POSSESS BURGLARY TOOLS

        Charge                                     Charge Description

              496D(A) PC                           POSS STOLEN VEH/VES/ETC

        Charge                                     Charge Description
              594(A) PC                            VANDALISM

        Charge                                     Charge Description

              978.5 PC                             BENCH WARRANT:FTA:FELONY

Tags: regex, Ruby, PDF parsing

Answer:

s = ["Race          Gen    Height   Weight    Hair  Eyes     Job Description Arrest Date Time   Rel type       Release Date Arrest Location HISPANIC        M      67       175     BLK    BRO                     1/28/23 2:42 pm    Court Order -  01/30/2023    400 Block HANLON WAY",
     "                                                                                               Own",
     "                                                                                             Recognizance"]

To read the text from a PDF and process it line by line, giving data like the array shown above:

require 'pdf/reader'

reader = PDF::Reader.new(pdf_file_path)

reader.pages.each do |page|
  page.text.lines.each do |line|
    # each line of the page text can now be handled one at a time
  end
end

Scan line by line:

# offset(0) returns [begin, end] of the match; both variables are nil if there is no match
start, stop = s[0].match(/Rel type/)&.offset(0)

Check whether we have found the line with the release type header:

if (start != nil)

Split the current line and take the data text from it (the split line) and from the following lines:

data = s[0].scan(/Rel type.*Arrest Location(.*)/)
puts data[0][0][90 .. 105].strip
puts s[1][start .. -1].strip
puts s[2][start .. -1].strip

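Originally I wanted to use start and stop to determine whether the text spans multiple columns. But the columns are fixed (in my experience scraping PDFs), so the text position should be the same on every line and you can use fixed offsets. If not, use start and stop.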
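If the offsets turn out not to be fixed, one way to use the header positions is to take the start of "Rel type" and the start of the next header label as the column boundaries. The following is only a minimal sketch: column_range is a hypothetical helper, the sample rows are shortened and made up for illustration, and it assumes the data row is aligned with the header row.

```ruby
# Hypothetical helper: character range of a column, taken from the header row.
# Returns nil when either label is missing.
def column_range(header, label, next_label)
  first = header.index(label)
  return nil unless first

  last = header.index(next_label, first)
  return nil unless last

  first...last
end

# Shortened, made-up sample; ljust keeps header and data row aligned.
header = "Arrest Date Time".ljust(19) + "Rel type".ljust(15) +
         "Release Date".ljust(13) + "Arrest Location"
row    = "1/28/23 2:42 pm".ljust(19) + "Court Order -".ljust(15) +
         "01/30/2023".ljust(13) + "400 Block HANLON WAY"

range = column_range(header, "Rel type", "Release Date")
puts row[range].strip unless range.nil?
```

Deriving the range from the labels avoids hard-coding numbers such as 90 .. 105, at the cost of breaking if a cell's text overflows into the neighbouring column.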

Running this code gives me:

Court Order -
Own
Recognizance
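
To end up with a single value such as "Court Order - Own Recognizance", the three pieces can be joined. The sketch below is an illustration rather than a definitive implementation: joined_rel_type and its width parameter are hypothetical, the sample lines are shortened and made up, and it assumes the wrapped cell ends at the first blank line and that no other column wraps onto those continuation lines.

```ruby
# Hypothetical helper: join a wrapped "Rel type" cell into one string.
# width is an assumed column width, not something read from the PDF.
def joined_rel_type(lines, width = 15)
  idx = lines.index { |line| line.include?("Rel type") }
  return nil unless idx

  start = lines[idx].index("Rel type")
  # The data follows the header on the same line, after "Arrest Location".
  tail  = lines[idx][/Arrest Location(.*)/, 1].to_s
  parts = [tail[start, width]]

  # Continuation lines run until the first blank line.
  lines[(idx + 1)..-1].take_while { |line| !line.strip.empty? }.each do |line|
    parts << line[start..-1]
  end

  parts.compact.map(&:strip).reject(&:empty?).join(" ")
end

# Shortened, made-up sample with the same shape as the parsed page.
header = "Arrest Date Time".ljust(19) + "Rel type".ljust(15) +
         "Release Date".ljust(13) + "Arrest Location"
data   = "1/28/23 2:42 pm".ljust(19) + "Court Order -".ljust(15) +
         "01/30/2023".ljust(13) + "400 Block HANLON WAY"
lines  = [header + " " + data, (" " * 24) + "Own", (" " * 22) + "Recognizance",
          "", "  Arrest Type"]

puts joined_rel_type(lines)
```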
