VBA - Get text from scanned PDF and save it in Excel

Asked by: Panka Bálint  Asked: 11/11/2023  Updated: 11/11/2023  Views: 59

Q:

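I have a very specific problem. I have code that extracts text from a PDF file and saves it in Excel. The problem is that it does not work on scanned PDF files, because the text in them cannot be read.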

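My code does the following:

  1. Opens the PDF
  2. Gets the page and highlights the text on the page
  3. Saves it in a Variant
  4. Runs through that Variant and writes every word (that was the basic code; now it only writes down a specific value)
  5. Closes the PDF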

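I want it to work on scanned PDFs as well. I think the problem is that it cannot highlight the text, because a scanned file is more like a picture saved as a PDF than a PDF containing real text. Here is my code (I did not write it myself; I found it on the internet):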

Public Function Get_VIN_From_CoC(PDF_File As String, OnWhichPage As Integer) As String

'This procedure reads the text of one PDF page as follows:
'1. Open the PDF file
'2. Acquire the requested page (AcquirePage uses a zero-based page index)
'3. Select all words on the page and collect their text

Dim AC_PD As Acrobat.AcroPDDoc 'access pdf file
Dim AC_Hi As Acrobat.AcroHiliteList 'set selection word count
Dim AC_PG As Acrobat.AcroPDPage 'get the particular page
Dim AC_PGTxt As Acrobat.AcroPDTextSelect 'get the text of selection area
Dim Ct_Page As Long 'count pages in pdf file
Dim j As Long, K As Long 'looping variables
Dim T_Str As String
Dim Hld_Txt As Variant 'get PDF total text into array
Dim VIN As String

Set AC_PD = New Acrobat.AcroPDDoc
Set AC_Hi = New Acrobat.AcroHiliteList

'set maximum selection area of PDF page
AC_Hi.Add 0, 32767

With AC_PD
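    'open PDF file; Open returns False if the file cannot be opened
    If Not .Open(PDF_File) Then
        MsgBox "Cannot open PDF file '" & PDF_File & "'"
        GoTo h_end
    End If
    'get the number of pages of PDF file
    Ct_Page = .GetNumPages
    'if the page count cannot be determined, exit the function
    If Ct_Page = -1 Then
        MsgBox "Cannot determine the number of pages in PDF file '" & PDF_File & "'"
        .Close
        GoTo h_end
    End If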

    T_Str = ""
    'get the page
    Set AC_PG = .AcquirePage(OnWhichPage)
    
    'get the full page selection
    Set AC_PGTxt = AC_PG.CreateWordHilite(AC_Hi)
    
    'if text selected successfully get the all the text into T_Str string
    If Not AC_PGTxt Is Nothing Then
        With AC_PGTxt
            For j = 0 To .GetNumText - 1
                T_Str = T_Str & .GetText(j)
            Next j
        End With
    End If


    'if text was extracted, split T_Str by vbCrLf into the array Hld_Txt,
    'then loop through the lines looking for the label line that precedes the wanted value
    If T_Str <> "" Then
        Hld_Txt = Split(T_Str, vbCrLf)
        For K = 0 To UBound(Hld_Txt)
            T_Str = CStr(Hld_Txt(K))
            If Left(T_Str, 1) = "=" Then T_Str = "'" & T_Str
            MsgBox T_Str
            'the value sits on the line after the label; skip the last line to stay in bounds
            If Right(T_Str, 6) = "(Kg) :" And K < UBound(Hld_Txt) Then VIN = CStr(Hld_Txt(K + 1))
        Next K
    Else
        'inform the user if no text was retrieved from the PDF page
        MsgBox "No text found in page " & OnWhichPage
    End If
    
    .Close
End With

h_end:
Set AC_PGTxt = Nothing
Set AC_PG = Nothing
Set AC_Hi = Nothing
Set AC_PD = Nothing

Get_VIN_From_CoC = VIN

End Function

Can you help me solve this problem?

Excel VBA Acrobat text extraction

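Comments

1 upvote  Shrotter  11/11/2023
If text recognition was not performed when the PDF was created, no text can be extracted. You therefore need to run optical character recognition (OCR).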
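
Following the comment above, one way to confirm that a page is a scan without a text layer is to check whether Acrobat can select any words on it. This is a minimal sketch, assuming the same Acrobat type library reference (full Adobe Acrobat, not Reader) that the question's code already relies on. `PageHasTextLayer` is a hypothetical helper name chosen for illustration; it is not part of the Acrobat API.

```vba
'Hypothetical helper: returns True if Acrobat finds at least one word on the page.
'Assumes a reference to the Acrobat type library, as in the question's code.
Public Function PageHasTextLayer(PDF_File As String, PageIndex As Long) As Boolean
    Dim doc As Acrobat.AcroPDDoc
    Dim hl As Acrobat.AcroHiliteList
    Dim pg As Acrobat.AcroPDPage
    Dim sel As Acrobat.AcroPDTextSelect

    Set doc = New Acrobat.AcroPDDoc
    'Open returns False if the file cannot be opened; the function then returns False
    If Not doc.Open(PDF_File) Then Exit Function

    'page indexes are zero-based, so valid values are 0 .. GetNumPages - 1
    If PageIndex >= 0 And PageIndex < doc.GetNumPages Then
        Set hl = New Acrobat.AcroHiliteList
        hl.Add 0, 32767                      'same full-page selection as the question's code
        Set pg = doc.AcquirePage(PageIndex)
        Set sel = pg.CreateWordHilite(hl)
        'a scanned page without OCR yields no selection, or a selection with no text
        If Not sel Is Nothing Then
            PageHasTextLayer = (sel.GetNumText > 0)
        End If
    End If

    doc.Close
End Function
```

If this returns False for a page, the file most likely needs OCR (for example with Acrobat's text recognition feature or another OCR tool) before the function in the question can find any text. The check only detects the missing text layer; it does not perform OCR itself.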

A: No answers yet