How do I work around a font issue in text extraction using iTextSharp Core?

Question:

I am trying to extract text from this PDF file:

http://www.in.gov/legislative/iac/T03270/A00200.PDF

The error I get:

PDF Extraction Error
Dictionary doesn't have supported font data.

iText.Kernel.Exceptions.PdfException

at iText.Kernel.Font.PdfFontFactory.CreateFont(PdfDictionary fontDictionary)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.GetFont(PdfDictionary fontDict)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.SetTextFontOperator.Invoke(PdfCanvasProcessor processor, PdfLiteral operator, IList`1 operands)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.InvokeOperator(PdfLiteral operator, IList`1 operands)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessContent(Byte[] contentBytes, PdfResources resources)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessPageContent(PdfPage page)
at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy, IDictionary`2 additionalContentOperators)
at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy)

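Is there any way to work around this? With the simple extraction strategy, why does the font matter at all?

I also tried the location-based extraction strategy, with the same result.

Unfortunately, this is a government document, so if it is corrupt in some way, I have no control over it. I have extracted text from hundreds of other documents on the same site without any problems.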

pdf itext7

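Thanks. I was looking for a way to keep extracting with iText, without having to repair the file by exporting it from a more lenient PDF reader. Since every reader I can find seems to display it correctly, I have to assume this is either some PDF nuance that iText does not know about or an error that should be recoverable. I did not know that PDFs use glyph mappings for their fonts; I thought they were mostly plain ASCII text with escape sequences for some of the upper characters. Until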

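0 votes iPDFdev 10/25/2023

The document uses simple fonts with WinAnsi encoding, nothing special, so text extraction should work without any problems. This could be a bug in iTextSharp Core, and iText support should be able to help you with it.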

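Answer:

0 votes reZach 10/30/2023 #1

I'm not entirely sure what font issue you are seeing, but I have a program that extracts text from PDFs. I ran it against the document you referenced, https://iac.iga.in.gov/iac//T03270/A00200.PDF, and was able to extract the text from every page without the program throwing any exception. Take a look at what I used below and see whether it gives you any problems. I have a link to a working console application that uses this code here.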

using System;
using System.IO;
using System.Threading.Tasks;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

namespace ScanTextInPDFs
{
    internal class Program
    {
        public static async Task Main(string[] args)
        {
            string executingDirectory = AppContext.BaseDirectory;

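            // Build the path portably instead of hard-coding a Windows separator.
            string pdfPath = Path.Combine(executingDirectory, "PDFs", "A00200.pdf");
            byte[] bytes = await File.ReadAllBytesAsync(pdfPath);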
            string textToFind = "Lorem ipsum";
            bool foundText = false;

            using (MemoryStream memoryStream = new MemoryStream(bytes))
            {
                using PdfReader pdfReader = new PdfReader(memoryStream);
                using PdfDocument pdfDocument = new PdfDocument(pdfReader);

                for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
                {
                    PdfPage pdfPage = pdfDocument.GetPage(page);
                    string pageText = PdfTextExtractor.GetTextFromPage(pdfPage, new SimpleTextExtractionStrategy());

                    if (pageText.Contains(textToFind, StringComparison.Ordinal))
                        foundText = true;
                }
            }

            if (foundText)
                Console.WriteLine($"Found '{textToFind}' in the pdf.");
            else
                Console.WriteLine($"Did not find '{textToFind}' in the pdf.");
        }
    }
}

Here is the output of my program when searching for text in this PDF.
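
If the exception from the question still shows up with your iText version, one possible stopgap is to catch it per page so that a single unreadable font does not abort the whole document. The following is only a minimal sketch, not a fix for the underlying font problem: the class name TolerantExtractor and the method ExtractPages are hypothetical names made up for illustration, and the sketch assumes the exception type is iText.Kernel.Exceptions.PdfException, as shown in the stack trace above. Pages that fail are skipped and reported rather than recovered.

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Exceptions;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper, not part of iText: extracts text page by page and
// records the pages on which iText throws instead of aborting the whole run.
public static class TolerantExtractor
{
    public static IDictionary<int, string> ExtractPages(string path, IList<int> failedPages)
    {
        var textByPage = new Dictionary<int, string>();

        using PdfReader reader = new PdfReader(path);
        using PdfDocument document = new PdfDocument(reader);

        for (int page = 1; page <= document.GetNumberOfPages(); page++)
        {
            try
            {
                // A fresh strategy per page, so text from earlier pages does not accumulate.
                textByPage[page] = PdfTextExtractor.GetTextFromPage(
                    document.GetPage(page), new SimpleTextExtractionStrategy());
            }
            catch (PdfException ex)
            {
                // Assumption: font failures surface as PdfException, as in the
                // stack trace from the question. The page is skipped, not repaired.
                failedPages.Add(page);
                Console.Error.WriteLine($"Page {page} skipped: {ex.Message}");
            }
        }

        return textByPage;
    }
}
```

Calling TolerantExtractor.ExtractPages with the path to A00200.pdf and an empty List<int> returns the text of every page that could be read and fills the list with the page numbers that need a closer look.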
