How to filter text within a certain area using pdfPlumber and OpenCV?

Question:

I have a bunch of PDF files from conference proceedings.
Each PDF file is laid out like this:

                   Title with bold, large size font
    Author1                Author2                 AuthorN
Affiliation1          Affiliation2            AffiliationN
Email1                  Email2                   EmailN

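I use pdfPlumber to pick the characters with the largest font size as the title, and that works.
To get the author, affiliation and email information, I use cv2 to detect the text blocks
and then filter the characters that fall inside each cv2 box.
But this part does not work.
It seems the x/y of the box points (generated by cv2) are not in the same coordinates as the x0/x1/top/bottom generated by pdfPlumber.

Here is how I did it. Any help or comments would be appreciated.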

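import numpy as np

def getCharsInBox(box):
    # box: 4x2 array of corner points from cv2, in image pixel coordinates
    left_point_x = np.min(box[:, 0])
    right_point_x = np.max(box[:, 0])
    top_point_y = np.min(box[:, 1])
    bottom_point_y = np.max(box[:, 1])
    # Keep only the objects whose position falls inside the box
    return lambda x: (x.get("x0", 0) >= left_point_x and
                      x.get("x0", 0) <= right_point_x and
                      x.get("top", 0) >= top_point_y and
                      x.get("bottom", 0) <= bottom_point_y)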


for box in region:
    filtered=page.filter(getCharsInBox(box))
    pdfAAE=filtered.extract_text()

Tags: python, pdf, pdfplumber

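Answer:

def getCharsInBox(box):
    # img.shape = (image height, image width, channels)
    # Scale the cv2 pixel coordinates to pdfPlumber page coordinates
    left_point_x = np.min(box[:, 0]) * page.width / img.shape[1]
    right_point_x = np.max(box[:, 0]) * page.width / img.shape[1]
    top_point_y = np.min(box[:, 1]) * page.height / img.shape[0]
    bottom_point_y = np.max(box[:, 1]) * page.height / img.shape[0]
    # croppedY is not defined in this snippet; it appears to be the
    # vertical offset of the cropped page region that was rendered to img
    return lambda x: (x.get("x0", 0) >= left_point_x and
                      x.get("x0", 0) <= right_point_x and
                      x.get("top", 0) - croppedY >= top_point_y and
                      x.get("bottom", 0) - croppedY <= bottom_point_y)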


for box in region:
    filtered=page.filter(getCharsInBox(box))
    pdfAAE=filtered.extract_text()
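
The scaling step can also be looked at on its own, without a PDF or an image. The following is only a minimal sketch under some assumptions: the image is a rendering of the whole page (no cropping offset), and the helper names `pixel_box_to_pdf_bbox` and `chars_in_bbox` are made up for illustration and are not part of pdfPlumber or cv2. It converts a cv2-style box in pixels into an `(x0, top, x1, bottom)` tuple in page units, then filters character dicts shaped like the ones pdfPlumber produces.

```python
def pixel_box_to_pdf_bbox(box, img_shape, page_width, page_height):
    """Convert a box in image pixels to (x0, top, x1, bottom) in page units.

    box: iterable of (x, y) corner points, e.g. the 4x2 result of cv2.boxPoints
    img_shape: (rows, cols, ...) as returned by img.shape
    Assumes the image shows the full page, so only scaling is needed.
    """
    img_height, img_width = img_shape[0], img_shape[1]
    scale_x = page_width / img_width
    scale_y = page_height / img_height
    xs = [float(x) for x, _ in box]
    ys = [float(y) for _, y in box]
    return (min(xs) * scale_x, min(ys) * scale_y,
            max(xs) * scale_x, max(ys) * scale_y)


def chars_in_bbox(chars, bbox):
    """Return the character dicts that lie entirely inside bbox."""
    x0, top, x1, bottom = bbox
    return [c for c in chars
            if c["x0"] >= x0 and c["x1"] <= x1
            and c["top"] >= top and c["bottom"] <= bottom]


# Hypothetical numbers: a 612 x 792 page rendered to a 1224 x 1584 pixel image
box = [(100, 200), (300, 200), (300, 260), (100, 260)]
bbox = pixel_box_to_pdf_bbox(box, (1584, 1224, 3), 612, 792)

chars = [
    {"text": "A", "x0": 60, "x1": 66, "top": 105, "bottom": 115},
    {"text": "B", "x0": 200, "x1": 206, "top": 105, "bottom": 115},
]
inside = chars_in_bbox(chars, bbox)
```

Because pdfPlumber measures `top` and `bottom` from the top of the page, the y axis points the same way as in the image and no flip is needed. The resulting tuple has the same `(x0, top, x1, bottom)` order that pdfPlumber's `page.within_bbox(...)` and `page.crop(...)` take, so it could probably be passed to either of those instead of a hand-written filter.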
