Merging intervals in regex matches

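Question:

I am using Python to find all kinds of references in a text, such as "Annex 2", "Section 17" or "Schedule 12.2". Once the matches are found, the problem is that some of them overlap, and I would like to either join them into a single new string or keep only the longest one and drop its substrings.

To do this, I created several regex patterns to keep the code readable, put them in a list, and call finditer on every pattern in the list. From each match I collect the text and its position in the source text as start and end indices.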

from re import finditer

def get_references(text):
    # references_regex is the list of regex patterns described above
    refs = [{
        'text': match.group(),
        'span': {
            'start': match.start(),
            'end': match.end()
        }}
        for ref in references_regex for match in finditer(ref, text)]
    return refs

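This means that references matched by more than one pattern are still inserted into the results several times, even though they are identical or differ only slightly (for example, several variants of "Section 17.4 of the book").

I tried to merge the overlapping matches with some ad hoc functions, but it still does not work properly.

Do you know of a way to remove the duplicates, or to merge them when they overlap?

For example, I have: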

[{"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
 {"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
 {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]

And I would like to get:

[{"text": "Schedule 15.1 of the Framework Agreement", "span": {"start": 756, "end": 796}},
 {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]

Thanks in advance!

python overlapping-matches

Answer:

def process(match_list):
    if not match_list:
        return []

    # The merge below relies on the matches being ordered by start position
    match_list = sorted(match_list, key=lambda m: (m['span']['start'], m['span']['end']))

    new_list = []
    new_text = match_list[0]['text']
    start, end = match_list[0]['span']['start'], match_list[0]['span']['end']

    for match in match_list[1:]:
        nxt_start, nxt_end = match['span']['start'], match['span']['end']
        # If overlap (touching spans are merged as well)
        if end >= nxt_start:
            # Leading characters of this match that new_text already contains.
            # Measured from the accumulated text rather than from `end`, so a
            # stored end that is off by one does not duplicate or drop a character.
            overlap = max(start + len(new_text) - nxt_start, 0)
            # Merge the text and update the ending position
            new_text += match['text'][overlap:]
            end = max(end, nxt_end)
        else:
            # If not overlap, append the text to the result
            new_list.append({'text': new_text, 'span': {'start': start, 'end': end}})
            # Process the next text
            new_text = match['text']
            start, end = nxt_start, nxt_end

    # Append the last text in the list
    new_list.append({'text': new_text, 'span': {'start': start, 'end': end}})
    return new_list
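
When the original text is still available, stitching the match strings together can be avoided altogether: merge only the (start, end) spans and slice the merged text back out of the source. The following is a minimal sketch of that idea, not the code from the answer above; the function name merge_spans, the sample sentence and the three patterns are made up for illustration.

```python
import re

def merge_spans(text, spans):
    """Merge overlapping (start, end) spans and re-slice text for each merged span."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend that span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [{'text': text[s:e], 'span': {'start': s, 'end': e}} for s, e in merged]

# Hypothetical input, only to show the call
text = "See Schedule 15.1 of the Framework Agreement, and 17.14 of the book."
patterns = [
    r"Schedule \d+\.\d+",
    r"\d+\.\d+ of the Framework Agreement",
    r"\d+\.\d+ of the book",
]
spans = [m.span() for p in patterns for m in re.finditer(p, text)]
result = merge_spans(text, spans)
print([r['text'] for r in result])
```

Because the text is always taken from the source string, the result cannot contain duplicated or missing characters, whatever the overlap looks like. Spans that merely touch are merged too, which matches the `end >= start` test used above.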

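Answer:

def get_s_e(x):
    # Return the (start, end) pair of a match dict
    return x['span']['start'], x['span']['end']


def concat_dict(a):
    # Walk the matches from the highest start position down, folding each
    # overlapping neighbour into the current item (the dicts are modified in place)
    a = sorted(a, key=lambda x: x['span']['start'], reverse=True)

    index = 0
    while index < len(a) - 1:
        cur = a[index]
        nxt = a[index+1]
        cur_st, cur_end = get_s_e(cur)
        nxt_st, nxt_end = get_s_e(nxt)

        if cur_st <= nxt_end:
            # Number of leading characters of nxt that come before cur starts
            join_index = cur_st-nxt_st

            if nxt_end >= cur_end:
                # nxt fully covers cur: keep nxt's text and end
                text = nxt['text']
                a[index]['span']['end'] = nxt_end
            else:
                # Partial overlap: prefix of nxt followed by cur
                text = nxt['text'][:join_index]+cur['text']

            a[index]['text'] = text
            a[index]['span']['start'] = nxt_st

            del a[index+1]
        else:
            index += 1

    return a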

a = [{"text": "Book bf dj Schedule 15.1 of the", "span": {"start": 745, "end": 776}},
     {"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
     {"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
     {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}
    ]
print(concat_dict(a))

Output:

[{'text': '17.14 of the book', 'span': {'start': 1883, 'end': 1900}},
 {'text': 'Book bf dj Schedule 15.1 of the Framework Agreement',
  'span': {'start': 745, 'end': 796}}]
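
The question also mentions the other option: keeping only the longest match and dropping the substrings instead of building a merged string. A rough sketch of that variant follows, assuming the same list-of-dicts layout; the name drop_contained is hypothetical and not part of either answer.

```python
def drop_contained(matches):
    """Keep only matches whose span is not contained in the span of a kept match."""
    # Longest first, so a containing match is always seen before what it contains
    ordered = sorted(matches, key=lambda m: m['span']['start'] - m['span']['end'])
    kept = []
    for m in ordered:
        s, e = m['span']['start'], m['span']['end']
        # Exact duplicates count as contained, so only the first copy survives
        if not any(k['span']['start'] <= s and e <= k['span']['end'] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m['span']['start'])

matches = [
    {"text": "Book bf dj Schedule 15.1 of the", "span": {"start": 745, "end": 776}},
    {"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
    {"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
    {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}},
    {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}},
]
longest = drop_contained(matches)
print([m['text'] for m in longest])
```

Matches that overlap only partially are both kept here, since neither contains the other. The containment check compares every match against all kept ones, which is quadratic but should be fine for the handful of references found in a typical document.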
