Chunk a text based on a regular expression, including the delimiter in each chunk

Asked by: famas23 Asked: 9/6/2023 Last edited by: famas23 Updated: 10/16/2023 Views: 36

Q:

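I have a long text, about 10k characters, that contains many sections. I need to chunk the text by section, so that each chunk contains exactly one section. In the text template, every section starts with a heading of the form "SECTION|RUBRIQUE n", where n is the section number.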

Here is my attempt:

import re

def get_text_chunks(text):
    section_pattern = r"(SECTION|RUBRIQUE) \d+: .+"
    section_headings = re.findall(section_pattern, text)
    chunks = re.split(section_pattern, text)

    return chunks

long_text = """
This text should be ignored.
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE

RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).

RUBRIQUE 2: HAZARDS IDENTIFICATION
2.2. Another Classification
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
"""

chunks = get_text_chunks(long_text)
for chunk in chunks:
    print(chunk)
    print("-----------------------")


But I get this output:

This text should be ignored.
-----------------------
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
-----------------------
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION
-----------------------
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------

instead of this output:

RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION

2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------

PS: My input text does not start with a SECTION|RUBRIQUE heading on the first line, so the first part should be ignored.

python regex split


A:

1 upvote trincot 9/6/2023 #1

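You can use a look-ahead, so that the delimiter match has zero width and no heading text is consumed by the split. Also, do not use capture groups, because the text they capture is added to the output list as extra elements: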

    section_pattern = r"(?=(?:SECTION|RUBRIQUE) \d+: .+)"
    chunks = re.split(section_pattern, text)[1:]

With [1:] you ignore the text that precedes the first occurrence of the delimiter.
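
For reference, here is a minimal self-contained sketch of that approach. The function name split_sections and the shortened sample text are assumptions made for illustration and are not part of the answer. Splitting on a zero-width match relies on Python 3.7 or later.

```python
import re

# Zero-width look-ahead: it matches the position just before a heading,
# so the heading text stays at the start of the chunk that follows it.
# The inner group is non-capturing, so re.split adds no extra elements.
SECTION_PATTERN = re.compile(r"(?=(?:SECTION|RUBRIQUE) \d+: .+)")


def split_sections(text):
    """Return one chunk per heading; text before the first heading is dropped."""
    return SECTION_PATTERN.split(text)[1:]


# Shortened, made-up sample in the same shape as the question's input.
sample = (
    "This text should be ignored.\n"
    "RUBRIQUE 1: FIRST\n"
    "1.1. Product identifier\n"
    "\n"
    "RUBRIQUE 2: SECOND\n"
    "2.1. Classification\n"
)

for chunk in split_sections(sample):
    print(chunk.strip())
    print("-----------------------")
```

The pattern is not anchored to the start of a line, so a heading-like phrase in the middle of a line would also start a new chunk. If that matters for your input, one option is to put ^ at the start of the look-ahead and compile with re.MULTILINE.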

Comments

0 upvotes famas23 10/16/2023
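@trincot, thank you very much for your help. Do you know how to group sections together? The current solution returns 2 items for "RUBRIQUE 2", and they should be joined into a single item.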
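
The thread contains no reply to this follow-up. One possible approach, offered as a sketch rather than a confirmed answer, is to split with the same look-ahead and then group the chunks by their heading line, appending the body of a repeated heading to the first chunk that has that heading. The names merge_sections, SPLIT and HEADING are hypothetical, and the sketch assumes that a repeated section has exactly the same heading line each time.

```python
import re

SPLIT = re.compile(r"(?=(?:SECTION|RUBRIQUE) \d+: .+)")
HEADING = re.compile(r"(?:SECTION|RUBRIQUE) \d+: .+")


def merge_sections(text):
    """Split text into sections and merge chunks that share a heading line."""
    merged = {}  # heading -> list of bodies; dicts preserve insertion order
    for chunk in SPLIT.split(text)[1:]:
        # Every chunk starts at a heading, so this match always succeeds.
        match = HEADING.match(chunk)
        heading = match.group(0).rstrip()
        body = chunk[match.end():].strip()
        merged.setdefault(heading, []).append(body)
    return [heading + "\n" + "\n".join(bodies) for heading, bodies in merged.items()]


# Made-up sample in which "RUBRIQUE 2" appears twice.
sample = (
    "ignored\n"
    "RUBRIQUE 1: A\n1.1 x\n\n"
    "RUBRIQUE 2: B\n2.1 y\n\n"
    "RUBRIQUE 2: B\n2.2 z\n"
)

for section in merge_sections(sample):
    print(section)
    print("-----------------------")
```

If the headings of a repeated section can differ in wording, grouping on the keyword and number alone (for example "RUBRIQUE 2") would be a reasonable variation.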