Chunk a text based on a regular expression, keeping the delimiter in each chunk

Question:

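I have a long text of about 10k characters that contains many sections. I need to chunk the text by section, so that each chunk contains exactly one section. Each section starts with a heading of the form "SECTION|RUBRIQUE n", where n is the section number.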

Here is my attempt:

import re

def get_text_chunks(text):
    section_pattern = r"(SECTION|RUBRIQUE) \d+: .+"
    section_headings = re.findall(section_pattern, text)
    chunks = re.split(section_pattern, text)

    return chunks

long_text = """
This text should be ignored.
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE

RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).

RUBRIQUE 2: HAZARDS IDENTIFICATION
2.2. Another Classification
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
"""

chunks = get_text_chunks(long_text)
for chunk in chunks:
    print(chunk)
    print("-----------------------")

But I get this output:

This text should be ignored.
-----------------------
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
-----------------------
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION
-----------------------
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------

instead of this expected output:

RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION

2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------

PS: My input text does not start with a SECTION|RUBRIQUE heading on the first line, so the first part should be ignored.
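
One possible approach, offered as a minimal sketch rather than a definitive answer: instead of calling re.split, find where each heading starts and slice the text between consecutive starts. Each chunk then keeps its own heading, and anything before the first heading is dropped. The helper name split_into_sections and the shortened sample text are hypothetical. The pattern assumes every heading sits at the beginning of a line and looks like "SECTION n:" or "RUBRIQUE n:".

```python
import re

# Assumption: a heading starts at the beginning of a line and has the form
# "SECTION <n>:" or "RUBRIQUE <n>:". The group is non-capturing on purpose.
HEADING = re.compile(r"^(?:SECTION|RUBRIQUE) \d+:.*$", re.MULTILINE)


def split_into_sections(text):
    """Return one chunk per heading; text before the first heading is ignored."""
    # Offsets at which each heading begins.
    starts = [m.start() for m in HEADING.finditer(text)]
    # Each chunk ends where the next heading begins (or at the end of the text).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


if __name__ == "__main__":
    sample = """
This text should be ignored.
RUBRIQUE 1: IDENTIFICATION
1.1. Product identifier

RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
"""
    for chunk in split_into_sections(sample):
        print(chunk)
        print("-----------------------")
```

Note that this sketch produces one chunk per heading occurrence, so a heading that appears twice (such as the repeated "RUBRIQUE 2" in the sample above) yields two separate chunks. Merging them would need an extra step that the question does not specify.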

python regex split
