Asked by: famas23 · Asked: 9/6/2023 · Last edited by: famas23 · Updated: 10/16/2023 · Views: 36
Chunk a text based on a regex, keeping the delimiter in each chunk
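Question:
I have a long text of about 10k characters that contains many sections. I need to chunk the text by section, so that each chunk contains exactly one section. Each section starts with a heading of the form "SECTION|RUBRIQUE n", where n is the section number.
Here is my attempt: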
import re

def get_text_chunks(text):
    section_pattern = r"(SECTION|RUBRIQUE) \d+: .+"
    section_headings = re.findall(section_pattern, text)
    chunks = re.split(section_pattern, text)
    return chunks

long_text = """
This text should be ignored.
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
RUBRIQUE 2: HAZARDS IDENTIFICATION
2.2. Another Classification
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
"""
chunks = get_text_chunks(long_text)
for chunk in chunks:
    print(chunk)
    print("-----------------------")
But I get this output:
This text should be ignored.
-----------------------
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
-----------------------
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION
-----------------------
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------
instead of this output:
RUBRIQUE 1: IDENTIFICATION OF THE SUBSTANCE/MIXTURE AND OF THE COMPANY/UNDERTAKING
1.1. Product identifier
Product name: - SANITARY DEODORIZING DESCALING CLEANER
Product code: DUOBAC SANIT
1.2. Applicable uses of the substance or mixture and uses advised against
SANITARY HYGIENE
-----------------------
RUBRIQUE 2: HAZARDS IDENTIFICATION
2.1. Classification of the substance or mixture
In accordance with Regulation (EC) No. 1272/2008 and its adaptations.
Skin corrosion, Category 1B (Skin Corr. 1B, H31 4).
-----------------------
PS: My input text does not start with a SECTION|RUBRIQUE heading on its first line, so everything before the first heading should be ignored.
Answer:
1 vote
trincot
9/6/2023
#1
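You can use a look-ahead so that the delimiter match has zero width and is therefore not consumed by the split. Also, do not use capturing groups, because re.split adds the text of every capturing group to the output list as extra elements: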
section_pattern = r"(?=(?:SECTION|RUBRIQUE) \d+: .+)"
chunks = re.split(section_pattern, text)[1:]
With [1:] you drop the text that precedes the first occurrence of the delimiter.
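To see the look-ahead split end to end, here is a minimal, self-contained sketch. The shortened sample text and the constant name SECTION_PATTERN are made up for illustration, and the sketch assumes Python 3.7 or later, where re.split accepts patterns that match an empty string.

```python
import re

# Zero-width look-ahead: split *before* each heading without consuming it.
# (?:...) is non-capturing, so re.split adds no extra elements to the result.
SECTION_PATTERN = r"(?=(?:SECTION|RUBRIQUE) \d+: .+)"


def get_text_chunks(text):
    # [1:] drops whatever precedes the first heading.
    return re.split(SECTION_PATTERN, text)[1:]


# Shortened, illustrative input (not the full text from the question).
sample = (
    "This text should be ignored.\n"
    "RUBRIQUE 1: IDENTIFICATION\n"
    "1.1. Product identifier\n"
    "RUBRIQUE 2: HAZARDS IDENTIFICATION\n"
    "2.1. Classification\n"
)

for chunk in get_text_chunks(sample):
    print(chunk.strip())
    print("-----------------------")
```

Each chunk now starts with its own heading line, because the look-ahead only marks the split position and leaves the heading text in place.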
Comments
0 votes
famas23
10/16/2023
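@tricot, thank you very much for your help. Do you know how to group sections together? The current solution returns two items for "RUBRIQUE 2", and they should be concatenated into a single item.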
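One possible way to handle this follow-up, offered as a sketch and not as the answerer's method: split with the same look-ahead pattern, then merge chunks whose heading lines are identical. The function name get_merged_chunks and the sample text are hypothetical, and the sketch assumes that "same section" means an exactly repeated heading line and that the merged bodies should simply be joined with newlines.

```python
import re

SECTION_PATTERN = r"(?=(?:SECTION|RUBRIQUE) \d+: .+)"


def get_merged_chunks(text):
    """Split on section headings, then merge chunks that share a heading."""
    merged = {}  # heading line -> list of bodies, kept in first-seen order
    for chunk in re.split(SECTION_PATTERN, text)[1:]:
        # The first line of each chunk is the heading; the rest is the body.
        heading, _, body = chunk.partition("\n")
        merged.setdefault(heading.strip(), []).append(body.strip("\n"))
    return [
        heading + "\n" + "\n".join(bodies)
        for heading, bodies in merged.items()
    ]


# Illustrative input with a repeated "RUBRIQUE 2" heading.
sample = (
    "This text should be ignored.\n"
    "RUBRIQUE 1: IDENTIFICATION\n"
    "1.1. Product identifier\n"
    "RUBRIQUE 2: HAZARDS IDENTIFICATION\n"
    "2.1. Classification\n"
    "RUBRIQUE 2: HAZARDS IDENTIFICATION\n"
    "2.2. Another Classification\n"
)

for chunk in get_merged_chunks(sample):
    print(chunk)
    print("-----------------------")
```

If repeated headings can differ slightly in wording, keying the dictionary on the section number instead of the full heading line may be a better fit.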