Python regex evading optional non-capturing group when there's nothing else to match in the second group

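Q:

I'm trying to clean up a file in which some UOMs (units of measure) are written together with the item name, with no space in between. I came up with a regex that matches each UOM and its variants. It works fine on its own, but once I add groups to capture the item name as well, it no longer gives the expected output.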

import re

uom_regex = 'box(?:es)?|bxs|bag(?:s)?'
test_text = ["box", "boxes", "bag", "bags", "bxs"]

for text in test_text:
    match = re.search(uom_regex, text)
    print(match.group() if match else "No match")

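This regex works well on its own and captures all the UOMs correctly.

However, when I combine the same regex with another part in order to insert a space where one is due, it works fine except when the word has nothing extra after the UOM (for example, the first 2 cases below).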

import re

uom_regex = 'box(?:es)?|bxs|bag(?:s)?'
regex = r'({0})([a-zA-Z]+)'.format(uom_regex)

test_strings = ["boxes", "bags", "boxesapple", "boxapple", "bagapple", 'bagsapple']

for test_string in test_strings:
    result = re.sub(regex, r'\1 \2', test_string)
    print(f"Original: {test_string}")
    print(f"Modified: {result}\n")

This is the output.

Original: boxes
Modified: box es

Original: bags
Modified: bag s

Original: boxesapple
Modified: boxes apple

Original: boxapple
Modified: box apple

Original: bagapple
Modified: bag apple

Original: bagsapple
Modified: bags apple

However, the first 2 outputs should look like this.

Original: boxes
Modified: boxes

Original: bags
Modified: bags

python-3.x regex

import re
 
uom_regex = 'box(?:es)?|bxs|bags?'
regex = r'(?=({0}))\1([a-zA-Z]+)'.format(uom_regex)
 
test_strings = ["boxes", "bags", "boxesapple", "boxapple", "bagapple", 'bagsapple']
 
for test_string in test_strings:
    result = re.sub(regex, r'\1 \2', test_string)
    print(f"Original: {test_string}")
    print(f"Modified: {result}\n")

See the Python demo.

Output:

Original: boxes
Modified: boxes

Original: bags
Modified: bags

Original: boxesapple
Modified: boxes apple

Original: boxapple
Modified: box apple

Original: bagapple
Modified: bag apple

Original: bagsapple
Modified: bags apple
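
Why this helps, as far as I can tell: with a plain capturing group the engine is free to backtrack into the group and shrink "boxes" to "box" so that `[a-zA-Z]+` can grab "es". A lookahead is not re-entered once it has succeeded, so capturing the unit inside `(?=...)` and then consuming it with `\1` locks in the first alternative that matched. The small comparison below is only a sketch; the names `plain` and `locked` are mine, not from the answer.

```python
import re

uom_regex = 'box(?:es)?|bxs|bags?'

# Plain capturing group: the engine may backtrack into the group,
# giving up the optional "es" so that [a-zA-Z]+ has something to match.
plain = re.compile(r'({0})([a-zA-Z]+)'.format(uom_regex))

# Lookahead + backreference: group 1 is fixed once the lookahead succeeds,
# and \1 must then consume exactly that text.
locked = re.compile(r'(?=({0}))\1([a-zA-Z]+)'.format(uom_regex))

for word in ["boxes", "boxesapple"]:
    m_plain = plain.search(word)
    m_locked = locked.search(word)
    print(word,
          m_plain.groups() if m_plain else None,
          m_locked.groups() if m_locked else None)
```

On "boxes" the plain pattern splits the word into "box" and "es", while the locked pattern finds no match at all, so `re.sub` leaves the word untouched.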

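0 votes Cary Swoveland 11/13/2023 #2

If you use the third-party regex module from PyPI (which is similar to PCRE), you can replace matches of the following regular expression with a space.

\b(?:box(?:(?!es)|es)|bag(?:(?!s)|s)|bxs)\K(?=[a-zA-Z])

Demo

This expression has the following elements.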

\b           # match a word boundary
(?:          # begin (outer) non-capture group
  box        # match literal
  (?:        # begin non-capture group
    (?!es)   # negative lookahead asserts next two chars are not 'es'
  |          # or
    es       # match literal
  )          # end non-capture group
|            # or
  bag        # match literal
  (?:        # begin non-capture group
    (?!s)    # negative lookahead asserts next char is not 's'
  |          # or
    s        # match literal
  )          # end non-capture group
|            # or
  bxs        # match literal
)            # end (outer) non-capture group
\K           # reset start of match and discard previously-consumed chars
(?=[a-zA-Z]) # positive lookahead asserts next char is a letter
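
If you would rather stay with the standard library, note that `re` has no `\K`. One way to approximate the same idea is to capture the unit and put it back in the replacement, keeping the trailing lookahead. This is an adaptation rather than the pattern given above, and the names `UOM_SPLIT` and `add_space` are made up for illustration.

```python
import re

# Same alternatives as the pattern above, but the unit is captured
# (instead of being discarded with \K) and re-emitted in the replacement.
UOM_SPLIT = re.compile(r'\b(box(?:(?!es)|es)|bag(?:(?!s)|s)|bxs)(?=[a-zA-Z])')

def add_space(text):
    """Insert a space after a leading unit when more letters follow it."""
    return UOM_SPLIT.sub(r'\1 ', text)

for s in ["boxes", "bags", "boxesapple", "boxapple", "bagapple", "bagsapple"]:
    print(f"Original: {s}")
    print(f"Modified: {add_space(s)}\n")
```

Because of the `\b`, this only splits a unit that starts a word, which matches the behaviour of the pattern above.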
