Regular expression to match any string consisting of letters, digits and apostrophes, but exclude underscores

Asked by: Sven Madson · Asked: 10/31/2023 · Last edited by: tripleee, Sven Madson · Updated: 10/31/2023 · Views: 36

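Q:

I am desperately looking for a regular expression in Python, using only the re library. The expression should split a text into individual strings, each consisting of letters from any language, digits and apostrophes.

For example:

"Test, this is a String 123!" should become: ["Test", "this", "is", "a", "String", "123"]
"λάκιäsañ" should become: ["λάκιäsañ"]
"not_underscores" should become: ["not", "underscores"]

So far I have tried: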

[\w\']+

Here a string such as "included_Underscores" is not split into "included" and "underscores".

([^\W_]+(?:\')*)

This recognizes everything it should, but the characters after a "'" are split off.

([\w\'](?<=[^_]))+

Here, the last character of each word is split off.
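
A short sketch of what the three attempts return, assuming they are run through re.findall (the question does not say which re function was used). With a capturing group, re.findall returns the group rather than the whole match, and a repeated group keeps only its last iteration, which would explain the stray last characters in the third attempt.

```python
import re

# Assumption: the patterns from the question are applied with re.findall.
attempt_1 = re.findall(r"[\w\']+", "included_Underscores")
attempt_2 = re.findall(r"([^\W_]+(?:\')*)", "don't")
attempt_3 = re.findall(r"([\w\'](?<=[^_]))+", "not_underscores")

print(attempt_1)  # ['included_Underscores']
print(attempt_2)  # ["don'", 't']
print(attempt_3)  # ['t', 's']

# Reading the whole match (group 0) instead of the repeated group:
whole = [m.group(0) for m in re.finditer(r"([\w\'](?<=[^_]))+", "not_underscores")]
print(whole)  # ['not', 'underscores']
```

Under that assumption, the third pattern already finds the right spans, and only the capturing group hides them.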

Tags: regex python-re

A:

0 votes · entorb · 10/31/2023 · #1

How about

import re


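def splitIt(line: str) -> list[str]:
    # Treat underscores as separators, since \w would otherwise keep them.
    line = re.sub(r"[_]", " ", line)
    # Trim trailing and leading separators so re.split yields no empty edge items.
    line = re.sub(r"[^\w']+$", "", line)
    line = re.sub(r"^[^\w']+", "", line)
    # Split on runs of anything that is not a word character or an apostrophe.
    res = re.split(r"[^\w']+", line)
    return res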


assert splitIt("Test, this is a String 123") == [
    "Test",
    "this",
    "is",
    "a",
    "String",
    "123",
]

assert splitIt("asdf λάκιäsañ 123") == [
    "asdf",
    "λάκιäsañ",
    "123",
]

assert splitIt("not_underscores") == [
    "not",
    "underscores",
]


assert splitIt("not' underscores") == [
    "not'",
    "underscores",
]

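1 vote · tripleee · 10/31/2023 · #2

\w includes the underscore. If you want a different definition, you need to spell it out. Fortunately, in this case it is easy to define a complement: any character that is neither a non-\w character nor an underscore, or else an apostrophe.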

re.findall(r"(?:[^\W_]|')+", text)
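
The double negation in [^\W_] can be checked character by character. This is only an illustrative sketch, and the name WORD_CHAR is made up here, not part of the answer.

```python
import re

# "Not a non-word character, and not an underscore": letters and digits only.
WORD_CHAR = re.compile(r"[^\W_]")

for ch in ["a", "Z", "7", "λ", "ñ"]:
    assert WORD_CHAR.fullmatch(ch)  # letters and digits from any script are accepted

for ch in ["_", " ", ",", "!", "'"]:
    assert WORD_CHAR.fullmatch(ch) is None  # underscore, whitespace, punctuation are rejected
```

The apostrophe is rejected by the class itself, which is why the pattern adds it back through the |' alternative.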

Demo:

>>> import re
>>> re.findall(r"(?:[^\W_]|')+", "Test, this is a String 123!")
['Test', 'this', 'is', 'a', 'String', '123']
>>> re.findall(r"(?:[^\W_]|')+", "λάκιäsañ")
['λάκιäsañ']
>>> re.findall(r"(?:[^\W_]|')+", "not_underscores")
['not', 'underscores']
>>> re.findall(r"(?:[^\W_]|')+", "don't worry, be happy")
["don't", 'worry', 'be', 'happy']

One obvious drawback is that single quotes around a string will be included too.

>>> re.findall(r"(?:[^\W_]|')+", "'scare quotes' are scary")
["'scare", "quotes'", 'are', 'scary']

Sometimes, though, they are properly part of the word.

>>> re.findall(r"(?:[^\W_]|')+", "vi flytt' int'")
['vi', "flytt'", "int'"]
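
If surrounding quotes matter more than trailing apostrophes, one possible variation (an assumption, not part of the answer above) is to accept an apostrophe only between two letters or digits. The name INNER_APOSTROPHE is made up for this sketch, and the trade-off is that forms such as flytt' lose their final apostrophe.

```python
import re

# Hypothetical variant: an apostrophe counts only when letters or digits sit on both sides of it.
INNER_APOSTROPHE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")

print(INNER_APOSTROPHE.findall("'scare quotes' are scary"))  # ['scare', 'quotes', 'are', 'scary']
print(INNER_APOSTROPHE.findall("don't worry, be happy"))     # ["don't", 'worry', 'be', 'happy']
print(INNER_APOSTROPHE.findall("vi flytt' int'"))            # ['vi', 'flytt', 'int']
```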

0 votes · tripleee · 10/31/2023

Reference: forum.wordreference.com/threads/swedish-vi-flytt-int.912301
