Detection of quoted text in sentences

Asked by: Minions  Asked: 6/25/2021  Last edited by: Minions  Updated: 6/25/2021  Views: 68

Q:

I have sentences that contain quoted text, for example:

Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?

I am trying to mask the quoted parts with a regex, but the result is not accurate. For example, for the last sentence:

import re

txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))

The output is:

Reread these sentences: "<quote>" mean?

Instead, it should be:

Reread these sentences: "<quote>" What does the word "courtship" mean?

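Since I have more than 10k instances, it is hard to find one general regex pattern that works for every case.

My question is: is there any library (perhaps based on a neural network?) or method that solves this problem?

python regex machine-learning quotes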

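Comments

1 upvote Tim Roberts 6/25/2021
You need to think of this kind of problem as "match a quote, then any number of non-quote characters, then a quote". If you think of it as "match a quote, then anything, then another quote", you will fail because regex quantifiers are greedy.
0 upvotes Minions 6/25/2021
@anubhava, it doesn't work for this one: Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
0 upvotes Minions 6/25/2021
@TimRoberts Sorry, that's not clear to me. Could you clarify your answer?
0 upvotes Minions 6/25/2021
@anubhava, when it is text enclosed in double quotes.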
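
To make Tim Roberts's comment concrete, here is a minimal sketch (not his code) that runs the greedy pattern from the question next to a "quote, non-quotes, quote" pattern on the same sentence. The helper name `mask_if_long` and the 20-character threshold are assumptions taken over from the question's `{20,}`.

```python
import re

txt = ('Reread these sentences: "This was his courtship, and it lasted all '
       'through the summer." What does the word "courtship" mean?')

# Greedy: ".{20,}" is allowed to cross quotes, so it runs up to the LAST quote.
greedy = re.sub(r'(?<=").{20,}(?=")', "<quote>", txt)


def mask_if_long(match):
    """Mask a quoted span only if its inner text has 20 or more characters."""
    return '"<quote>"' if len(match.group(1)) >= 20 else match.group(0)


# Bounded: [^"]* cannot cross a quote, so each match is exactly one quoted span.
bounded = re.sub(r'"([^"]*)"', mask_if_long, txt)

print(greedy)
print(bounded)
```

The first line printed reproduces the unwanted output from the question; the second keeps "courtship" intact because the match can no longer run past the first closing quote.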

Answers:

1 upvote Ryszard Czech 6/25/2021 #1

For these examples, use

import re
txt = """Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?"""
txt = re.sub(r'''"([^"]*)"''', lambda m: '<quote>' if len(m.group(1))>19 else m.group(), txt)
txt = re.sub(r'“[^“”]{20,}”', '<quote>', txt)
print(txt)

See the Python proof. Use a separate command for each type of quotation mark; that makes the behavior easier to control.

Results

Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  <quote>
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were <quote>?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: <quote> What does the word "courtship" mean?
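
The two substitutions above could be folded into one helper, sketched below under some assumptions: `QUOTE_PAIRS` and `mask_long_quotes` are made-up names, and the list of quote pairs is only a starting point. The sketch also runs the "trial" sentence from the comments, and shows why the length check for straight quotes is safer in the callback than in the pattern.

```python
import re

# Hypothetical list of (opening, closing) quote pairs; extend it for your data.
QUOTE_PAIRS = [('"', '"'), ("“", "”")]


def mask_long_quotes(text, min_len=20):
    """Replace quoted spans with at least min_len inner characters by <quote>."""
    for open_q, close_q in QUOTE_PAIRS:
        # Opening quote, any run of non-quote characters, closing quote.
        pattern = (re.escape(open_q)
                   + "([^" + re.escape(open_q + close_q) + "]*)"
                   + re.escape(close_q))
        # Every pair is consumed in order; the length test happens in the callback.
        text = re.sub(
            pattern,
            lambda m: "<quote>" if len(m.group(1)) >= min_len else m.group(0),
            text,
        )
    return text


trial = ('Reread this: "If anybody had asked trial of answered at once, '
         "'My nose.'\" What is the correct definition of the word "
         '"trial" as it is used here?')
print(mask_long_quotes(trial))

# With identical opening and closing marks, a length inside the pattern lets the
# closing quote of a short span pair up with the opening quote of the next one.
tricky = 'The word "a" is not the same as the much longer phrase "b".'
print(re.sub(r'"[^"]{20,}"', "<quote>", tricky))  # masks the text BETWEEN the quotes
print(mask_long_quotes(tricky))                   # leaves the line unchanged
```

Curly quotes do not have this pairing problem because the opening and closing marks differ, which is why a length inside the pattern is fine for them.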
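
0 upvotes Edoardo Facchinelli 6/25/2021 #2

Another approach is to use a technique entirely different from regex: shlex

The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell. This will often be useful for writing minilanguages (for example, in run control files for Python applications) or for parsing quoted strings.

shlex.split takes quotes into account when splitting a string into words, and passing the optional posix argument as False keeps the quotes in the result. Using its output, you can build a string like the one you describe.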

import shlex

lines = [
'Why did the author use three sentences in a row that start with the words, "it spun"?',
'Why did the queen most likely say  “I would have tea instead.”',
'Why did the fdsfdsf repeat the phrase "he waited" so many times?',
'Why were "the lights of his town growing smaller below them"?',
'What is a fdsfdsf for the word "adjust"?',
'Reread this: "If anybody had asked trial of answered at once, \'My nose.\'" What is the correct definition of the word "trial" as it is used here?',
'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?',
]
for line in lines:
    print(
        " ".join(
            # mask tokens wrapped in straight double quotes, keep everything else
            '"<quote>"' if word[0] == '"' and word[-1] == '"' else word
            for word in shlex.split(line, posix=False)
        )
    )

Output:

Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "<quote>" as it is used here?
Reread these sentences: "<quote>" What does the word "<quote>" mean?
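  • Note 1: shlex does not interpret curly quotes as quotes (e.g. line 2), so if your data contains them, you should .replace() them in each line before feeding it to shlex.
  • Note 2: this replaces every quoted occurrence. If you only want to mask the first one and keep the rest, you can do this (I am pretty sure it could be written better, but take it as a proof of concept):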
for line in lines:
    new_line = []
    quote_count = 0
    for word in shlex.split(line, posix=False):
        if word[0] == '"' and word[-1] == '"':
            if quote_count < 1:
                quote_count += 1
                new_line.append('"<quote>"')
            else:
                new_line.append(word)
        else:
            new_line.append(word)
    print(' '.join(new_line))

Output:

Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "<quote>" What does the word "courtship" mean?
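
Note 1 could look roughly like the sketch below, which turns curly double quotes into straight ones before tokenizing. `CURLY_TO_STRAIGHT` and `mask_with_shlex` are made-up names, and returning the line untouched when shlex cannot parse it is an arbitrary choice for illustration.

```python
import shlex

# Translate curly double quotes to straight ones so shlex treats them as quotes.
CURLY_TO_STRAIGHT = str.maketrans({"“": '"', "”": '"'})


def mask_with_shlex(line):
    """Mask every double-quoted token in line (hypothetical helper)."""
    normalized = line.translate(CURLY_TO_STRAIGHT)
    try:
        tokens = shlex.split(normalized, posix=False)
    except ValueError:
        # shlex raises ValueError on an unclosed quote; returning the line
        # untouched is an arbitrary choice for this sketch.
        return line
    return " ".join(
        '"<quote>"' if tok[0] == '"' and tok[-1] == '"' else tok
        for tok in tokens
    )


print(mask_with_shlex('Why did the queen most likely say  “I would have tea instead.”'))
```

Note that the curly quotes come back as straight quotes in the output, and that runs of whitespace collapse to a single space because the tokens are re-joined.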