Asked by: Minions · Asked: 6/25/2021 · Last edited by: Minions · Updated: 6/25/2021 · Views: 68
Detection of quoted text in sentences
Q:
I have sentences that contain quoted text, for example:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?
I am trying to mask the quoted part with a regex, but it is not accurate. For example, for the last sentence:
import re
txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))
The output is:
Reread these sentences: "<quote>" mean?
Instead, it should be:
Reread these sentences: "<quote>" What does the word "courtship" mean?
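Since I have more than 10k instances, it is hard to find a general regex pattern that works for all cases.
My question is: is there any library (perhaps based on a neural network?) or method that solves this problem?
Answers:
1 upvote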
Ryszard Czech
6/25/2021
#1
For these examples, use
import re
txt = """Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?"""
txt = re.sub(r'''"([^"]*)"''', lambda m: '<quote>' if len(m.group(1))>19 else m.group(), txt)
txt = re.sub(r'“[^“”]{20,}”', '<quote>', txt)
print(txt)
See the Python proof. Use a separate command for each type of quotation mark; that makes each one easier to control.
Results:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say <quote>
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were <quote>?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: <quote> What does the word "courtship" mean?
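With more than 10k sentences it may be convenient to wrap the two substitutions in one helper that is applied line by line. The following is only a sketch under assumptions: the names mask_quotes, min_len and placeholder are hypothetical, and excluding newlines from the character classes is an added guard (not in the answer above) so that one unbalanced quote cannot swallow the following lines. It also shows why the pattern in the question over-matches: a greedy .{20,} runs to the last quotation mark in the string, whereas [^"]* stops at the next one.

```python
import re

# Hypothetical helper; the patterns follow the answer above, with "\n"
# excluded so a stray quote cannot match across line boundaries.
STRAIGHT = re.compile(r'"([^"\n]*)"')
CURLY = re.compile(r'“([^“”\n]*)”')

def mask_quotes(line, min_len=20, placeholder="<quote>"):
    """Replace quoted spans of at least min_len characters with placeholder."""
    def repl(match):
        # Short quotes (single words, short phrases) are left untouched.
        return placeholder if len(match.group(1)) >= min_len else match.group()
    line = STRAIGHT.sub(repl, line)
    return CURLY.sub(repl, line)

txt = ('Reread these sentences: "This was his courtship, and it lasted all '
       'through the summer." What does the word "courtship" mean?')
print(mask_quotes(txt))
# Reread these sentences: <quote> What does the word "courtship" mean?
```

The threshold of 20 characters is taken from the question; whether it separates "long quotations" from "quoted words" well on the full data set is something to check on a sample.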
0 upvotes
Edoardo Facchinelli
6/25/2021
#2
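Another approach is to use a technique entirely different from regular expressions: shlex.
The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell. This will often be useful for writing minilanguages (for example, in run control files for Python applications) or for parsing quoted strings.
shlex.split takes quotes into account when splitting into words, and passing the optional posix argument as False keeps the quotes in the result. Using its output, you can build a string similar to the one you described.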
import shlex
lines = [
    'Why did the author use three sentences in a row that start with the words, "it spun"?',
    'Why did the queen most likely say “I would have tea instead.”',
    'Why did the fdsfdsf repeat the phrase "he waited" so many times?',
    'Why were "the lights of his town growing smaller below them"?',
    'What is a fdsfdsf for the word "adjust"?',
    'Reread this: "If anybody had asked trial of answered at once, \'My nose.\'" What is the correct definition of the word "trial" as it is used here?',
    'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?',
]
for line in lines:
    print(
        " ".join(
            word
            if word[0] != '"' and word[-1] != '"' else '"<quote>"'
            for word in shlex.split(line, posix=False)
        )
    )
Output:
Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "<quote>" as it is used here?
Reread these sentences: "<quote>" What does the word "<quote>" mean?
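- Note 1: shlex does not interpret curly quotes as quotation marks (see line 2), so if your text contains them, you should .replace() them with straight quotes before feeding each line to shlex.
- Note 2: this replaces every occurrence of a quotation, but if you only want to replace the first one and keep the rest, you can do this (pretty sure it could be written better, but take it as a proof of concept):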
for line in lines:
    new_line = []
    quote_count = 0
    for word in shlex.split(line, posix=False):
        if word[0] == '"' and word[-1] == '"':
            if quote_count < 1:
                quote_count += 1
                new_line.append('"<quote>"')
            else:
                new_line.append(word)
        else:
            new_line.append(word)
    print(' '.join(new_line))
Output:
Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "<quote>" What does the word "courtship" mean?
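The two notes can be combined into one function. This is a minimal sketch, not the answerer's code: mask_with_shlex, limit and placeholder are hypothetical names, and mapping “ and ” to a straight " is an assumption about which curly quotes occur in the data.

```python
import shlex

def mask_with_shlex(line, limit=None, placeholder='"<quote>"'):
    """Mask quoted tokens; limit=None masks all, limit=1 only the first."""
    # Note 1: convert curly double quotes so shlex treats them as quotes.
    normalized = line.replace("“", '"').replace("”", '"')
    out = []
    masked = 0
    for tok in shlex.split(normalized, posix=False):
        # With posix=False a quoted span comes back as one token that
        # still carries its surrounding quotation marks.
        is_quoted = len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"'
        if is_quoted and (limit is None or masked < limit):
            out.append(placeholder)
            masked += 1
        else:
            out.append(tok)
    return " ".join(out)

print(mask_with_shlex('Why did the queen most likely say “I would have tea instead.”'))
# Why did the queen most likely say "<quote>"
```

Calling it with limit=1 reproduces the second output listing above. Note that shlex.split raises ValueError when a line contains an unclosed quotation mark, so on real data the call may need a try/except with a fallback.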