Asked by: Safraz   Asked: 8/29/2021   Last edited by: Mark Tolonen, Safraz   Updated: 8/30/2021   Views: 350
Removing quotes and double quotes from a list of words
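Q:
This is my first question on this site, so please forgive any formatting or language mistakes. The question is based on Allen Downey's book Think Python. The exercise is to write a Python program that reads a book in text format and removes all whitespace, such as spaces and tabs, as well as punctuation and other symbols. I have tried many different ways to remove the punctuation, but the single and double quotes are never removed. They stubbornly stay put. Below is the last version of the code I tried.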
import string

def del_punctuation(item):
    '''
    This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    return item

def break_into_words(filename):
    '''
    This function reads file, breaks it into
    a list of used words in lower case.
    '''
    book = open(filename)
    words_list = []
    for line in book:
        for item in line.split():
            item = del_punctuation(item)
            item = item.lower()
            #print(item)
            words_list.append(item)
    return words_list

print(break_into_words('input.txt'))
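I have not included the code that removes whitespace, because it works perfectly. I have included only the code for removing punctuation. All punctuation is removed except the single and double quotes. Please help me find the error in the code. Or is it something to do with my IDE or compiler? Thanks in advance.
input.txt: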
“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”
“What is his name?”
“Bingley.”
“Is he married or single?”
“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”
“How so? how can it affect them?”
“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”
“Is that his design in settling here?”
The output I get is copied below:
['“why', 'my', 'dear', 'you', 'must', 'know', 'mrs', 'long', 'says', 'that', 'netherfield', 'is', 'taken', 'by', 'a', 'young', 'man', 'of', 'large', 'fortune', 'from', 'the', 'north', 'of', 'england', 'that', 'he', 'came', 'down', 'on', 'monday', 'in', 'a', 'chaise', 'and', 'four', 'to', 'see', 'the', 'place', 'and', 'was', 'so', 'much', 'delighted', 'with', 'it', 'that', 'he', 'agreed', 'with', 'mr', 'morris', 'immediately', 'that', 'he', 'is', 'to', 'take', 'possession', 'before', 'michaelmas', 'and', 'some', 'of', 'his', 'servants', 'are', 'to', 'be', 'in', 'the', 'house', 'by', 'the', 'end', 'of', 'next', 'week”', '“what', 'is', 'his', 'name”', '“bingley”', '“is', 'he', 'married', 'or', 'single”', '“oh', 'single', 'my', 'dear', 'to', 'be', 'sure', 'a', 'single', 'man', 'of', 'large', 'fortune', 'four', 'or', 'five', 'thousand', 'a', 'year', 'what', 'a', 'fine', 'thing', 'for', 'our', 'girls”', '“how', 'so', 'how', 'can', 'it', 'affect', 'them”', '“my', 'dear', 'mr', 'bennet”', 'replied', 'his', 'wife', '“how', 'can', 'you', 'be', 'so', 'tiresome', 'you', 'must', 'know', 'that', 'i', 'am', 'thinking', 'of', 'his', 'marrying', 'one', 'of', 'them”', '“is', 'that', 'his', 'design', 'in', 'settling', 'here”']
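It removes all punctuation except the double and single quotes (I guess there are single quotes in the input). Thanks.
A:
I think your text uses the character ” as its double quote rather than ". The character ” is not in string.punctuation, so it never gets removed. It may be best to change your del_punctuation function slightly: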
def del_punctuation(item):
    '''
    This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    item = item.replace('”','')
    item = item.replace('“','')
    return item
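A quick way to check this diagnosis is to test membership in string.punctuation directly. The sketch below also shows one possible alternative to chained replace calls, using str.translate. The names EXTRA_QUOTES and strip_punctuation are made up for illustration, and the set of extra characters is an assumption that covers only the four common curly quotes, not every quote mark a real book might contain.

```python
import string

# string.punctuation holds ASCII punctuation only, so curly quotes are absent.
print('"' in string.punctuation)   # True: straight double quote
print('“' in string.punctuation)   # False: left curly double quote
print('”' in string.punctuation)   # False: right curly double quote

# Hypothetical constant: extra characters to strip (an assumption, not a full list).
EXTRA_QUOTES = '“”‘’'

# Translation table that maps every unwanted character to None (delete it).
_TABLE = str.maketrans('', '', string.punctuation + EXTRA_QUOTES)

def strip_punctuation(word):
    """Return word with ASCII punctuation and curly quotes removed."""
    return word.translate(_TABLE)

print(strip_punctuation('“Why,'))
print(strip_punctuation('week.”'))
```

Building the table once and calling translate avoids rescanning the word for each punctuation character, but the result is only as complete as the characters listed.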
Comments
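” is the so-called closing quote, but there are about as many opening quotes “ (the difference is subtle). So you need to add at least one more replace line. And that is only the beginning of the real problem. Real text may contain far too many tricky symbols: n-dash –, m-dash —, more than ten different quote marks " ' ‘ ’ “ ” « » ‹ › and so on.
There is no point in trying to enumerate every possible punctuation mark. The common approach is to keep only letters (and spaces). The simplest way to do that is with a regular expression: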
import re
text = '''“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”
“What is his name?”
“Bingley.”
“Is he married or single?”
“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”
“How so? how can it affect them?”
“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”
“Is that his design in settling here?”'''
# remove everything except letters, spaces, \n and, for example, dashes
text = re.sub(r"[^A-Za-z \n-]", "", text)
# split the text by spaces and \n
output = text.split()
print(output)
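The same idea of not enumerating marks by hand can also be sketched without a regular expression, by asking the Unicode database which characters count as punctuation. This is only an illustration: the function name strip_unicode_punctuation and the sample sentence are made up here, and the approach assumes that "punctuation" means any character whose general category starts with P.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category is punctuation (P*)."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )

# Made-up sample mixing curly quotes, guillemets, an en dash and ASCII marks.
sample = '“My dear Mr. Bennet,” replied his wife – «how» can you?'
print(strip_unicode_punctuation(sample).split())
```

Unlike the letters-only regular expression, this keeps digits and accented letters. It also keeps symbols such as $ or +, which belong to the S categories rather than P, so those would need separate handling if they must go too.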
But in reality, things are much more complicated than they look at first glance. Say, is I'm two words? Probably. What about someone's? Or rock'n'roll?
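One possible way to handle such words, shown here only as a sketch, is to match words directly instead of deleting characters, and to allow an apostrophe when it sits between letters. The names WORD_RE and extract_words are hypothetical, and the pattern assumes English text with ASCII letters only. Whether I'm should count as one word or two remains a judgment call that this code does not settle.

```python
import re

# Runs of ASCII letters, optionally joined by a single apostrophe
# (straight or curly) that has letters on both sides.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def extract_words(text):
    """Return lowercase words, keeping apostrophes that sit between letters."""
    return [w.lower() for w in WORD_RE.findall(text)]

print(extract_words("“I'm sure it's someone's rock'n'roll,” he said."))
```

Because an apostrophe is kept only between letters, quote marks wrapped around a word are still dropped.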
Comments
string.punctuation
print()
with open(...) as
open