Removing quotes and double quotes from a list of words

Asked by: Safraz  Asked: 8/29/2021  Last edited by: Mark Tolonen  Updated: 8/30/2021  Views: 350

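Q:

This is my first question on this site, so please forgive any formatting or language errors. The question is based on the book "Think Python" by Allen Downey. The exercise is to write a Python program that reads a book in text format and removes all whitespace (spaces, tabs), punctuation, and other symbols. I tried many different ways to remove the punctuation, but none of them removes the quotes and double quotes. They persistently stay. I'll paste the last code I tried.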

import string

def del_punctuation(item):
    '''
        This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')
    return item

def break_into_words(filename):
    '''
        This function reads file, breaks it into 
        a list of used words in lower case.
    '''
    book = open(filename)
    words_list = []
    for line in book:
        for item in line.split():
            item = del_punctuation(item)
            item=item.lower()
            #print(item)
            words_list.append(item)
    return words_list

print(break_into_words('input.txt'))

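I have not included the code that removes whitespace, because it works perfectly. I have only included the code for removing punctuation. All punctuation is removed except quotes and double quotes. Please help me find the error in the code. Or does it have something to do with my IDE or compiler? Thanks in advance.

input.txt: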

“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”

“What is his name?”

“Bingley.”

“Is he married or single?”

“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”

“How so? how can it affect them?”

“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”

“Is that his design in settling here?”

The output I got is copied below:

['“why', 'my', 'dear', 'you', 'must', 'know', 'mrs', 'long', 'says', 'that', 'netherfield', 'is', 'taken', 'by', 'a', 'young', 'man', 'of', 'large', 'fortune', 'from', 'the', 'north', 'of', 'england', 'that', 'he', 'came', 'down', 'on', 'monday', 'in', 'a', 'chaise', 'and', 'four', 'to', 'see', 'the', 'place', 'and', 'was', 'so', 'much', 'delighted', 'with', 'it', 'that', 'he', 'agreed', 'with', 'mr', 'morris', 'immediately', 'that', 'he', 'is', 'to', 'take', 'possession', 'before', 'michaelmas', 'and', 'some', 'of', 'his', 'servants', 'are', 'to', 'be', 'in', 'the', 'house', 'by', 'the', 'end', 'of', 'next', 'week”', '“what', 'is', 'his', 'name”', '“bingley”', '“is', 'he', 'married', 'or', 'single”', '“oh', 'single', 'my', 'dear', 'to', 'be', 'sure', 'a', 'single', 'man', 'of', 'large', 'fortune', 'four', 'or', 'five', 'thousand', 'a', 'year', 'what', 'a', 'fine', 'thing', 'for', 'our', 'girls”', '“how', 'so', 'how', 'can', 'it', 'affect', 'them”', '“my', 'dear', 'mr', 'bennet”', 'replied', 'his', 'wife', '“how', 'can', 'you', 'be', 'so', 'tiresome', 'you', 'must', 'know', 'that', 'i', 'am', 'thinking', 'of', 'his', 'marrying', 'one', 'of', 'them”', '“is', 'that', 'his', 'design', 'in', 'settling', 'here”']

It removes all punctuation except the double quotes and single quotes (I guess there are single quotes in the input). Thanks

python string debugging replace quotes

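Comments

1 upvote Finomnis 8/29/2021
Welcome to stackoverflow! Your example is already very small, which is good, but it still lacks sample input and the expected and actual output. Without those it is hard to help you precisely, because we have to guess what is supposed to happen. For more information, please read the page on minimal reproducible examples.
1 upvote dcbaker 8/29/2021
Does the input text you are using contain "smart quotes"? Those angled quotes are not in string.punctuation. They are the angled quotes that word processors tend to insert.
0 upvotes Patrick Artner 8/29/2021
If you "debug" your code and "inspect" it, your IDE will always show a " or ' at the start and end of a string to make clear that it is a string. Are those the quotes you mean? print() your items and check whether you see them in the console output.
0 upvotes Finomnis 8/29/2021
Also, please use with open(...) as instead of open.
0 upvotes Safraz 8/29/2021
Hello, thank you very much for your suggestions and comments. I have made many improvements and changes to my question. Please help me.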
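
The smart-quote hypothesis raised in the comments can be checked directly. The following is a minimal sketch, assuming the file uses the Unicode curly quotes seen in the sample input; the helper name describe_leftovers is made up for illustration and is not part of the original post.

```python
import string
import unicodedata

def describe_leftovers(word):
    """Return (char, Unicode name) for each non-alphanumeric
    character that string.punctuation does not cover."""
    return [
        (c, unicodedata.name(c, "UNKNOWN"))
        for c in word
        if not c.isalnum() and c not in string.punctuation
    ]

print('"' in string.punctuation)    # True: the ASCII double quote is listed
print('“' in string.punctuation)    # False: the curly quote is not ASCII
print(describe_leftovers('“Why,'))  # [('“', 'LEFT DOUBLE QUOTATION MARK')]
```

Running a helper like this over a few words from the file shows which characters survive the string.punctuation filter and what they are called.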

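A:

1 upvote shakiba.mrd 8/29/2021 #1

I think your text uses the character ” as its double quote instead of ". The character ” is not in string.punctuation, so it is never removed. It may be better to change your del_punctuation function slightly: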

import string

def del_punctuation(item):
    '''
        This function deletes punctuation from a word.
    '''
    punctuation = string.punctuation
    for c in item:
        if c in punctuation:
            item = item.replace(c, '')

    # Curly quotes are not in string.punctuation, so remove them explicitly.
    item = item.replace('”','')
    item = item.replace('“','')
    return item
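
The same idea can be written with str.translate, which deletes every listed character in one pass. This is only a sketch of an alternative, not the answerer's code; EXTRA_QUOTES and del_punctuation_translate are hypothetical names, and the set of extra characters is an assumption that covers just the four most common curly quotes.

```python
import string

# Assumption: these four curly quotes are the only non-ASCII
# punctuation in the text. Extend the string if more turn up.
EXTRA_QUOTES = "“”‘’"

# Translation table that maps every listed character to "deleted".
_DELETE_TABLE = str.maketrans("", "", string.punctuation + EXTRA_QUOTES)

def del_punctuation_translate(item):
    """Delete ASCII punctuation and the curly quotes in EXTRA_QUOTES."""
    return item.translate(_DELETE_TABLE)

print(del_punctuation_translate("“Bingley.”"))  # Bingley
```

The table is built once at import time, so calling the function for every word in a book does not repeat that work.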

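Comments

1 upvote Yuri Khristich 8/29/2021
” is the so-called closing quote, but there are about as many opening quotes “ (the two differ subtly). So you need to add at least one more replace line. And that is only the beginning of the real problems.
0 upvotes shakiba.mrd 8/30/2021
Thanks for your comment. You are right! I edited my post to replace both the opening and the closing quotes, but I think your answer is more general, and keeping only the letters is the better approach. @YuriKhristich
2 upvotes Yuri Khristich 8/29/2021 #2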

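Real text can contain many tricky symbols: n-dash, m-dash, more than ten different quotation marks (" ' ‘ ’ “ ” « » ‹ ›), and so on.

It makes no sense to try to enumerate every possible punctuation mark. The common approach is to keep only the letters (and spaces). The simplest way to do that is a regular expression: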

import re

text = '''“Why, my dear, you must know, Mrs. Long says that Netherfield is
taken by a young man of large fortune from the north of England;
that he came down on Monday in a chaise and four to see the
place, and was so much delighted with it that he agreed with Mr.
Morris immediately; that he is to take possession before
Michaelmas, and some of his servants are to be in the house by
the end of next week.”

“What is his name?”

“Bingley.”

“Is he married or single?”

“Oh! single, my dear, to be sure! A single man of large fortune;
four or five thousand a year. What a fine thing for our girls!”

“How so? how can it affect them?”

“My dear Mr. Bennet,” replied his wife, “how can you be so
tiresome! You must know that I am thinking of his marrying one of
them.”

“Is that his design in settling here?”'''

# remove everything except letters, spaces, \n and, for example, dashes
# (A-Za-z rather than A-z: the A-z range would also keep [ \ ] ^ _ and `)
text = re.sub(r"[^A-Za-z \n-]", "", text)

# split the text by spaces and \n
output = text.split()

print(output)
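
The ASCII-only pattern above also drops accented letters and digits. If those matter, one possible alternative (not part of the original answer) is to remove characters by Unicode category instead. The sketch below assumes that "punctuation" means every character whose general category starts with P; strip_unicode_punctuation is an illustrative name.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Remove every character whose Unicode general category
    is a punctuation class (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return "".join(
        c for c in text
        if not unicodedata.category(c).startswith("P")
    )

sample = "“My dear Mr. Bennet,” replied his wife, «café»"
print(strip_unicode_punctuation(sample).lower().split())
# ['my', 'dear', 'mr', 'bennet', 'replied', 'his', 'wife', 'café']
```

Two caveats: hyphens and dashes belong to category Pd and are removed as well, while symbols such as $ or + belong to the S categories and are kept.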

But in fact things are much more complicated than they look at first glance. Is I'm two words? Probably. What about someone's? Or rock'n'roll?
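
One way to sidestep the question is to match words instead of deleting characters. The sketch below makes an explicit design choice that the answer leaves open: an apostrophe between letters stays inside the word, so I'm, someone's and rock'n'roll each count as a single word. WORD_RE and find_words are hypothetical names used only for illustration.

```python
import re

# One or more ASCII letters, optionally followed by groups of
# (apostrophe + letters). Both ' and the curly ’ are accepted.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def find_words(text):
    """Return lowercase words, keeping apostrophes that sit between letters."""
    return [w.lower() for w in WORD_RE.findall(text)]

print(find_words("“I'm sure it is someone's rock'n'roll,” he said."))
# ["i'm", 'sure', 'it', 'is', "someone's", "rock'n'roll", 'he', 'said']
```

Quotes at the edge of a word are never matched, so opening and closing quotation marks disappear without being listed anywhere.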