Asked by: user2543622 Asked: 6/29/2022 Last edited by: user2543622 Updated: 6/30/2022 Views: 43
pandas replace abbreviations from ndarray
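Q:
I have a numpy.ndarray, and I want to replace all the abbreviations in it using the dictionary below. How can I do this so that the output has the same format as the input? This is what I am currently doing: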
import numpy as np
import pandas as pd
X_trying=np.array([[" And my account number His Okay It is Arrow My name with a K Last name Is and another phone numbers That's okay it's just number Yes <unk> at Gmail Dot com that is a lower "],
["Hi Amber I'm relocating so I need a insurance card for my car First name <unk> last name is D key No for brand new isn't"]],
dtype='<U97064')
X_trying
#notice double quotes
array([[" And my account number His Okay It is Arrow My name with a K Last name Is and another phone numbers That's okay it's just number Yes <unk> at Gmail Dot com that is a lower "],
["Hi Amber I'm relocating so I need a insurance card for my car First name <unk> last name is D key No for brand new isn't"]],
dtype='<U97064')
df_for_abbreviations = pd.DataFrame(X_trying, columns = ['text'])#converting to a dataframe
df_for_abbreviations['text_lower']=df_for_abbreviations['text'].apply(lambda x:x.lower())#converting to lowercase so it works with dictionary
df_for_abbreviations["unabbreviated_text"] = df_for_abbreviations["text_lower"].replace(abbreviations_master, regex=True)
#then when I convert back to an ndarray the format gets screwed up - the quotes change from double to single, and it causes problems in downstream code
x=df_for_abbreviations['unabbreviated_text'].to_numpy(dtype='<U97064').reshape(df_for_abbreviations.shape[0],1)
x
#notice the quotes changed to single quotes
array([[' and my account number his okay it is arrow my name with a k last name is and another phone numbers that is okay it is just number yes <unk> at gmail dot com that is a lower '],
['hi amber i am relocating so i need a insurance card for my car first name <unk> last name is d key no for brand new is not']],
dtype='<U97064')
The single quotes affect the downstream output.
This is the dictionary I want to use for the replacement:
abbreviations_master={}
abbreviations_master["i'm"]="i am"
abbreviations_master["it's"]="it is"
abbreviations_master["that's"]="that is"
abbreviations_master["don't"]="do not"
abbreviations_master["i'll"]="i will"
abbreviations_master["i've"]="i have"
abbreviations_master["we're"]="we are"
abbreviations_master["didn't"]="did not"
abbreviations_master["ma'am"]="madam"
abbreviations_master["you're"]="you are"
abbreviations_master["there's"]="there is "
abbreviations_master["let's"]="let us"
abbreviations_master["they're"]="they are"
abbreviations_master["can't"]="can not"
abbreviations_master["he's"]="he is"
abbreviations_master["doesn't"]="does not"
abbreviations_master["she's"]="she is"
abbreviations_master["what's"]="what is"
abbreviations_master["i'd"]="I would "
abbreviations_master["haven't"]="have not"
abbreviations_master["wasn't"]="was not"
abbreviations_master["we'll"]="we will"
abbreviations_master["won't"]="will not"
abbreviations_master["it'll"]="it will"
abbreviations_master["we've"]="we have"
abbreviations_master["wouldn't"]="would not"
abbreviations_master["that'd"]="that would "
abbreviations_master["you've"]="you have"
abbreviations_master["couldn't"]="could not"
abbreviations_master["that'll"]="that will"
abbreviations_master["y'all"]="you all"
abbreviations_master["isn't"]="is not"
abbreviations_master["it'd"]="it would"
abbreviations_master["would've"]="would have"
abbreviations_master["'cause"]="because"
abbreviations_master["hasn't"]="has not"
abbreviations_master["they've"]="they have"
abbreviations_master["you'll"]="you will"
abbreviations_master["here's"]="here is"
abbreviations_master["name's"]="name is"
abbreviations_master["shouldn't"]="should not"
abbreviations_master["wife's"]="?"
abbreviations_master["driver's"]="?"
abbreviations_master["they'll"]="they will"
abbreviations_master["everything's"]="?"
abbreviations_master["husband's"]="?"
abbreviations_master["there'll"]="there will"
abbreviations_master["should've"]="should have"
abbreviations_master["we'd"]="we would"
abbreviations_master["'bout"]="about"
abbreviations_master["she'll"]="she will"
abbreviations_master["he'll"]="he will"
abbreviations_master["you'd"]="you would"
abbreviations_master["one's"]="?"
abbreviations_master["who's"]="who has"
abbreviations_master["weren't"]="were not"
abbreviations_master["aren't"]="are not"
abbreviations_master["how's"]="how is"
abbreviations_master["how're"]="how are"
abbreviations_master["hadn't"]="had not"
A:
1 upvote
nonDucor
6/30/2022
#1
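You can use re.split to break the input into words while keeping the separators (since some of your examples start with a space), then check whether each word is in the dictionary; if it is not, just keep the word. The code below is not very elegant because your input is an np.array. If you can make it a simple list of strings, the code can be simplified.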
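import re
import numpy as np

output_array = []
for input_line in X_trying:
    # Split on spaces but keep them (capturing group), so joining restores the spacing.
    words = re.split('( )', str(input_line[0]).lower())
    # Replace each word found in the dictionary; leave everything else unchanged.
    output_array.append([''.join(abbreviations_master.get(word, word) for word in words)])
output_array = np.array(output_array, dtype='<U97064')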
The output has the same format as the input:
array([[' and my account number his okay it is arrow my name with a k last name is and another phone numbers that is okay it is just number yes <unk> at gmail dot com that is a lower '],
['hi amber i am relocating so i need a insurance card for my car first name <unk> last name is d key no for brand new is not']],
dtype='<U97064')
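As a rough sketch of the simplification mentioned above, assuming the input can be held as a plain list of strings: the names `lines`, `abbreviations`, and `expand_line` are hypothetical, and the three-entry dictionary is only an excerpt standing in for `abbreviations_master`.

```python
import re

# Hypothetical stand-ins for the question's data: a short list of strings
# and a three-entry excerpt of abbreviations_master.
lines = [" That's okay it's just number ", "Hi Amber I'm relocating"]
abbreviations = {"i'm": "i am", "it's": "it is", "that's": "that is"}

def expand_line(line, table):
    # Split on single spaces but keep them, look each piece up, and rejoin.
    pieces = re.split('( )', line.lower())
    return ''.join(table.get(piece, piece) for piece in pieces)

expanded = [expand_line(line, abbreviations) for line in lines]
print(expanded)
```

If the downstream code still needs the column-shaped array, something like `np.array(expanded).reshape(-1, 1)` should give that shape back, and `X_trying[:, 0].tolist()` is one way to get the list of strings in the first place.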
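Note that the () in the split pattern is important. If there are more separators, you can add them to the pattern, as in re.split('( |\.|,)', ...). But your example does not have any other punctuation, so I did not add it.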
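To illustrate why the capturing parentheses and the extra separators matter, here is a small sketch. The sample sentence, the two-entry `table`, and the `expand` helper are made up for the demonstration and are not part of the original answer.

```python
import re

# Hypothetical two-entry table, standing in for abbreviations_master.
table = {"that's": "that is", "it's": "it is"}

def expand(text, pattern):
    # The capturing group makes re.split return the separators too,
    # so joining the pieces preserves the original spacing and punctuation.
    pieces = re.split(pattern, text.lower())
    return ''.join(table.get(piece, piece) for piece in pieces)

text = " That's okay, it's."

# Splitting on spaces only: "it's." stays glued to the period and is not found.
space_only = expand(text, r'( )')
# Splitting on spaces, periods, and commas: "it's" is isolated and replaced.
with_punct = expand(text, r'( |\.|,)')

print(repr(space_only))
print(repr(with_punct))
```

With the space-only pattern, any word followed directly by punctuation is looked up together with that punctuation and so misses the dictionary; adding the punctuation characters to the group avoids this while still keeping them in the output.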
Comments
0 upvotes
user2543622
6/30/2022
Why single quotes instead of double quotes?
1 upvote
nonDucor
6/30/2022
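@user2543622, it is only a difference in how the value is displayed. Single and double quotes are both syntactically valid and mean the same thing. Python usually prefers single quotes for display, but if the string itself contains a single quote (as in your first example), it displays the string enclosed in double quotes.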
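A quick way to check this, using two made-up strings rather than the full array:

```python
with_apostrophe = "it's just number"
without_apostrophe = "it is just number"

# repr() picks the quote character that avoids escaping the contents.
print(repr(with_apostrophe))     # displayed with double quotes
print(repr(without_apostrophe))  # displayed with single quotes

# The quotes are not part of the string; both literal styles give equal values.
same_value = ("abc" == 'abc')
print(same_value)
```

Once the apostrophes are expanded away there is nothing left for Python to escape, so the display falls back to single quotes while the stored text is unaffected.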