Parse DataFrame of postal addresses - remove country and unit number

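Question:

I have a dataframe with a column of postal addresses (generated with geopy.geocoders GoogleV3; I used it to parse my dataframe). However, the output of geolocator.geocode includes the country name, which I don't want. It also contains the unit number, which I don't want either.

How can I remove them?

I have tried: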

test_add['clean address'] = test_add.apply(lambda x: x['clean address'][:-5], axis = 1)

and

def remove_units(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)

test_add['parsed addresses'] = test_add['clean address'].apply(remove_units)

It works on:

data = ["941 Thorpe St, Rock Springs, WY 82901, USA",
    "2809 Harris Dr, Antioch, CA 94509, USA",
    "7 Eucalyptus, Newport Coast, CA 92657, USA",
    "725 Mountain View St, Altadena, CA 91001, USA",
    "1966 Clinton Ave #234, Calexico, CA 92231, USA",
    "431 6th St, West Sacramento, CA 95605, USA",
    "5574 Old Goodrich Rd, Clarence, NY 14031, USA",
    "Valencia Way #1234, Valley Center, CA 92082, USA"]
test_df = pd.DataFrame(data, columns=['parsed addresses'])

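But when I use a larger dataframe with 150k such addresses, I get an error: "AttributeError: 'float' object has no attribute 'split'".

Ultimately, I only need the street number, street name, city, state, and zip code.

python pandas string-parsing geopy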

def parse_address(address: str) -> str:
    # Drop the final comma-separated entry, assumed to be the country
    address_without_country = " ".join(address.split(",")[:-1])

    # Drop whitespace-separated tokens that start with "#" (unit numbers)
    return " ".join(x for x in address_without_country.split()
                    if not x.startswith("#"))

def main():
    ...
    parsed_addresses = []
    for address in raw_addresses:
        # Cast to str so non-string entries (e.g. NaN floats) do not raise;
        # alternatively, catch the exception for non-string values
        parsed_addresses.append(parse_address(str(address)))
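
The AttributeError in the question most likely comes from missing values: pandas stores a missing entry in a text column as a float NaN, and a float has no .split method. The following is a minimal sketch of guarding against that inside the parsing function, assuming missing addresses should stay missing rather than be turned into text. The helper name parse_address_or_na and the two-row sample are hypothetical, not part of the original post.

```python
import numpy as np
import pandas as pd


def parse_address_or_na(address):
    """Drop the trailing country and any '#unit' token; pass missing values through."""
    # NaN/None are not strings, so calling .split on them would raise AttributeError.
    if not isinstance(address, str):
        return np.nan
    # Keep everything before the last comma; the last entry is assumed to be the country.
    without_country = address.rsplit(",", 1)[0]
    kept = []
    for token in without_country.split():
        if token.startswith("#"):
            # Preserve the comma that was attached to the unit token, e.g. "#234,".
            if token.endswith(",") and kept:
                kept[-1] += ","
            continue
        kept.append(token)
    return " ".join(kept)


# Hypothetical sample: one address with a unit number, one missing value.
sample = pd.Series(["1966 Clinton Ave #234, Calexico, CA 92231, USA", np.nan])
parsed = sample.apply(parse_address_or_na)
print(parsed.tolist())
```

Unlike casting with str(), this keeps NaN as NaN, so missing rows can still be found later with isna().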

1 upvote PaulS 8/16/2023 #2

Another possible solution:

test_df['parsed addresses'].str.replace(r',\D+$|\s#\d+', '', regex=True)

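Explanation

\D means a non-digit character.
\D+ means one or more non-digit characters.
$ means the end of the string.
| means logical OR.
\s means a whitespace character.
\d+ means one or more digit characters.

For a more comprehensive treatment of regular expressions, see the Regular Expression HOWTO.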
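
To see how the two alternatives of the pattern behave, here is a small sketch using Python's standard re module on one of the sample addresses from the question. It only illustrates the pattern above and is not a general address parser.

```python
import re

# Same pattern as above: a trailing ", <non-digits>" OR a " #<digits>" unit number.
pattern = re.compile(r",\D+$|\s#\d+")

address = "1966 Clinton Ave #234, Calexico, CA 92231, USA"

# The two pieces the pattern matches in this address.
matches = pattern.findall(address)
print(matches)  # [' #234', ', USA']

# Removing both pieces leaves street, city, state and zip code.
cleaned = pattern.sub("", address)
print(cleaned)  # 1966 Clinton Ave, Calexico, CA 92231
```

Note that \s#\d+ only removes the digits of a unit number, so a unit such as "#12B" would leave the trailing letter behind, and ",\D+$" only strips a final entry that contains no digits.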

Output:

0       941 Thorpe St, Rock Springs, WY 82901
1           2809 Harris Dr, Antioch, CA 94509
2       7 Eucalyptus, Newport Coast, CA 92657
3    725 Mountain View St, Altadena, CA 91001
4        1966 Clinton Ave, Calexico, CA 92231
5       431 6th St, West Sacramento, CA 95605
6    5574 Old Goodrich Rd, Clarence, NY 14031
7       Valencia Way, Valley Center, CA 92082
Name: parsed addresses, dtype: object
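
For the larger dataframe mentioned in the question, the .str accessor has the advantage that it skips missing values instead of raising. A minimal sketch, assuming the failing rows are NaN; the two-row frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one real address and one missing value (a float NaN).
df = pd.DataFrame(
    {"parsed addresses": ["Valencia Way #1234, Valley Center, CA 92082, USA", np.nan]}
)

# The .str methods leave NaN untouched, so no AttributeError is raised.
cleaned = df["parsed addresses"].str.replace(r",\D+$|\s#\d+", "", regex=True)
print(cleaned.tolist())
```

The missing row stays NaN in the result, so it can be dropped or filled afterwards as needed.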
