Literal match of strings in dataframe to other dataframe with multiple match options

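Question:

I have a dataframe (df) with values in a 'country' column that I want to standardize using another dataframe called 'country_codes'. A value in df may match any of the items in 'country_codes', but the resulting dataframe should contain the corresponding country_codes['country'] value, which is the standard value.

The code mostly works and does return the standard country values, but the regex does not match on the exact string. It matches on too little of the string (in this example: 'Example1').

Bonus question: is it possible to keep the 'year' data in the final output dataframe without specifying the name 'year', since there may be multiple float columns?

The function and the desired output are shown below: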
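The partial match described here can be reproduced in isolation. The following is a minimal sketch, not the asker's function: `names`, `partial`, and `full` are names introduced for illustration, and `names` holds only a few of the values from country_codes. It contrasts `str.extract`, which searches anywhere in the string, with `str.fullmatch`, which requires the whole string to match.

```python
import re

import pandas as pd

s = pd.Series(['foobar', 'foo and bar', 'Example1 and', 'PQR'])

# Hypothetical subset of the accepted spellings from country_codes.
names = ['FooBar', 'Example1', 'Example', 'foobar', 'PQR']
pattern = '|'.join(re.escape(n) for n in names)

# Search semantics: 'Example1' is found inside 'Example1 and'.
partial = s.str.extract(rf'\b({pattern})\b', flags=re.IGNORECASE)[0]

# Whole-string semantics: the alternation is grouped so that the match
# must cover the entire value, not just one alternative inside it.
full = s.str.fullmatch(rf'(?:{pattern})', case=False)

print(partial.tolist())
print(full.tolist())
```

With whole-string matching, 'Example1 and' is no longer treated as a hit, which is the behaviour the desired output asks for.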

import re
from functools import reduce

import pandas as pd

def match_country_codes(df, country_codes):
    # Create a regex pattern to match whole words
    pattern = '|'.join(rf'\b{re.escape(c)}\b' for c in country_codes[['country', 'alpha1', 'alpha2']].values.flatten())
    # new column for matches between pattern and df['country'] items
    df['matched_country'] = df['country'].str.extract(f'({pattern})', flags=re.IGNORECASE)
    # Merge with 'country_codes' dataframe to get the full country names
    # merge over 3 frames for all columns
    df1 = df.merge(country_codes, left_on='matched_country', right_on='country', how='left')
    df2 = df.merge(country_codes, left_on='matched_country', right_on='alpha1', how='left')
    df3 = df.merge(country_codes, left_on='matched_country', right_on='alpha2', how='left')

    dataframes = [df1, df2, df3]
    # merge all dataframes together on '[['country_y']]'
    # (merge_dataframes is a helper that is not shown in the question)
    result = reduce(merge_dataframes, dataframes)
    # Drop rows with None or NaN values in the 'country_y' column
    result = result.dropna(subset=['country_y'])
    return result

Example dataframes:

df = pd.DataFrame({'country': ['foobar', 'foo and bar', 'Example1 and', 'PQR'],
                   'year': [2018, 2019, 'NA', 2017]})
country_codes = pd.DataFrame({'country': ['FooBar', 'Example1', 'foo and bar and foo', 'Example'],
                              'alpha1': ['foobar', 'Bosnia', 'ABC', 'DEF'],
                              'alpha2': ['GHI', 'JKL', 'MNO', 'PQR']})

Output:

result = match_country_codes(df, country_codes)
result

Desired output:

data = {'country_y': ['FooBar', 'Example']}

index_values = [0, 3]

desired_output = pd.DataFrame(data, index=index_values)
desired_output

Thanks

python pandas regex dataframe

Answer:

df['country_y'] = (df
    .join(country_codes.set_index('country', drop=False), on='country', rsuffix='_1')
    .join(country_codes.set_index('alpha1'), on='country', rsuffix='_2')
    .join(country_codes.set_index('alpha2'), on='country', rsuffix='_3')
    [['country_1', 'country_2', 'country_3']]
    # take the first non-null standard name in each row
    # (groupby(..., axis=1).first() does the same but is deprecated in recent pandas)
    .bfill(axis=1)
    .iloc[:, 0]
)
df = df.dropna(subset=['country_y'])

For this sample data:

data = {
  'country': [None, 'foobar', 'foo and bar', 'Example1 and', 'PQR', 'Example', 'Bosnia', None, 'JKL', 'foobar'],
  'year': [None, None, 2018, 2019, 2017, 2020, 2017, None, 2019, 2019]
}
df = pd.DataFrame(data)

Input:

        country    year
0          None     NaN
1        foobar     NaN
2   foo and bar  2018.0
3  Example1 and  2019.0
4           PQR  2017.0
5       Example  2020.0
6        Bosnia  2017.0
7          None     NaN
8           JKL  2019.0
9        foobar  2019.0

The output will be:

   country    year country_y
1   foobar     NaN    FooBar
4      PQR  2017.0   Example
5  Example  2020.0   Example
6   Bosnia  2017.0  Example1
8      JKL  2019.0  Example1
9   foobar  2019.0    FooBar
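The same exact-match idea can also be written as a single lookup instead of three joins. This is a sketch under stated assumptions, not the answerer's code: it assumes exact, case-sensitive matches are wanted and that each spelling maps to one standard name, and `lookup`, `standard`, and `variant` are names introduced here for illustration. Because only a new column is added, 'year' and any other columns are kept without being named.

```python
import pandas as pd

country_codes = pd.DataFrame({
    'country': ['FooBar', 'Example1', 'foo and bar and foo', 'Example'],
    'alpha1': ['foobar', 'Bosnia', 'ABC', 'DEF'],
    'alpha2': ['GHI', 'JKL', 'MNO', 'PQR'],
})
df = pd.DataFrame({
    'country': ['foobar', 'foo and bar', 'Example1 and', 'PQR'],
    'year': [2018, 2019, 'NA', 2017],
})

# Long-form lookup: every accepted spelling -> standard country name.
lookup = (
    country_codes
    .assign(standard=country_codes['country'])
    .melt(id_vars='standard', value_name='variant')
    .drop_duplicates(subset='variant')  # map() needs a unique index
    .set_index('variant')['standard']
)

# Exact match on the whole string; unmatched rows become NaN and are dropped.
result = (
    df.assign(country_y=df['country'].map(lookup))
      .dropna(subset=['country_y'])
)
print(result)
```

On the question's sample frames this keeps rows 0 and 3 with 'FooBar' and 'Example', matching the desired output.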

Thanks, but the code stops working if there are 'NA' values, even if they are dropped.

df = pd.DataFrame({'country': [np.nan, np.nan, 'foobar', 'foo and bar', 'Example1 and', 'PQR'],
                   'year': [np.nan, np.nan, 2018, 2019, 'NA', 2017]})
country_codes = pd.DataFrame({'country': ['FooBar', 'Example1', 'foo and bar and foo', 'Example'],
                              'alpha1': ['foobar', 'Bosnia', 'ABC', 'DEF'],
                              'alpha2': ['GHI', 'JKL', 'MNO', 'PQR']})
df = df.dropna()

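0 upvotes Benjamin Allen 9/30/2023

It works now: first drop the NA values and reset the index in df['country']. Thanks!

0 upvotes Nick 9/30/2023

@BenjaminAllen Yes, I'm looking at it right now. It's an index-ordering issue that comes up in join, which doesn't seem to work the way the documentation says. Your suggestion is definitely a workaround, but I'll still try to figure out why this happens.

0 upvotes Nick 9/30/2023

@BenjaminAllen It's late, so I'll take another look in the morning. I've noted the issue in the answer for now.

0 upvotes Nick 10/1/2023

@BenjaminAllen The problem is not the NaN values but country. If you drop_duplicates first, the code works as expected. See my edit.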
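The workaround from the comments (drop the NA rows, then reset the index) can be sketched as follows. This is only an illustration of that cleanup step on the commenter's sample frame; `cleaned` is a name introduced here, and since the thread does not show where drop_duplicates is applied, that step is left out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': [np.nan, np.nan, 'foobar', 'foo and bar', 'Example1 and', 'PQR'],
    'year': [np.nan, np.nan, 2018, 2019, 'NA', 2017],
})

# Drop rows without a country, then renumber so the index runs 0..n-1 again.
# Assigning a joined result back with df['country_y'] = ... aligns on this index.
cleaned = df.dropna(subset=['country']).reset_index(drop=True)
print(cleaned)
```

Note that the string 'NA' in 'year' is not a missing value to pandas, so that row is kept.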
