How to replace a substring in one column with the value from another column, vectorized - pandas

Asked by: urosdigital  Asked: 11/9/2023  Last edited by: urosdigital  Updated: 11/9/2023  Views: 19

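Q:

I have a DataFrame in which one column (XML) holds XML strings and another column (ICCID) holds the value that should replace a substring in the XML column of the same row. I would like to do this in a vectorized way if possible, so I tried the following code: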

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)

columns = ['MSISDN', 'XML', 'ICCID', 'IMSI']
data = [['0123456789', '<subscriberInfo><msisdn>0123456789</msisdn><iccId>12345678998765432100</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410123456789123456, 112233445566778],
['9876543210', '<subscriberInfo><msisdn>9876543210</msisdn><iccId>98765432112365478900</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410987654321987456, 228024357302211],
['0123987456', '<subscriberInfo><msisdn>0123987456</msisdn><iccId>98765432198765432100</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410987654321098765, 228024357302212]]

df = pd.DataFrame(data=data, columns=columns)
df['NEW_XML'] = df['XML'].replace(to_replace=[r'<iccId>\d{20}</iccId>'], value=[fr'<iccId>{df["ICCID"]}</iccId>'], regex=True)

This approach does not work: the target substring in each XML string is replaced with the pandas string representation of the whole Series, like this:

0 89410123456789123456\n1 89410987654321987456\n2 89410987654321098765\nName: ICCID, dtype: object
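
The cause can be reproduced in isolation. The following is a minimal sketch, assuming a small two-row Series named `s` (a hypothetical stand-in for `df["ICCID"]`): the f-string is evaluated once, before `replace` is called, so it formats the entire Series instead of one value per row.

```python
import pandas as pd

# Hypothetical two-row stand-in for df["ICCID"]
s = pd.Series(["89410123456789123456", "89410987654321987456"], name="ICCID")

# The f-string is built a single time, so {s} becomes str(s):
# the printed form of the whole Series, index and footer included.
replacement = f"<iccId>{s}</iccId>"
print(replacement)
```

Because `replace` only ever receives this single string, every row gets the same multi-line text.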

The expected result for the NEW_XML column looks like this:

NEW_XML
'<subscriberInfo><msisdn>0123456789</msisdn><iccId>89410123456789123456</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
'<subscriberInfo><msisdn>9876543210</msisdn><iccId>89410987654321987456</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
'<subscriberInfo><msisdn>0123987456</msisdn><iccId>89410987654321098765</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
pandas dataframe replace substring

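A:

0 votes  Corralien  11/9/2023  #1

You can use str.extract to split each XML string into the part before the ICCID, the ICCID itself, and the part after it, then swap in the values from the ICCID column and join the pieces back together: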

pat = r'(?P<before>.*<iccId>)(?P<iccid>\d{20})(?P<after></iccId>.*)'
xml = df['XML'].str.extract(pat).assign(iccid=df['ICCID'].astype(str))
df['NEW_XML'] = xml['before'] + xml['iccid'] + xml['after']

Output:

>>> df['NEW_XML']
0    <subscriberInfo><msisdn>0123456789</msisdn><iccId>89410123456789123456</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
1    <subscriberInfo><msisdn>9876543210</msisdn><iccId>89410987654321987456</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
2    <subscriberInfo><msisdn>0123987456</msisdn><iccId>89410987654321098765</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
Name: NEW_XML, dtype: object

Details:

>>> xml
                                               before                 iccid                                                                                                                             after
0  <subscriberInfo><msisdn>0123456789</msisdn><iccId>  89410123456789123456  </iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
1  <subscriberInfo><msisdn>9876543210</msisdn><iccId>  89410987654321987456  </iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
2  <subscriberInfo><msisdn>0123987456</msisdn><iccId>  89410987654321098765  </iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
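
If an explicit per-row substitution is acceptable, a different technique is a list comprehension with `re.sub`. This is only a sketch, not the method from the answer above; the shortened XML strings and the name `pattern` are assumptions made for illustration.

```python
import re
import pandas as pd

# Shortened, hypothetical sample rows modelled on the question's data
df = pd.DataFrame({
    "XML": [
        "<subscriberInfo><iccId>12345678998765432100</iccId><imsi>112233445566778</imsi></subscriberInfo>",
        "<subscriberInfo><iccId>98765432112365478900</iccId><imsi>998877665544332</imsi></subscriberInfo>",
    ],
    "ICCID": [89410123456789123456, 89410987654321987456],
})

pattern = re.compile(r"<iccId>\d{20}</iccId>")

# Pair each XML string with the ICCID from the same row and substitute once.
# A function is used as the replacement so the value is inserted literally,
# without any backslash or group-reference processing.
df["NEW_XML"] = [
    pattern.sub(lambda m, v=iccid: f"<iccId>{v}</iccId>", xml, count=1)
    for xml, iccid in zip(df["XML"], df["ICCID"])
]
```

Rows whose XML does not match the pattern are left unchanged by this variant, whereas `str.extract` yields missing values for them.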

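Comments

0 votes  Corralien  11/9/2023
You can edit the pattern to cover the "IMSI" value as well. The idea is to split the XML string into as many parts as you need.
1 vote  urosdigital  11/10/2023
Hi @Corralien, thank you very much! This is exactly what I was looking for. I did this for both the ICCID and IMSI columns. My actual XML looks like <ns2:iccId>89410123456789123456</ns2:iccId>, so I adjusted the pat part: pat = r'(?P<before>.*<ns2:iccId>)(?P<iccId>\d{20})(?P<after></ns2:iccId>.*)'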
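
One way the pattern could be extended to cover both ICCID and IMSI in a single pass is sketched below. It assumes the un-prefixed tags from the question's sample data, a 15-digit IMSI, and that `<imsi>` appears after `<iccId>`; the group name `middle` and the variable `parts` are made up for illustration.

```python
import pandas as pd

# One hypothetical row modelled on the question's data
df = pd.DataFrame({
    "XML": ["<subscriberInfo><msisdn>9876543210</msisdn><iccId>98765432112365478900</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId></subscriberInfo>"],
    "ICCID": [89410987654321987456],
    "IMSI": [228024357302211],
})

# Five named groups: text before the ICCID, the ICCID, text between the two
# values, the IMSI, and the remaining text.
pat = (
    r"(?P<before>.*<iccId>)(?P<iccid>\d{20})"
    r"(?P<middle></iccId>.*<imsi>)(?P<imsi>\d{15})"
    r"(?P<after></imsi>.*)"
)

# Overwrite the two extracted value columns with the new values as strings,
# then concatenate the pieces in their original order.
parts = df["XML"].str.extract(pat).assign(
    iccid=df["ICCID"].astype(str),
    imsi=df["IMSI"].astype(str),
)
df["NEW_XML"] = (
    parts["before"] + parts["iccid"] + parts["middle"] + parts["imsi"] + parts["after"]
)
```

For namespaced XML such as the `ns2:` example in the comment above, the tag names inside the pattern would need the same prefix.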