Asked by: urosdigital  Asked: 11/9/2023  Last edited by: urosdigital  Updated: 11/9/2023  Views: 19
How to replace a substring in one column with a value from another column, vectorized - pandas
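Q:
I have a DataFrame in which one column (XML) holds XML strings and another column (ICCID) holds the value that should replace a substring in the XML column of the same row. I would like to do this in a vectorized way if possible, so I tried the following code: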
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)
columns = ['MSISDN', 'XML', 'ICCID', 'IMSI']
data = [['0123456789', '<subscriberInfo><msisdn>0123456789</msisdn><iccId>12345678998765432100</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410123456789123456, 112233445566778],
['9876543210', '<subscriberInfo><msisdn>9876543210</msisdn><iccId>98765432112365478900</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410987654321987456, 228024357302211],
['0123987456', '<subscriberInfo><msisdn>0123987456</msisdn><iccId>98765432198765432100</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410987654321098765, 228024357302212]]
df = pd.DataFrame(data=data, columns=columns)
df['NEW_XML'] = df['XML'].replace(to_replace=[r'<iccId>\d{20}</iccId>'], value=[fr'<iccId>{df["ICCID"]}</iccId>'], regex=True)
This approach does not work: the target substring in every XML string is replaced with the string representation of the entire pandas Series, like this:
0 89410123456789123456\n1 89410987654321987456\n2 89410987654321098765\nName: ICCID, dtype: object
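A minimal sketch of why this happens, using a toy Series that is an assumption for illustration only: the f-string is evaluated once by Python before `replace` is called, so the whole Series is converted with `str()` and embedded in a single replacement string that is then applied to every row.

```python
import pandas as pd

# Toy stand-in for df["ICCID"]; the values are illustrative only.
s = pd.Series(["A", "B"], name="ICCID")

# Python formats the f-string immediately, calling str() on the whole Series.
# pandas never gets a chance to substitute a per-row value.
replacement = f"<iccId>{s}</iccId>"

print(replacement)
```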
The expected result for the NEW_XML column looks like this:
NEW_XML
'<subscriberInfo><msisdn>0123456789</msisdn><iccId>89410123456789123456</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
'<subscriberInfo><msisdn>9876543210</msisdn><iccId>89410987654321987456</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
'<subscriberInfo><msisdn>0123987456</msisdn><iccId>89410987654321098765</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
A:
0 votes
Corralien
11/9/2023
#1
You can use str.extract:
pat = r'(?P<before>.*<iccId>)(?P<iccid>\d{20})(?P<after></iccId>.*)'
xml = df['XML'].str.extract(pat).assign(iccid=df['ICCID'].astype(str))
df['NEW_XML'] = xml['before'] + xml['iccid'] + xml['after']
Output:
>>> df['NEW_XML']
0 <subscriberInfo><msisdn>0123456789</msisdn><iccId>89410123456789123456</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
1 <subscriberInfo><msisdn>9876543210</msisdn><iccId>89410987654321987456</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
2 <subscriberInfo><msisdn>0123987456</msisdn><iccId>89410987654321098765</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
Name: NEW_XML, dtype: object
Details:
>>> xml
before iccid after
0 <subscriberInfo><msisdn>0123456789</msisdn><iccId> 89410123456789123456 </iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
1 <subscriberInfo><msisdn>9876543210</msisdn><iccId> 89410987654321987456 </iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
2 <subscriberInfo><msisdn>0123987456</msisdn><iccId> 89410987654321098765 </iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
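One caveat worth sketching: `str.extract` returns NaN for rows where the pattern does not match, and concatenating with NaN yields NaN, so those rows would lose their XML. The sketch below assumes you would rather keep the original XML for such rows; the shortened XML strings and the string-typed ICCID column are illustrative assumptions, not the asker's data.

```python
import pandas as pd

df = pd.DataFrame({
    "XML": [
        "<a><iccId>12345678998765432100</iccId></a>",
        "<a><iccId>too-short</iccId></a>",  # does not match \d{20}
    ],
    "ICCID": ["89410123456789123456", "89410987654321987456"],
})

pat = r"(?P<before>.*<iccId>)(?P<iccid>\d{20})(?P<after></iccId>.*)"
parts = df["XML"].str.extract(pat)        # non-matching rows are all NaN
parts["iccid"] = df["ICCID"].astype(str)  # overwrite the extracted digits

# The concatenation is NaN wherever the pattern did not match,
# so fall back to the untouched XML for those rows.
new_xml = parts["before"] + parts["iccid"] + parts["after"]
df["NEW_XML"] = new_xml.fillna(df["XML"])
```

Whether to keep, drop, or flag non-matching rows depends on the data; `fillna` is only one option.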
Comments
0 votes
Corralien
11/9/2023
You can edit the pattern to cover the "IMSI" value as well. The idea is to split the XML string into as many parts as you need.
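A possible sketch of that idea, assuming the `<iccId>` element always comes before `<imsi>` and that the IMSI is 15 digits as in the sample data. The group name `middle` and the shortened XML are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "XML": ["<s><iccId>12345678998765432100</iccId><imsi>112233445566778</imsi><p>x</p></s>"],
    "ICCID": ["89410123456789123456"],
    "IMSI": [228024357302211],
})

# Five parts: text before the ICCID, the ICCID, text between the two values,
# the IMSI, and everything after it.
pat = (
    r"(?P<before>.*<iccId>)(?P<iccid>\d{20})"
    r"(?P<middle></iccId>.*<imsi>)(?P<imsi>\d{15})"
    r"(?P<after></imsi>.*)"
)

parts = df["XML"].str.extract(pat).assign(
    iccid=df["ICCID"].astype(str),
    imsi=df["IMSI"].astype(str),
)
df["NEW_XML"] = (
    parts["before"] + parts["iccid"] + parts["middle"] + parts["imsi"] + parts["after"]
)
```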
1 vote
urosdigital
11/10/2023
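Hi @Corralien, thank you very much! This is exactly what I was looking for. I did this for both the ICCID and IMSI columns. My actual XML looks like <ns2:iccId>89410123456789123456</ns2:iccId>, so I adjusted the pat accordingly: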
pat = r'(?P<before>.*<ns2:iccId>)(?P<iccId>\d{20})(?P<after></ns2:iccId>.*)'
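If the tag name varies (for example with and without the `ns2:` prefix), the pattern could be built from the tag name. The helper below, `replace_tag_value`, is hypothetical and only a sketch: it assumes the element holds digits only, that each XML string contains the tag once, and that `xml` and `values` share the same index.

```python
import re

import pandas as pd


def replace_tag_value(xml: pd.Series, tag: str, values: pd.Series) -> pd.Series:
    """Hypothetical helper: swap the digits inside <tag>...</tag> for per-row values."""
    t = re.escape(tag)  # guard against regex metacharacters in the tag name
    pat = rf"(?P<before>.*<{t}>)(?P<value>\d+)(?P<after></{t}>.*)"
    parts = xml.str.extract(pat)
    # Rows are aligned on the index, so xml and values must be indexed alike.
    return parts["before"] + values.astype(str) + parts["after"]


xml = pd.Series(["<r><ns2:iccId>11111111111111111111</ns2:iccId></r>"])
values = pd.Series(["89410123456789123456"])
result = replace_tag_value(xml, "ns2:iccId", values)
```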