Asked by: Zakyl  Asked: 10/28/2021  Last edited by: Zakyl  Updated: 10/29/2021  Views: 527
Fill new column based on conditions defined in a string
Q:
I have conditions, defined in a string, for filling a new column.
condition_string = "colA='yes' & colB='yes' & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
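The string can be rewritten or restructured in any other format (for example a dictionary) and then fed into the code to get the final result.
The DataFrame is: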
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB03', 'AB04','AB05', 'AB06'],
        'colA': ["yes","yes",'yes',"no","no",'yes', np.nan],
        'colB': [np.nan,'yes','yes',"no",'no', np.nan, "yes"],
        'colC': ["yes",'yes', 'yes',"no", "no",np.nan,np.nan],
        'colD': ["yes",'no', 'yes',"no",np.nan,"no",np.nan],
    }
)
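How can I get the result without hard-coding the conditions from condition_string? Or is there a way to restructure condition_string and then apply it to the DataFrame?
Update: what if the dictionary looks like this?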
condition_string = "colA='yes' & (colB='yes' | colB='no)' &
(colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
and the DataFrame looks like this:
df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB03', 'AB04','AB05', 'AB06'],
        'colA': ["yes","yes",'yes',"no","no",'yes', np.nan],
        'colB': ["no",'yes','yes',"no",'no', np.nan, "yes"],
        'colC': ["yes",'yes', 'yes',"no", "no",np.nan,np.nan],
        'colD': ["yes",'no', 'yes',"no",np.nan,"no",np.nan]
    }
)
A:
1 upvote
Serge de Gosson de Varennes
10/28/2021
#1
You can use np.where:
df['results'] = np.where((((df['colA']=='yes') & (df['colB']=='yes')) & ((df['colC']=='yes') | (df['colD']=='yes'))), 'Yes',np.where(((df['colA']=='no') & (df['colB']=='no')) & ((df['colC']=='no' )| (df['colD']=='no')), 'No','UNKNOWN'))
This gives:
ID colA colB colC colD results
0 AB01 yes NaN yes yes UNKNOWN
1 AB02 yes yes yes no Yes
2 AB03 yes yes yes yes Yes
3 AB03 no no no no No
4 AB04 no no no NaN No
5 AB05 yes NaN NaN no UNKNOWN
6 AB06 NaN yes NaN NaN UNKNOWN
Comments
0 upvotes
Zakyl
10/28/2021
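This is what I want to avoid. I don't want to hard-code the contents of condition_string. I want to derive the conditions from condition_string itself, or restructure it into a dictionary that can be applied to the DataFrame.
1 upvote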
Henry Yik
10/28/2021
#2
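IIUC you want to create arbitrary conditions for your df. This can be done using functools.reduce and operator.and_. You can then use two lists (instead of a dictionary) to set the conditions, the first holding the columns and the second the strings to test, and finally np.select: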
import numpy as np
from functools import reduce
from operator import and_

cols = ["colA", "colB", ["colC", "colD"]]  # group the cols in a list if they belong to the same group
answer = ["yes", "no"]
# a single column must equal ans; within a group, any column may equal ans
# (axis is passed by keyword because recent pandas no longer accepts .any(1))
conds = [reduce(and_, [df[i].eq(ans) if isinstance(i, str) else df[i].eq(ans).any(axis=1)
                       for i in cols]) for ans in answer]
df["result"] = np.select(conds, answer, "Unknown")
print(df)
ID colA colB colC colD result
0 AB01 yes NaN yes yes Unknown
1 AB02 yes yes yes no yes
2 AB03 yes yes yes yes yes
3 AB03 no no no no no
4 AB04 no no no NaN no
5 AB05 yes NaN NaN no Unknown
6 AB06 NaN yes NaN NaN Unknown
Now you only need to edit the two lists, cols and answer, if you need to adjust the conditions.
1 upvote
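The two lists above test every column against the same value, so they cannot express the updated condition from the question, where colB may be 'yes' or 'no'. One possible restructuring is a dictionary of rules evaluated row by row. This is only a sketch: the RULES layout and the classify helper are hypothetical names made up for illustration, and rows are shown as plain dicts so the snippet runs without pandas.

```python
# Hypothetical rule format: label -> list of (columns, allowed values).
# A clause passes when ANY listed column holds an allowed value;
# a rule passes when ALL of its clauses pass. Rules are tried in order.
RULES = {
    "Yes": [(["colA"], {"yes"}), (["colB"], {"yes", "no"}), (["colC", "colD"], {"yes"})],
    "No": [(["colA"], {"no"}), (["colB"], {"no"}), (["colC", "colD"], {"no"})],
}
DEFAULT = "UNKNOWN"


def classify(row, rules=RULES, default=DEFAULT):
    """Return the label of the first rule whose clauses all match the row."""
    for label, clauses in rules.items():
        if all(any(row.get(col) in allowed for col in cols) for cols, allowed in clauses):
            return label
    return default


# Plain-dict stand-ins for three rows; None plays the role of a missing value.
rows = [
    {"colA": "yes", "colB": "no", "colC": "yes", "colD": "yes"},
    {"colA": "no", "colB": "no", "colC": "no", "colD": None},
    {"colA": "yes", "colB": None, "colC": None, "colD": "no"},
]
labels = [classify(r) for r in rows]
print(labels)  # ['Yes', 'No', 'UNKNOWN']
```

With a DataFrame, the same function could presumably be applied as df.apply(classify, axis=1), since each row is then a Series that also offers .get; this is slower than the vectorized np.select approach but keeps the rules in plain data.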
Tranbi
10/28/2021
#3
Here is a solution that converts your conditions into a Python function and then applies it to the rows of the DataFrame:
import re
condition_string = "colA='yes' & colB='yes' & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
# formatting string as python function apply_cond
for col in df.columns:
    condition_string = re.sub(rf"(\W|^){col}(\W|$)", rf"\1row['{col}']\2", condition_string)
    condition_string = re.sub(rf"row\['{col}'\]\s*=(?!=)", f"row['{col}']==", condition_string)
cond_form = re.sub(r'(:[^[(]+), (?!ELSE)', r'\1\n\telif ', condition_string) \
    .replace(": ", ":\n\t\treturn ") \
    .replace("&", "and") \
    .replace('|', 'or')
cond_form = re.sub(r", ELSE\s*:", "\n\telse:", cond_form)
function_def = "def apply_cond(row):\n\tif " + cond_form
#print(function_def) # uncomment to see how the function is defined
# executing the function definition of apply_cond
exec(function_def)
# applying the function to each row
df["result"]=df.apply(lambda x: apply_cond(x), axis=1)
print(df)
Output:
ID colA colB colC colD result
0 AB01 yes NaN yes yes UNKNOWN
1 AB02 yes yes yes no Yes
2 AB03 yes yes yes yes Yes
3 AB03 no no no no No
4 AB04 no no no NaN No
5 AB05 yes NaN NaN no UNKNOWN
6 AB06 NaN yes NaN NaN UNKNOWN
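You may want to adapt the string formatting to your condition_string (I did this quickly, and some combinations may not be supported), but if you obtain these strings automatically, this saves you from redefining them.
Comments
0 upvotes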
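As an alternative to generating a function with exec, the same string could be split into (condition, label) pairs and each condition handed to DataFrame.eval, with np.select picking the first match. This is only a sketch, not the answerer's method: parse_rules and apply_rules are hypothetical names, and it assumes labels are single-quoted, conditions contain no colon, and an ELSE branch is present. It also relies on the pandas expression parser giving & and | the precedence of and/or, so the unparenthesized comparisons group as intended.

```python
import re

# One rule looks like "<condition>: '<label>'", separated from the next by a comma.
RULE = re.compile(r"(?:^|,)\s*(.+?)\s*:\s*'([^']*)'", re.S)


def parse_rules(condition_string):
    """Split the string into [(condition, label), ...] plus the ELSE label."""
    rules, default = [], None
    for cond, label in RULE.findall(condition_string):
        if cond.upper() == "ELSE":
            default = label
        else:
            # Turn the single "=" used in the input into a "==" comparison.
            rules.append((re.sub(r"(?<![=!<>])=(?!=)", "==", cond), label))
    return rules, default


def apply_rules(df, condition_string):
    """Evaluate every condition against df and return the first matching label."""
    import numpy as np  # imported lazily so parse_rules works without NumPy

    rules, default = parse_rules(condition_string)
    masks = [df.eval(cond) for cond, _ in rules]  # one boolean Series per rule
    labels = [label for _, label in rules]
    return np.select(masks, labels, default=default)


rules, default = parse_rules(
    "colA='yes' & colB='yes': 'Yes', colA='no': 'No', ELSE : 'UNKNOWN'"
)
print(rules)    # [("colA=='yes' & colB=='yes'", 'Yes'), ("colA=='no'", 'No')]
print(default)  # UNKNOWN
```

Usage would then be df["result"] = apply_rules(df, condition_string). Because no code is generated and executed, unsupported syntax shows up as an error from DataFrame.eval instead of a malformed function definition.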
Zakyl
10/28/2021
What if the dictionary is updated to the format above? Can this be made to work for the updated DataFrame scenario above?
0 upvotes
Zakyl
10/28/2021
If I change condition_string to
condition_string = "colA='yes' & colB in ['yes','no'] & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
then function_def does not come back with row attached to colB.
0 upvotes
Tranbi
10/28/2021
There is an error in the second condition_string in your question: the parenthesis is closed before the quote. It should be:
condition_string = "colA='yes' & (colB='yes' | colB='no') & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
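A misplaced quote like this can be caught before any substitution runs by counting parentheses that sit outside single quotes. The helper below is a hypothetical sketch (unbalanced_parens is not part of the answer's code) and assumes values never contain an escaped quote.

```python
def unbalanced_parens(text):
    """Return the net count of parentheses found outside single quotes."""
    depth = 0
    in_quote = False
    for ch in text:
        if ch == "'":
            in_quote = not in_quote  # toggle on every quote character
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
    return depth


bad = "colA='yes' & (colB='yes' | colB='no)' & (colC='yes' | colD='yes'): 'Yes'"
good = "colA='yes' & (colB='yes' | colB='no') & (colC='yes' | colD='yes'): 'Yes'"
print(unbalanced_parens(bad))   # 1
print(unbalanced_parens(good))  # 0
```

A non-zero result means a parenthesis was swallowed by a quoted value or never closed.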
0 upvotes
Tranbi
10/28/2021
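Regarding colB in, I have updated my code, take a look! Like I said, depending on the syntax of your condition_string, you may want to extend the replacements.
0 upvotes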
Zakyl
10/29/2021
How should I update the code if condition_string can also look like the following? I have been trying to extend the replacements, without success:
condition_string = "col.A in ['Osel', 'Quine', 'Lovir (Kaletra)', 'Lan ate', 'Dar/cob']: 'Yes', ELSE: col.B"