通过匹配多个列和条件对数据帧进行排序-解网

问：

输入数据：

| source_element |   source_text     | source_entity  | target_element | target_text  | target_entity  |
|---------------|-------------------|----------------|---------------|-------------|----------------|
|    vvvvvv      |       ['body']    |      name      |     aaaaa      |     ceo      |      code      |
|    vvvvvv      |       ['body']    |      name      |     bbbbb      |     cap      |      code      |
|    vvvvvv      |  ['body', 'fat']  |      name      |     ccccc      |     fat      |      name      |
|    hhhhhh      | ['man', 'hat']    |      name      |     kkkkk      |     hat      |      name      |
|    kkkkk       |      ['hat']      |      name      |     lllll      |     700      |     price      |
|    kkkkk       |      ['hat']      |      name      |     fffff      |     mud      |      code      |
|    eeeee       |     ['strong']    |      name      |     zzzzz      |     300      |     price      |
|    ssssss      |      ['head']     |      name      |     ddddd      |     zip      |      code      |
|    ssssss      | ['head', 'strong']|      name      |     eeeee      |    strong    |      name      |
|    ssssss      |  ['head', 'part'] |      name      |     uuuuu      |     part     |      name      |
|    uuuuu       |      ['part']     |      name      |     xxxxx      |     190      |     price      |

输出：

|  source_text   |  source_entity    |  target_text   |  target_entity    |
|-----------------|----------------|----------------|-------------------|
| ['body', 'fat']|       name        |     'ceo'      |       code        |
| ['body', 'fat']|       name        |     'cap'      |       code        |
| ['man', 'hat'] |       name        |     '700'      |      price        |
| ['man', 'hat'] |       name        |     'mud'      |       code        |
| ['head', 'strong'] |    name        |     'zip'      |       code        |
| ['head', 'strong'] |    name        |     '300'      |      price        |
| ['head', 'part'] |      name        |     'zip'      |       code        |
| ['head', 'part'] |      name        |     '190'      |      price        |

我需要满足以下条件的输出：

1.如果source_entity和target_entity是“name”，则该特定行的source_element或target_element与整个DataFrame中任何行的source_element匹配，将target_text和text_entity替换为匹配的数据帧行，并删除source_entity行，target_entity为“name”。

2.保留其他不受上述条件影响的数据帧行

pandas 数据帧匹配多列

import pandas as pd

cond = (df['source_entity'] == 'name') & (df['target_entity'] == 'name')
bbb = df[~df.index.isin(df[cond].index)]

def fff(x):
    a = bbb[bbb['source_element'] == df.loc[x, 'source_element']]
    b = bbb[bbb['source_element'] == df.loc[x, 'target_element']]
    if len(a) > 0:
        df.loc[a.index, 'source_text'] = df.loc[x, 'source_text']
        return x
    elif len(b) > 0:
        df.loc[b.index, 'source_text'] = df.loc[x, 'source_text']
        return x


aaa = [fff(i) for i in df[cond].index]
ind = df.index.isin(aaa)

print(df[~ind][['source_text', 'source_entity', 'target_text', 'target_entity']])

输出

     source_text source_entity target_text target_entity
0    'body  fat'          name       'ceo'          code
1    'body  fat'          name       'cap'          code
4      'man hat'          name       '700'         price
5  'head strong'          name       'zip'          code

如果需要重置索引：

print(df[~ind][['source_text', 'source_entity', 'target_text', 'target_entity']].reset_index(drop=True))

更新

这里使用了一种完全不同的方法，其中生成（创建）字符串。

“source_text”列数据将转换为字符串。df['source_text'].str.join(',')

和上次一样，我们得到的字符串没有双“名称”。bbb

但是，在函数中，获取了 'source_element'， 'target_element' 的匹配索引，如果找到它们，我们得到 .一切都写在列表中。在行中生成。source_text列是动态创建的。生成的列表由连接。数据帧已创建。fff'source_text'aaaqqqassign(source_text=i[1])np.vstack

import numpy as np
import pandas as pd

df['source_text'] = df['source_text'].str.join(',')

cond = (df['source_entity'] == 'name') & (df['target_entity'] == 'name')
bbb = df[~df.index.isin(df[cond].index)]


def fff(x):
    a = bbb[bbb['source_element'] == df.loc[x, 'source_element']]
    b = bbb[bbb['source_element'] == df.loc[x, 'target_element']]
    if len(a) > 0 or len(b) > 0:
        tex = df.loc[x, 'source_text']
        return [list(a.index) + list(b.index), tex]


aaa = [fff(i) for i in df[cond].index]

qqq = np.array(
    [df.loc[i[0], ['source_entity', 'target_text', 'target_entity']].copy().assign(source_text=i[1]).values for i in aaa])

qqq = np.vstack(qqq)

df1 = pd.DataFrame(qqq, columns=['source_entity', 'target_text', 'target_entity', 'source_text'])

df1['source_text'] = df1['source_text'].str.split(',')

print(df1[['source_text', 'source_entity', 'target_text', 'target_entity']])

输出

   source_text source_entity target_text target_entity
0     body,fat          name         ceo          code
1     body,fat          name         cap          code
2      man,hat          name         700         price
3      man,hat          name         mud          code
4  head,strong          name         zip          code
5  head,strong          name         300         price
6    head,part          name         zip          code
7    head,part          name         190         price

If you need to return the lists back:

df1['source_text'] = df1['source_text'].str.split(',')

Can You modify the code If I include extar input data: [('kkkkk' , 'hat', name, 'fffff', 'mud', code), ('eeeee', 'strong', name, 'zzzzz', '300', price), ('ssssss', 'head', 'part', name, 'uuuuu', 'part,' name),('uuuuu', 'part', name, 'xxxxx', '190', price)] Extra output: [('man', 'hat', name, 'mud', code), ('head', 'strong', name, '300', price), ('head', 'part', name, 'zip' , code), ('head', 'part', name , '190', price)]

0赞 inquirer 5/10/2023

@Anonymous post a table and what you want to get with explanations. By the fact that you have given it is not clear what the table and the expected result consist of.

0赞 inquirer 5/10/2023

To quickly understand: what are the differences from the previous version? And the data in the string is exactly in the list: ['body', 'fat'] or is it a string: 'body', 'fat'?

1赞 Anonymous 5/10/2023

It is exactly like list of strings :['body', 'fat'] and one more difference is I added these two rows in input data, ('ssssss', ['head', 'part'], 'name', 'uuuuu', 'part', 'name'), ('uuuuu', ['part'], 'name', 'xxxxx', '190', 'price');. The output differs from previous version a bit. But the conditions remain same!

1赞 Anonymous 5/10/2023

for idx, row in df[cond].iterrows():     a = df.loc[(df['source_element'] == row['source_element']) & (df['source_entity'] == 'name')]     b = df.loc[(df['source_element'] == row['target_element']) & (df['source_entity'] == 'name')]     source_text = row['source_text']     if len(a) > 0:         df.loc[a.index, 'source_text'] = [source_text] * len(a)     if len(b) > 0:         df.loc[b.index, 'source_text'] = [source_text] * len(b)

I modified the above code like this. But I am missing this row in output: ['head', 'strong'] name 'zip' code

上一个：如果单独数据帧中的列值匹配，则将单独的列相加（如果列值大于零）

下一个：如何遍历 Pandas 中列中的列表以找到匹配项？

通过匹配多个列和条件对数据帧进行排序

Sorting the dataframe by matching multiple columns and conditions

评论

评论