How to flatten a nested list while retaining elements of the list separated by commas in Python?

Asked by: Mehran Asked: 3/24/2023 Last edited by: Mehran Updated: 3/25/2023 Views: 67

Q:

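I would like to flatten the nested list I create below for the keys of the track_ip dictionary (or avoid creating it in the first place), while keeping the values as separate, comma-separated elements.

I have a dataset, df. I am trying to keep track of source_ip and destination_ip in a dictionary called track_ip. I have also created columns that record the time since the last event for a given source_ip, and whether the destination_ip differs from the one in the last event.

For each source_ip used as a key, I want a list of the destination_ip values it refers to (repeats are allowed). I wanted to use the append() method, but it is not available (because the value is a string) unless I wrap the key's value in a list. When I do that, I get a nested list of lists, which I then need to flatten. If I flatten with the method I am using, the elements are no longer kept as whole comma-separated values.

Here is the shortened dataset: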

df.head(10).to_dict('list')



{'source_ip': ['135.b1d10.d1c38.20',
  '135.0777d.04511.237',
  '135.0777d.04511.237',
  '135.b1d10.d1c38.119',
  '135.b1d10.13fe9.56',
  '135.b1d10.d1c38.72',
  '135.b1d10.d1c38.126',
  '135.0777d.04511.237',
  '135.0777d.04511.237',
  '135.0777d.04511.237'],
 'destination_ip': ['135.0777d.04511.237',
  '135.b1d10.13fe9.91',
  '135.b1d10.13fe9.71',
  '135.0777d.04511.237',
  '135.0777d.04511.237',
  '135.0777d.04511.237',
  '135.0777d.04511.237',
  '135.b1d10.d1c38.37',
  '135.b1d10.d1c38.112',
  '135.b1d10.d1c38.20'],
 'start_time': [1415749946,
  1415477729,
  1415702327,
  1415754478,
  1415749597,
  1415745508,
  1415754317,
  1415427333,
  1415584036,
  1415582789]}

Here is my code:

import numpy as np
import pandas as pd

#import the dataframe
df = pd.read_csv('df.csv')

#loop through the data
df.loc[:, 'time_since_last'] = 0
df.loc[:, 'diff_destination_ip'] = 0

last_time = dict()
track_ip = dict()

for i,row in df.iterrows():
    if row['source_ip'] in last_time:
        #record delta time since last time under time_since_last
        df.loc[i,'time_since_last']=df.loc[i,'start_time']-last_time[row['source_ip']]
        #check if destination_ip was different for the source_ip and set value diff_destination_ip to 1
        if row['destination_ip'] not in track_ip[row['source_ip']]:
            df.loc[i,'diff_destination_ip'] = 1
    #record the current time as last time for the source_ip
    last_time[row['source_ip']] = row['start_time']
    #record destination_ip, if source_ip already present add the destination_ip to the list
    if row['source_ip'] in track_ip:
        track_ip[row['source_ip']] = [track_ip[row['source_ip']],row['destination_ip']]
        #flatten nested lists for track_ip[row['source_ip']]
        out = []
        for sublist in track_ip[row['source_ip']]:
            out.extend(sublist)
        track_ip[row['source_ip']] = out
        
    else:
        track_ip[row['source_ip']] = row['destination_ip']

What I am trying to get is the following output for track_ip:

print(track_ip)
{'135.b1d10.d1c38.20': '135.0777d.04511.237', '135.0777d.04511.237': ['135.b1d10.13fe9.91', '135.b1d10.13fe9.71', '135.b1d10.d1c38.37', '135.b1d10.d1c38.112', '135.b1d10.d1c38.20'], '135.b1d10.d1c38.119': '135.0777d.04511.237', '135.b1d10.13fe9.56': '135.0777d.04511.237', '135.b1d10.d1c38.72': '135.0777d.04511.237', '135.b1d10.d1c38.126': '135.0777d.04511.237'}

The actual dataset has 3.5e5 rows. I can't have nested lists in track_ip.

If I flatten with the method I am using, I get the following output:

{'135.b1d10.d1c38.20': '135.0777d.04511.237', '135.0777d.04511.237': ['1', '3', '5', '.', 'b', '1', 'd', '1', '0', '.', '1', '3', 'f', 'e', '9', '.', '9', '1', '1', '3', '5', '.', 'b', '1', 'd', '1', '0', '.', '1', '3', 'f', 'e', '9', '.', '7', '1', '1', '3', '5', '.', 'b', '1', 'd', '1', '0', '.', 'd', '1', 'c', '3', '8', '.', '3', '7', '1', '3', '5', '.', 'b', '1', 'd', '1', '0', '.', 'd', '1', 'c', '3', '8', '.', '1', '1', '2', '1', '3', '5', '.', 'b', '1', 'd', '1', '0', '.', 'd', '1', 'c', '3', '8', '.', '2', '0'], '135.b1d10.d1c38.119': '135.0777d.04511.237', '135.b1d10.13fe9.56': '135.0777d.04511.237', '135.b1d10.d1c38.72': '135.0777d.04511.237', '135.b1d10.d1c38.126': '135.0777d.04511.237'}
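The per-character output above is consistent with how list.extend() treats a string: a str is itself iterable, so extend() adds its characters one at a time. A minimal sketch of the effect, using two addresses from the sample data (the variable names first, second, nested, out and kept are illustrative and not part of the original code):

```python
# Illustrative only: shows why extending a list with a str splits it into characters.
first = '135.b1d10.13fe9.91'
second = '135.b1d10.13fe9.71'

# This mirrors the first time a source_ip repeats in the loop:
# the stored value is still a plain string, so the "nested list" holds two strings.
nested = [first, second]

out = []
for sublist in nested:
    # Each "sublist" is a str here, so extend() iterates over its characters.
    out.extend(sublist)

print(out[:4])  # ['1', '3', '5', '.']

# append() adds each string as a single element instead.
kept = []
for item in nested:
    kept.append(item)

print(kept)  # ['135.b1d10.13fe9.91', '135.b1d10.13fe9.71']
```

Once the value has become a list of characters, later iterations extend it with more characters, which would explain why the whole list for that key ends up split.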

If I don't use the flattening method, I get a nested list for the key '135.0777d.04511.237', as follows:

{'135.b1d10.d1c38.20': '135.0777d.04511.237', '135.0777d.04511.237': [[[['135.b1d10.13fe9.91', '135.b1d10.13fe9.71'], '135.b1d10.d1c38.37'], '135.b1d10.d1c38.112'], '135.b1d10.d1c38.20'], '135.b1d10.d1c38.119': '135.0777d.04511.237', '135.b1d10.13fe9.56': '135.0777d.04511.237', '135.b1d10.d1c38.72': '135.0777d.04511.237', '135.b1d10.d1c38.126': '135.0777d.04511.237'}
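One way to turn a nested value like the one above into a single list is a flatten helper that descends only into lists and treats every string as one element. The sketch below is an assumption about what would fit here, not code from the question; the name flatten is hypothetical. It uses an explicit stack instead of recursion, since the nesting depth grows with each repeated source_ip and could exceed Python's recursion limit on a large dataset.

```python
def flatten(value):
    """Flatten arbitrarily nested lists, keeping each string as a single item."""
    flat = []
    stack = [value]
    while stack:
        item = stack.pop()
        if isinstance(item, list):
            # Push children in reverse so they are popped in their original order.
            stack.extend(reversed(item))
        else:
            # Strings (and any other non-list values) are kept whole.
            flat.append(item)
    return flat


nested = [[[['135.b1d10.13fe9.91', '135.b1d10.13fe9.71'], '135.b1d10.d1c38.37'],
           '135.b1d10.d1c38.112'], '135.b1d10.d1c38.20']

print(flatten(nested))
# ['135.b1d10.13fe9.91', '135.b1d10.13fe9.71', '135.b1d10.d1c38.37',
#  '135.b1d10.d1c38.112', '135.b1d10.d1c38.20']
```

Note that flatten() always returns a list, so a key with a single destination_ip would map to a one-element list rather than a bare string.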
Python Pandas dataframe nested-lists flatten

Comments

1 upvote mozway 3/24/2023
Please provide a minimal reproducible example (df = pd.read_csv('df.csv') is not reproducible), ideally as df = pd.DataFrame(...). You can get a reproducible format with df.head().to_dict('list').
0 upvotes Mehran 3/24/2023
Thanks for the suggestion. I have included the first 10 rows of the dataset.
0 upvotes jqurious 3/24/2023
It looks like your loop is emulating a groupby? For example: track_ip = df.groupby("source_ip")["destination_ip"].agg(np.stack).to_dict()
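0 upvotes Mehran 3/24/2023
@jqurious Yes, you are right. I think I can use groupby for track_ip. However, is there a flattening method that goes from a nested list to a single list while keeping that list's elements separated by commas?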
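0 upvotes jqurious 3/25/2023
Not built in; you would have to use one of the answers here, there are many. You could replace your if/else with track_ip.setdefault(row['source_ip'], []).append(row['destination_ip']) and avoid creating the nesting in the first place. The groupby/stack approach is simpler, though.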
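The setdefault() suggestion in the last comment can be sketched as follows. To keep the example self-contained, it iterates over plain lists holding the first five rows of the sample data rather than over df.iterrows(); the names source_ips and destination_ips are illustrative. In the question's loop, row['source_ip'] and row['destination_ip'] would take their place.

```python
# First five rows of the sample data from the question, as plain lists.
source_ips = ['135.b1d10.d1c38.20', '135.0777d.04511.237', '135.0777d.04511.237',
              '135.b1d10.d1c38.119', '135.b1d10.13fe9.56']
destination_ips = ['135.0777d.04511.237', '135.b1d10.13fe9.91', '135.b1d10.13fe9.71',
                   '135.0777d.04511.237', '135.0777d.04511.237']

track_ip = {}
for source_ip, destination_ip in zip(source_ips, destination_ips):
    # setdefault() returns the existing list for the key, or stores and
    # returns a new empty list, so append() is always called on a list.
    track_ip.setdefault(source_ip, []).append(destination_ip)

print(track_ip['135.0777d.04511.237'])
# ['135.b1d10.13fe9.91', '135.b1d10.13fe9.71']
print(track_ip['135.b1d10.d1c38.20'])
# ['135.0777d.04511.237']
```

Because every value is a list from the first insertion, no nesting is created and no flattening step is needed. One difference from the desired output shown in the question: keys with a single destination_ip map to a one-element list instead of a bare string. That also makes a check such as destination_ip not in track_ip[source_ip] a list membership test rather than a substring test.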

A: No answers yet