Pandas 数据框到多维列表列表 (MxNxD)

Pandas Data frame to multi dimensional list of lists (MxNxD)

提问人:Alejandro 提问时间:8/3/2023 最后编辑:Alejandro 更新时间:8/3/2023 访问量:49

问:

问题:

尝试将以下示例数据集分组为 (2) 个组,即 Case 和 Run,然后将其余列转换为列表。为什么是列表列表?

  1. 每次运行之间的样本时间量可能会有所不同
  2. 每种情况下的运行量可能会有所不同。

请注意,我正在尝试在没有直接循环/枚举的情况下执行此操作。

Case,Run,Timestep,Value0,Value1,ValueN
0,0,0,2.626105446,-0.340988725,1.459768689
0,0,1,-0.200319404,-1.709593368,-1.011965009
0,0,2,-2.433554966,-1.558093989,0.44943261
0,1,0,0.02353679,0.280785593,-0.238524231
0,1,1,0.203162019,-1.502525578,1.40292514
0,1,2,-0.037211304,-0.319558867,-1.322124136
0,2,0,-1.45331218,-0.20914855,0.276561536
0,2,1,-0.304656195,-1.594559675,0.264025112
0,2,2,0.030118014,-0.712998507,-0.044159056
1,0,0,0.237803452,0.153455073,-1.084628161
1,0,1,0.814585507,-0.825123856,0.165566792
1,0,2,0.362165592,-0.12731861,-0.572814142
1,1,0,-0.49709788,-0.499310228,0.082185424
1,1,1,0.326764013,0.914642226,-0.96102063
1,1,2,-0.43253699,0.280958492,0.098690723
1,2,0,-0.854096958,0.559531068,0.547050394
1,2,1,0.331055857,1.021091422,0.032419058
1,2,2,0.712204394,-2.050207709,-0.507117825
2,0,0,-0.545231265,1.112460127,0.156334193
2,0,1,-1.507139756,1.296971255,-1.247736656
2,0,2,-0.661400305,0.50431061,1.24670996
2,1,0,-0.215960487,-0.788252945,0.481096264
2,1,1,-1.233657743,0.315172941,0.862535335
2,1,2,0.371303046,0.770596557,2.017713093
2,2,0,0.282822836,-0.332043452,1.107596241
2,2,1,0.085487262,1.104947912,0.545943326
2,2,2,0.864801017,1.082528278,0.386386756

理想情况下,我的最终结果将是类似于 MxNxD 的,其中 D = [TimeStep, Value0, Value1, ValueN]。因此,如果那时:arr = data

arr[0]= 案例 0 中所有内容的嵌套列表n_runs

arr[0][0]= 运行 0 案例 0 的所有样本的嵌套列表

arr[0][0][0]= 运行 0 个例 0 的 Sample0 (1x4)

所需输出:

请注意格式,因为我手动写出了这个,并复制并粘贴了对象结构的第一级:

[[[[0.0, 2.626105446, -0.340988725, 1.459768689], [1.0, -0.200319404, -1.709593368, -1.011965009], [2.0, -2.433554966, -1.558093989, 0.44943261]], [[0.0, 0.02353679, 0.280785593, -0.238524231], [1.0, 0.203162019, -1.502525578, 1.40292514], [2.0, -0.037211304, -0.319558867, -1.322124136]], [[0.0, -1.45331218, -0.20914855, 0.276561536], [1.0, -0.304656195, -1.594559675, 0.264025112], [2.0, 0.030118014, -0.712998507, -0.044159056]]], [[[0.0, 2.626105446, -0.340988725, 1.459768689], [1.0, -0.200319404, -1.709593368, -1.011965009], [2.0, -2.433554966, -1.558093989, 0.44943261]], [[0.0, 0.02353679, 0.280785593, -0.238524231], [1.0, 0.203162019, -1.502525578, 1.40292514], [2.0, -0.037211304, -0.319558867, -1.322124136]], [[0.0, -1.45331218, -0.20914855, 0.276561536], [1.0, -0.304656195, -1.594559675, 0.264025112], [2.0, 0.030118014, -0.712998507, -0.044159056]]], [[[0.0, 2.626105446, -0.340988725, 1.459768689], [1.0, -0.200319404, -1.709593368, -1.011965009], [2.0, -2.433554966, -1.558093989, 0.44943261]], [[0.0, 0.02353679, 0.280785593, -0.238524231], [1.0, 0.203162019, -1.502525578, 1.40292514], [2.0, -0.037211304, -0.319558867, -1.322124136]], [[0.0, -1.45331218, -0.20914855, 0.276561536], [1.0, -0.304656195, -1.594559675, 0.264025112], [2.0, 0.030118014, -0.712998507, -0.044159056]]]]

电流输出 (UPDATE):

下面有效,但它使用的是嵌套的 for 循环。试图找到一种方法来使其更有效率或“pythonic”。

法典:

import pandas as pd
import numpy as np
import json

COL_NAMES = ['Case', 'Run', 'Timestep', 'Value0', 'Value1', 'ValueN']

case = COL_NAMES[0]
run = COL_NAMES[1]
timestep = COL_NAMES[2]

filepath = 'sample.csv'
with open (filepath, newline='') as raw_data_file:
    df = pd.read_csv(raw_data_file, header=0)


nested_list = []
for cases, group in df.groupby([case])[COL_NAMES[1:]]:
    nested_list.append([])
    for runs, group in group.groupby([run])[COL_NAMES[2:]]:
        nested_list[cases].append([])
        samples = group.to_numpy().tolist()
        nested_list[cases][runs] = samples

data = nested_list

with open("sample.json", "w") as output:
        json.dump((data), output)

输出:

[[[[0.0, 2.626105446, -0.340988725, 1.459768689], [1.0, -0.200319404, -1.709593368, -1.011965009], [2.0, -2.433554966, -1.558093989, 0.44943261]], [[0.0, 0.02353679, 0.280785593, -0.238524231], [1.0, 0.203162019, -1.502525578, 1.40292514], [2.0, -0.037211304, -0.319558867, -1.322124136]], [[0.0, -1.45331218, -0.20914855, 0.276561536], [1.0, -0.304656195, -1.594559675, 0.264025112], [2.0, 0.030118014, -0.712998507, -0.044159056]]], [[[0.0, 0.237803452, 0.153455073, -1.084628161], [1.0, 0.814585507, -0.825123856, 0.165566792], [2.0, 0.362165592, -0.12731861, -0.572814142]], [[0.0, -0.49709788, -0.499310228, 0.082185424], [1.0, 0.326764013, 0.914642226, -0.96102063], [2.0, -0.43253699, 0.280958492, 0.098690723]], [[0.0, -0.854096958, 0.559531068, 0.547050394], [1.0, 0.331055857, 1.021091422, 0.032419058], [2.0, 0.712204394, -2.050207709, -0.507117825]]], [[[0.0, -0.545231265, 1.112460127, 0.156334193], [1.0, -1.507139756, 1.296971255, -1.247736656], [2.0, -0.661400305, 0.50431061, 1.24670996]], [[0.0, -0.215960487, -0.788252945, 0.481096264], [1.0, -1.233657743, 0.315172941, 0.862535335], [2.0, 0.371303046, 0.770596557, 2.017713093]], [[0.0, 0.282822836, -0.332043452, 1.107596241], [1.0, 0.085487262, 1.104947912, 0.545943326], [2.0, 0.864801017, 1.082528278, 0.386386756]]]]
pandas list multidimensional-array group-by nested-lists

评论

0赞 RomanPerekhrest 8/3/2023
你能发布有效输出的片段吗?
0赞 Alejandro 8/3/2023
刚刚用一个部分更新了原始帖子。我还上传了第二次尝试,它更接近预期的输出,但是当我尝试嵌套列出时,“键”是元组。Desired Outputs

答:

1赞 Vitalizzare 8/3/2023 #1

按顺序应用groupby

看起来您需要连续 2 个分组才能转换为列表:

columns = df.columns.drop(['Case','Run'])

data = df.groupby(['Case']).apply(
        lambda df: df.groupby('Run')[columns].apply(pd.Series.tolist)
).to_numpy().tolist()

在递归中应用groupby

以下是我们如何自动做到这一点。假设我们想逐个按 2 列以上分组,每次将结果包装在一个单独的列表中。如果我们将这些列设置为索引,然后在每次迭代中按第一个索引级别(即)对数据进行分组,并将相同的逻辑应用于每个删除的组,该怎么办?在最后一步,我们只应用:groupby(level=0)level=0pd.Series.tolist

tolist = pd.Series.tolist

def curry(df): 
    return [
        curry(gr.droplevel(0)) if gr.index.nlevels > 1 else tolist(gr) 
        for _, gr in df.groupby(level=0)
    ]

data = curry(df.set_index(['Case', 'Run']))

以下是我们如何重写代码以可视化 3 级索引上的递归:

g = df.set_index(['Case','Run','Timestep'])

data = \
   [[[tolist(g) for _, g in g.groupby(level=2)]
                for _, g in g.groupby(level=1)]
                for _, g in g.groupby(level=0)]

为了解决原来的问题,我们将停在两个层面上:

g = df.set_index(['Case','Run'])
data = [[tolist(g) for _, g in g.groupby(level=1)]
                   for _, g in g.groupby(level=0)]

评论

0赞 Alejandro 8/3/2023
谢谢!用数据测试,它有效。