Can I do a "lazy" read with h5py when slicing with a dynamic axis?

Question:

I have an hdf5_generator that returns data like this:

for element_i in range(n_elements):
    img = f['data'][:].take(indices=element_i, axis=element_axis)
    yield img, label, weights

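I slice because h5py does not seem to offer a different way of reading (please correct me if I am wrong). I do it this way (f['data'][:].take(...)) because I want the slicing axis to be dynamic, and I do not know how to do "classic" slicing (f['data'][:, :, element_i, :, :]) with a dynamic axis.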

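But this is very slow! I do not even know what is happening, because the read times fluctuate so much. I assume that for every element_i the whole dataset data is read completely, and that sometimes it happens to still be cached and sometimes not.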

I came up with "cache_full_file" (see the full code below), which more or less solves it:

cache_full_file = False
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024)  image read - Elapsed time: 6.5959 s            # every single read can take long
(4, 1024, 1024)  image read - Elapsed time: 28.0695 s
(4, 1024, 1024)  image read - Elapsed time: 0.6851 s
(4, 1024, 1024)  image read - Elapsed time: 3.3492 s
(4, 1024, 1024)  image read - Elapsed time: 0.5837 s
(4, 1024, 1024)  image read - Elapsed time: 1.0346 s
(4, 1024, 1024)  image read - Elapsed time: 2.5852 s
(4, 1024, 1024)  image read - Elapsed time: 18.7262 s
(4, 1024, 1024)  image read - Elapsed time: 19.1674 s           # ...


cache_full_file = True
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024)  image read - Elapsed time: 15.8334 s           # dataset is read and cached once
(4, 1024, 1024)  image read - Elapsed time: 0.0744 s            # following reads are all fast ...      
(4, 1024, 1024)  image read - Elapsed time: 0.0558 s            # ...

But I cannot rely on the full file/dataset fitting into memory!

Is it possible to do a "lazy" read that takes a slice out of an HDF5 dataset without reading the full dataset?

A simplified version of the class code is:

import h5py

# element_axis and cache_full_file are settings defined elsewhere in the full code.

class hdf5_generator:
    def __init__(self, file, repeat):
        self.file = file    # repeat is unused in this simplified version

    def __call__(self):
        with h5py.File(self.file, 'r') as f:
            # poke the first dataset to get the number of expected elements
            n_elements = f['data'].shape[element_axis]

            if cache_full_file:
                img_eles = f['data'][:]     # read and store the whole dataset in memory
                for element_i in range(n_elements):
                    img = img_eles.take(indices=element_i, axis=element_axis)
                    yield img
            else:
                for element_i in range(n_elements):
                    # access a specific row in the dataset
                    img = f['data'][:].take(indices=element_i, axis=element_axis)
                    # label and weights are not defined in this simplified version
                    yield img, label, weights
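
For illustration, here is a minimal sketch of one way the "classic" slice could be written with a dynamic axis: build the index as a tuple of slice objects and pass it to the dataset directly, so that h5py is asked only for the selected region instead of f['data'][:]. The helper names axis_index, iter_axis and iter_hdf5_slices are hypothetical and not part of h5py, and the sketch assumes the dataset supports NumPy-style tuple indexing and exposes shape and ndim, as h5py datasets do.

```python
def axis_index(ndim, axis, i):
    """Build an index tuple equivalent to [:, ..., i, ..., :] with i at `axis`."""
    index = [slice(None)] * ndim   # one full slice per dimension
    index[axis] = i                # replace the chosen axis with a single index
    return tuple(index)


def iter_axis(dataset, axis):
    """Yield one slice at a time along `axis` from an array-like dataset."""
    for i in range(dataset.shape[axis]):
        # Indexing the dataset object itself (not dataset[:]) requests
        # only this selection from the file.
        yield dataset[axis_index(dataset.ndim, axis, i)]


def iter_hdf5_slices(path, name, axis):
    """Open an HDF5 file and yield slices of dataset `name` along `axis`.

    `path` and `name` are placeholders for the caller's own file and dataset.
    """
    import h5py  # imported here so the helpers above work without h5py

    with h5py.File(path, "r") as f:
        yield from iter_axis(f[name], axis)
```

With the generator above, the inner loop could become something like `for img in iter_axis(f['data'], element_axis): yield img`. How much is actually read from disk still depends on the dataset's chunk layout and compression, so read times may vary with the chosen axis.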

Tags: python, performance, slice, hdf
