Can I do a "lazy" read with h5py when slicing along a dynamic axis?

Asked by: Honeybear  Asked: 3/20/2018  Updated: 3/20/2018  Views: 1204

Q:

I have an HDF5_generator that returns data like this:

for element_i in range(n_elements):
    img = f['data'][:].take(indices=element_i, axis=element_axis)
    yield img, label, weights

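I slice because h5py does not seem to offer a different way of reading (please correct me if I'm wrong). I do it this way (f['data'][:].take(...)) because I want the slicing axis to be dynamic, and I don't know how to do "classic" slicing (f['data'][:, :, element_i, :, :]) with a dynamic axis.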
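For reference, "classic" slicing with a dynamic axis can be expressed by building the index tuple programmatically and passing it to the dataset. A minimal sketch; axis_index is a hypothetical helper name, not part of h5py or NumPy:

```python
def axis_index(ndim, axis, i):
    """Build an index tuple that selects position `i` along `axis`
    and takes a full slice along every other axis."""
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} is out of range for {ndim} dimensions")
    idx = [slice(None)] * ndim   # equivalent to ':' on every axis
    idx[axis % ndim] = i         # '% ndim' maps negative axes to positive ones
    return tuple(idx)


# For a 5-D dataset and axis=2 this is the same index as [:, :, i, :, :].
print(axis_index(5, 2, 7))
```

With h5py this would be used as f['data'][axis_index(f['data'].ndim, element_axis, element_i)], which asks HDF5 for only the selected region instead of loading the whole dataset with [:] first. How much is actually read from disk still depends on the dataset's chunk layout.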

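But this is far too slow! I don't even know what is going on, because the read times fluctuate so much, but I assume that for every element_i the whole dataset data is read completely; sometimes it happens to still be cached, and sometimes it is not.

I came up with "cache_full_file" (see the full code below), which more or less solves it: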

cache_full_file = False
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024)  image read - Elapsed time: 6.5959 s            # every single read can take long
(4, 1024, 1024)  image read - Elapsed time: 28.0695 s
(4, 1024, 1024)  image read - Elapsed time: 0.6851 s
(4, 1024, 1024)  image read - Elapsed time: 3.3492 s
(4, 1024, 1024)  image read - Elapsed time: 0.5837 s
(4, 1024, 1024)  image read - Elapsed time: 1.0346 s
(4, 1024, 1024)  image read - Elapsed time: 2.5852 s
(4, 1024, 1024)  image read - Elapsed time: 18.7262 s
(4, 1024, 1024)  image read - Elapsed time: 19.1674 s           # ...


cache_full_file = True
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024)  image read - Elapsed time: 15.8334 s           # dataset is read and cached once
(4, 1024, 1024)  image read - Elapsed time: 0.0744 s            # following reads are all fast ...      
(4, 1024, 1024)  image read - Elapsed time: 0.0558 s            # ...

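But I cannot rely on the full file/dataset fitting into memory!

Is it possible to do a "lazy" read that takes a slice out of an HDF5 dataset without reading the full dataset?


A simplified version of the class's code is: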

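import h5py

element_axis = 0         # axis to iterate over; 0 matches the (19, 4, 1024, 1024) example
cache_full_file = False  # True: load the whole dataset into memory once

class hdf5_generator:
    def __init__(self, file, repeat):
        self.file = file
        self.repeat = repeat

    def __call__(self):
        with h5py.File(self.file, 'r') as f:
            # poke the first dataset to get the number of expected elements
            n_elements = f['data'].shape[element_axis]

            if cache_full_file:
                img_eles = f['data'][:]     # read and store the whole dataset in memory
                for element_i in range(n_elements):
                    img = img_eles.take(indices=element_i, axis=element_axis)
                    yield img
            else:
                for element_i in range(n_elements):
                    # access a specific row in the dataset;
                    # f['data'][:] loads the full dataset on every iteration
                    img = f['data'][:].take(indices=element_i, axis=element_axis)
                    yield img  # label and weights are omitted in this simplified version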

Tags: Python, performance, slicing, HDF

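Comments

0 upvotes  max9111  3/20/2018
That depends on the chunk shape of your dataset. You can also cache the chunks that have been read and decompressed. Take a look at stackoverflow.com/a/48405220/4045774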
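To illustrate the point about chunk shape: HDF5 reads and decompresses whole chunks, so the cost of one slice depends on how many chunks it intersects. The following is a rough sketch, not a tested solution for this dataset. chunks_touched and iter_slices are hypothetical names, the dataset name "data" comes from the question, and the 64 MiB cache size is an arbitrary example value. It assumes an h5py version whose File constructor accepts the rdcc_nbytes chunk-cache argument.

```python
import math


def chunks_touched(shape, chunks, axis):
    """Number of chunks HDF5 has to read to serve one index along `axis`."""
    n = 1
    for ax, (dim, chunk) in enumerate(zip(shape, chunks)):
        if ax != axis:
            n *= math.ceil(dim / chunk)  # every chunk along the other axes
    return n


def iter_slices(path, axis, name="data", cache_bytes=64 * 1024**2):
    """Yield one slice at a time along `axis` without loading the full dataset."""
    import h5py  # imported here so chunks_touched() is usable without h5py

    # rdcc_nbytes sets the raw chunk cache size for datasets in this file.
    with h5py.File(path, "r", rdcc_nbytes=cache_bytes) as f:
        dset = f[name]
        axis = axis % dset.ndim
        for i in range(dset.shape[axis]):
            idx = [slice(None)] * dset.ndim
            idx[axis] = i
            yield dset[tuple(idx)]  # reads only the chunks this slice intersects


shape = (19, 4, 1024, 1024)
# One chunk per element along axis 0: a slice touches a single chunk.
print(chunks_touched(shape, (1, 4, 1024, 1024), axis=0))  # 1
# Chunks spanning all of axis 0: every slice touches every chunk in the dataset.
print(chunks_touched(shape, (19, 1, 64, 64), axis=0))     # 1024
```

The actual chunk shape can be inspected with f['data'].chunks (None for a contiguous dataset). If each slice intersects most of the chunks, a chunk cache large enough to hold them, or rewriting the file with chunks aligned to the slicing axis, are the usual options.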

A: No answers yet