Asked by: Honeybear | Asked: 3/20/2018 | Updated: 3/20/2018 | Views: 1204
Can I do a "lazy" read with h5py when slicing along a dynamic axis?
Q:
I have an HDF5_generator that returns data like this:
for element_i in range(n_elements):
    img = f['data'][:].take(indices=element_i, axis=element_axis)
    yield img, label, weights
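I slice because h5py doesn't seem to offer a different way of reading (please correct me if I'm wrong), and I do it this way (f['data'][:].take(...)) because I want the slicing axis to be dynamic and don't know how to do "classic" slicing (f['data'][:, :, element_i, :, :]) with a dynamic axis.
But this is far too slow! I don't even know what is happening, because the read times fluctuate so much, but I assume that for every element_i the whole dataset data is read in full; sometimes it happens to still be cached, and sometimes it isn't.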
I came up with "cache_full_file" (see the full code below), which more or less solves it:
cache_full_file = False
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024) image read - Elapsed time: 6.5959 s # every single read can take long
(4, 1024, 1024) image read - Elapsed time: 28.0695 s
(4, 1024, 1024) image read - Elapsed time: 0.6851 s
(4, 1024, 1024) image read - Elapsed time: 3.3492 s
(4, 1024, 1024) image read - Elapsed time: 0.5837 s
(4, 1024, 1024) image read - Elapsed time: 1.0346 s
(4, 1024, 1024) image read - Elapsed time: 2.5852 s
(4, 1024, 1024) image read - Elapsed time: 18.7262 s
(4, 1024, 1024) image read - Elapsed time: 19.1674 s # ...
cache_full_file = True
>> reading 19 elements/rows (poked data with (19, 4, 1024, 1024))
(4, 1024, 1024) image read - Elapsed time: 15.8334 s # dataset is read and cached once
(4, 1024, 1024) image read - Elapsed time: 0.0744 s # following reads are all fast ...
(4, 1024, 1024) image read - Elapsed time: 0.0558 s # ...
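However, I can't rely on the full file/dataset fitting into memory!
Is it possible to do a "lazy" read that takes slices out of an HDF5 dataset without reading the full dataset?
A simplified version of the class's code is: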
import h5py

class hdf5_generator:
    def __init__(self, file, repeat):
        self.file = file

    def __call__(self):
        with h5py.File(self.file, 'r') as f:
            # poke the first dataset to get the number of expected elements
            n_elements = f['data'].shape[element_axis]
            if cache_full_file:
                img_eles = f['data'][:]  # read and store the whole dataset in memory
                for element_i in range(n_elements):
                    img = img_eles.take(indices=element_i, axis=element_axis)
                    yield img
            else:
                for element_i in range(n_elements):
                    # access a specific row in the dataset;
                    # f['data'][:] loads the whole dataset on every iteration
                    img = f['data'][:].take(indices=element_i, axis=element_axis)
                    yield img
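
For illustration, here is a minimal sketch of one way the dynamic-axis slicing could be expressed without the leading [:]. It builds the index tuple at run time, so the equivalent of f['data'][:, :, element_i, :, :] is produced for whatever axis is chosen. The names axis_index and iter_slices are hypothetical helpers, not part of h5py; the sketch only assumes that the dataset object exposes .shape and accepts a tuple of slices and integers in square brackets, which h5py datasets and NumPy arrays both do.

```python
def axis_index(ndim, axis, i):
    """Build an index tuple that picks element i along `axis`
    and takes a full slice along every other axis."""
    index = [slice(None)] * ndim
    index[axis] = i
    return tuple(index)


def iter_slices(dset, axis):
    """Yield one slice at a time along `axis`.

    `dset` is any array-like with `.shape` and `__getitem__`
    (assumed here: an h5py dataset or a NumPy array).
    Each iteration indexes the dataset directly instead of
    loading everything with [:] first.
    """
    ndim = len(dset.shape)
    for i in range(dset.shape[axis]):
        yield dset[axis_index(ndim, axis, i)]
```

With an open file this could be used as: for img in iter_slices(f['data'], element_axis). Indexing an h5py dataset this way requests only the selected region rather than the whole dataset. How fast each read is will still depend on how the dataset is laid out on disk (for example its chunk shape), so this is a starting point rather than a guaranteed fix.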
A: No answers yet