
Cythonize slow parts of the code #204

Open
asmeurer opened this issue Aug 10, 2021 · 2 comments

@asmeurer (Collaborator) commented Aug 10, 2021

For #198 and other performance issues, it may be worth seeing whether we can Cythonize key parts of the library to improve performance. It's not yet clear whether this will work, or whether the performance gains would be significant enough to justify the added complexity of having non-pure-Python parts of the code.
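Purely as an illustration of what that could look like (the function below is hypothetical, not actual versioned_hdf5 code), Cython's "pure Python" mode would let us type a hot loop while keeping the module importable without a compile step:

```python
# Hypothetical sketch, not actual versioned_hdf5 code: a hot inner loop written
# in Cython's "pure Python" mode, so the same file imports and runs uncompiled
# and only gets the C-level speedup once built with cythonize.
import cython
import numpy as np


@cython.boundscheck(False)
@cython.wraparound(False)
def sum_1d(values: cython.double[:]) -> cython.double:
    """Sum a 1-D float64 array with a typed loop (stand-in for a real hot spot)."""
    total: cython.double = 0.0
    i: cython.Py_ssize_t
    for i in range(values.shape[0]):
        total += values[i]
    return total


if __name__ == "__main__":
    print(sum_1d(np.random.rand(1_000_000)))
```

Building it would then just be a matter of adding the module to cythonize() in setup.py; if the speedup doesn't materialize, the same file keeps working as plain Python. The real candidates would be whatever profiling identifies as the hot spots.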

@ericdatakelly ericdatakelly added this to the August 2021 milestone Aug 12, 2021
@ericdatakelly ericdatakelly modified the milestones: August 2021, September 2021 Sep 29, 2021
@ArvidJB (Collaborator) commented Oct 11, 2021

Not sure if you have done this already, but it's probably worth trying to isolate where the time goes across the various levels (a quick profiling sketch for separating the first two follows the list):

  1. versioned_hdf5
  2. h5py
  3. libhdf5
  4. file io
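
One cheap way to separate the first two levels might be to profile a versioned read and see how the cumulative time splits between versioned_hdf5 frames and h5py frames. A rough sketch (it assumes the repro script below has already been run once to create /tmp/foo_versioned.h5):

```python
# Sketch: attribute read time to versioned_hdf5 vs. h5py by profiling a
# whole-dataset read. Paths and dataset names match the repro script below.
import cProfile
import pstats

import h5py
from versioned_hdf5 import VersionedHDF5File


def read_versioned():
    with h5py.File('/tmp/foo_versioned.h5', 'r') as f:
        vf = VersionedHDF5File(f)
        vf[vf.current_version]['bar'][:]


profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    read_versioned()
profiler.disable()

# Sort by cumulative time; the split between versioned_hdf5/* and h5py/*
# frames shows which layer dominates.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(30)
```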

When I run a slightly modified repro from #193:

```python
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

if __name__ == '__main__':
    data = np.arange(3*74*3944).reshape((3, 74, 3944))

    with h5py.File('/tmp/foo_versioned.h5', 'w') as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version('r0') as sv:
            sv.create_dataset('bar', data=data, chunks=(900, 3, 3))

    with h5py.File('/tmp/foo_unversioned.h5', 'w') as f:
        f.create_dataset('bar', data=data, chunks=(900, 3, 3), maxshape=(None, None, None))

    # takes about 19 seconds for me
    for _ in range(10):
        with h5py.File('/tmp/foo_versioned.h5', 'r') as f:
            vf = VersionedHDF5File(f)
            a = vf[vf.current_version]['bar'][:]
            assert a.shape == (3, 74, 3944)

    # takes about 9 seconds
    for _ in range(10):
        with h5py.File('/tmp/foo_unversioned.h5', 'r') as f:
            a = f['bar'][:]
            assert a.shape == (3, 74, 3944)
```

It seems that the versioned_hdf5 layer adds significant overhead on top of plain h5py, roughly 2x for whole-dataset reads in this example. Let's try to bring that down.

It's probably worth writing the equivalent HDF5 code in C to compare it with h5py, to see how much overhead h5py adds. Maybe there are some significant savings there as well?
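
Before writing C, a cheaper intermediate data point might be to time h5py's Dataset.read_direct path, which skips some of the high-level `__getitem__` machinery. A rough sketch, reusing the unversioned file from the repro above:

```python
# Sketch: time a whole-dataset read through Dataset.read_direct as a rough
# proxy for how much of the ~9 s is h5py overhead vs. libhdf5/file I/O.
# Reuses /tmp/foo_unversioned.h5 from the repro above.
import time

import h5py
import numpy as np

with h5py.File('/tmp/foo_unversioned.h5', 'r') as f:
    dset = f['bar']
    out = np.empty(dset.shape, dtype=dset.dtype)
    start = time.perf_counter()
    for _ in range(10):
        dset.read_direct(out)
    print(f"read_direct x10: {time.perf_counter() - start:.2f} s")
```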

@peytondmurray (Collaborator)

I can look into optimizations here. Without knowing exactly what's causing the slow performance, I can spend 10h investigating and pursuing the simpler optimizations. If I need more time, I'll report back here with a more in-depth assessment. Does that sound good?

@peytondmurray peytondmurray self-assigned this Jun 6, 2023