Towards an HDF5 type provider in FSharp.Data.Toolbox? #34
Comments
I spoke to Barbara Jones from the HDFGroup yesterday about updates to the C# library, and got this reply:
|
I spent a while talking with @memura to prototype what the programming model supported by the type provider might look like by jotting out what would be allowed. It's a variation of the approach used in Robert Nielsen's prototype:
|
Thanks for opening the discussion. I'll register my interest in using and contributing to it. There is likely to be some overlap with my "day job" over the coming months. I also suggest considering a netCDF data provider as a related line of work. The two are bound to benefit from shared software design, at least partially. Another piece of prior art that may be of interest is SDS: Scientific DataSet library and tools, the output of a Microsoft Research project if I remember right. |
I'm definitely interested in an HDF5 type provider. I may be able to help with testing. I'd also really like to see a netCDF provider, as I have a lot of legacy code and files using netCDF. |
Let's get this project underway. Can someone help to set up a skeleton project so that we can build a toolbox along the lines of the SAS provider in FSharp.Data.Toolbox? Also, in addition to the sources mentioned above by @dsyme, there is the Python project h5py, which handily lists its interface to HDF5 and complements the list in Rodhern's code.

I love the efficiency of HDF5 and the fact that so many different systems can read the files (in addition to the ones mentioned above, they can be read in Matlab and Mathematica too). That said, the C/C++ API from the HDF Group is pretty low level, and it can be tricky to understand some of the subtleties of its usage; I have been bitten a few times by not using the correct parameter in certain calls.

The breadth of possibility with HDF5 is immense, so our challenge will be to come up with the most useful and efficient type provider we can. I think we should start with numerical data and string arrays, and focus on the HDF5 dataset (we can look at HDF5 attributes later). Also, h5 files are self-describing, so we can determine the structure of the data from the files themselves. My experience with type providers is minimal, but we should be able to exploit this descriptive information somehow. Compound datatypes are a bit trickier but could possibly be returned as tuples or records.

In addition to returning all the data in a dataset, it would be useful to obtain slabs and to select by elements or masks. I am also quite keen on being able to provide various sub-dimensions of the data as a sequence (or IEnumerable), perhaps with a BlockingCollection type of approach that reads data in slabs for efficiency of both memory and CPU usage. A lot to think about ... Let's do it! |
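To make the slab-as-sequence idea a bit more concrete, here is a minimal sketch. It does not touch HDF5 at all: `readSlab` is a hypothetical stand-in that slices an in-memory `float[,]`, where a real wrapper would presumably issue a hyperslab read against the file. The names `readSlab`, `slabs`, and `demo` are made up for illustration.

```fsharp
/// Hypothetical stand-in for a hyperslab read: returns `count` rows starting
/// at row `start`. A real wrapper would call into the HDF5 library here.
let readSlab (data: float[,]) (start: int) (count: int) : float[,] =
    data.[start .. start + count - 1, *]

/// Lazily yields the dataset in row slabs of at most `slabRows` rows.
/// Nothing is read until the sequence is enumerated. `slabRows` must be > 0.
let slabs (slabRows: int) (data: float[,]) : seq<float[,]> =
    seq {
        let total = Array2D.length1 data
        for start in 0 .. slabRows .. total - 1 do
            // The last slab may be shorter than slabRows.
            yield readSlab data start (min slabRows (total - start))
    }

// Usage: a 5 x 2 "dataset" consumed two rows at a time.
let demo = Array2D.init 5 2 (fun r c -> float (r * 2 + c))

for slab in slabs 2 demo do
    printfn "slab with %d rows, first value %g" (Array2D.length1 slab) slab.[0, 0]
```

Because the result is an ordinary `seq`, only one slab needs to be in memory at a time. A bounded BlockingCollection could sit between the reader and the consumer to prefetch the next slab, but that adds shutdown and error-handling questions that this sketch leaves out.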
Is the project already under way? |
FYI: HDF5 and .NET: One step back, two steps forward https://hdfgroup.org/wp/2016/01/hdf5-net-one-step-back-two-steps-forward/ |
Hi Brad Daniel Egloff Dr. sc. math. InCube Group Rosenweg 3 | CH-6340 Baar/Zug | Switzerland Brandschenkestrasse 41 | CH-8002 Zurich | Switzerland Phone +41 41 501 41 62 Mobile +41 79 430 03 61 The information contained in this message is for the intended recipient’s On 28 January 2016 at 15:23, Brad Jones [email protected] wrote:
|
A thin wrapper as described above would be an extremely useful jumping off point to build high level interfaces to HDF5. |
I agree, notably HdfDotNet works fine but fails badly in the F# interactive … HdfDotNet is OK from a design point of view, but not available for the … We use quite large data set and have to use chunking and other advanced … Having all in the FSI would give us a very productive toolset for … Daniel Egloff On 28 January 2016 at 15:57, Wayne Tanner [email protected] wrote:
|
I have forked the project to start the HDF5 type provider, but as @waynetanner has pointed out, the first step is to produce a lightweight wrapper for it in F# ... I have been thinking about this, and it is probably best to be able to generate signatures like ... [<DllImport(HDF5x64DLL, EntryPoint="H5Lget_name_by_idx", CharSet=CharSet.Ansi, CallingConvention=CallingConvention.StdCall)>] ... directly from the HDF5 header files ... H5_DLL ssize_t H5Lget_name_by_idx(hid_t loc_id, const char *group_name, ... as this would enable us to keep up with changes in the HDF5 C codebase as new versions are released. We would have to write some code to parse these out of the headers. The enumerations, types and constants are a little trickier because they appear to be much more irregular and so more difficult to parse, but they are probably less changeable. Any tips on how to do this efficiently would be really helpful. Robert Nielsen's code is a great place to start. |
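As a starting point for the header-parsing idea, here is a rough sketch that pulls `H5_DLL` function prototypes out of header text with a regular expression and prints F# `extern` declarations using the attribute form quoted above. Everything in it is an assumption for illustration: the names `Prototype`, `parseHeader`, `mapType`, and `render` are made up, the C-to-F# type map is a guess that would need checking against the HDF5 version in use, and the two sample prototypes merely stand in for a real header file. It does not handle function-pointer or array parameters, enums, or constants.

```fsharp
open System.Text.RegularExpressions

/// A parsed C prototype: return type, function name, and (type, name) parameters.
type Prototype =
    { Return: string
      Name: string
      Parameters: (string * string) list }

// Matches declarations such as `H5_DLL herr_t H5Fclose(hid_t file_id);`
let prototypeRegex =
    Regex(@"H5_DLL\s+(?<ret>[\w\s\*]+?)\s*\b(?<name>H5\w+)\s*\((?<args>[^)]*)\)\s*;")

/// Splits one C parameter such as "const char *filename" into (type, name).
let parseParameter (text: string) =
    let m = Regex.Match(text.Trim(), @"^(?<type>.*?)(?<name>\w+)$")
    m.Groups.["type"].Value.Trim(), m.Groups.["name"].Value

/// Extracts all H5_DLL prototypes from header text, ignoring /* ... */ comments.
let parseHeader (header: string) : Prototype list =
    let withoutComments =
        Regex.Replace(header, @"/\*.*?\*/", "", RegexOptions.Singleline)
    [ for m in prototypeRegex.Matches(withoutComments) |> Seq.cast<Match> do
        let parameters =
            m.Groups.["args"].Value.Split(',')
            |> Array.map (fun a -> a.Trim())
            |> Array.filter (fun a -> a <> "" && a <> "void")
            |> Array.map parseParameter
            |> List.ofArray
        yield { Return = m.Groups.["ret"].Value.Trim()
                Name = m.Groups.["name"].Value
                Parameters = parameters } ]

/// Illustrative type map (an assumption: hid_t, for one, differs in width
/// between HDF5 releases). Unmapped types get a TODO_ marker so the
/// generated code will not compile until someone resolves them by hand.
let mapType (cType: string) =
    match cType.Replace(" ", "") with
    | "hid_t" -> "int64"
    | "herr_t" -> "int"
    | "unsigned" -> "uint32"
    | "hsize_t" -> "uint64"
    | "size_t" -> "unativeint"
    | "ssize_t" -> "nativeint"
    | "constchar*" -> "string"
    | other -> "TODO_" + other.Replace("*", "_ptr")

/// Renders a prototype as F# source text in the attribute style used in this thread.
let render (p: Prototype) =
    let args =
        p.Parameters
        |> List.map (fun (t, n) -> sprintf "%s %s" (mapType t) n)
        |> String.concat ", "
    sprintf "[<DllImport(HDF5x64DLL, EntryPoint=\"%s\", CharSet=CharSet.Ansi, CallingConvention=CallingConvention.StdCall)>]\nextern %s %s(%s)"
        p.Name (mapType p.Return) p.Name args

// Usage: two sample prototypes standing in for a real header.
let sample =
    "H5_DLL hid_t H5Fopen(const char *filename, unsigned flags, hid_t access_plist);\n"
    + "H5_DLL herr_t H5Fclose(hid_t file_id);"

parseHeader sample |> List.iter (render >> printfn "%s")
```

Keeping the parse step and the type map separate means the irregular parts (enums, constants, out-buffers) can be handled by hand in `mapType` while the bulk of the signatures are regenerated whenever a new HDF5 release changes the headers.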
Note the following activity https://hdfgroup.org/wp/2016/01/hdf5-net-one-step-back-two-steps-forward/ I sent a mail to Gerd Heber from the HDF Group who runs the project. He ... directly from the HDF5 header files: we use that approach for other … I used Robert Nielsen's code as a basis for a slightly refactored version and I could put the code somewhere in a git repo or share it. Nothing really … Let me know. D. Daniel Egloff On 29 January 2016 at 23:48, memura [email protected] wrote:
|
I would definitely argue that the basic HDF5 wrapper library should be usable separate from the type provider. The vast majority of the HDF5 files I've dealt with have known formats and "discovery" isn't really necessary. For use in something like the F# Data Toolbox on the other hand, the type provider would be amazingly useful. |
Fully agree, totally on the same page. Also using type provider is not always convenient (during development, it … Daniel Egloff On 31 January 2016 at 16:23, Wayne Tanner [email protected] wrote:
|
Has anyone pursued this further? I was thinking of doing the same and looking for prior work on which to build. Someone pointed me to Robert Nielsen's type provider, and from there I found this thread. I'd like to help. |
I wonder whether http://www.hdfql.com/ might be a better path? It seems to offer a much simpler high-level access into the HDF5 API, and they already have a C# wrapper. They've mentioned on Twitter that it could be simple to add an F# wrapper. |
There are discussions about an HDF5 type provider, for example see the discussion and links here https://twitter.com/RodhernACT/status/662202241768665088
FSharp.Data.Toolbox would be a natural eventual landing home for this given the inclusion of an SAS type provider in this project already.
Here are some relevant resources. If you know of more please chime in below