Towards an HDF5 type provider in FSharp.Data.Toolbox? #34

Open
dsyme opened this issue Nov 6, 2015 · 16 comments

@dsyme
Contributor

dsyme commented Nov 6, 2015

There are discussions about an HDF5 type provider, for example see the discussion and links here https://twitter.com/RodhernACT/status/662202241768665088

FSharp.Data.Toolbox would be a natural eventual landing home for this given the inclusion of an SAS type provider in this project already.

Here are some relevant resources. If you know of more please chime in below

  • You can find out more about HDF5 here https://www.hdfgroup.org/HDF5/
  • The HDF5 User Guide is long but very comprehensive
  • There is an existing wrapper of the HDF5 native binaries for .NET. This looks a little old (circa 2005?) and some users have reported issues with using this wrapper. It's also not clear whether the library is under active development.
  • There is also a relevant PowerShell adaptor for HDF5 that is more recent (circa 2011?) which uses PowerShell's "provider" model to give a file-system like embedding of the information space into the programming model. This may provide inspiration for what an HDF5 type provider for F# can look like.
  • Robert Nielsen (twitter) has done a prototype HDF5 provider and is looking at getting the source out onto github. Having the source out as a compiling project would be a great starting base for a next round of work on this. Robert has blogged about F# and the prototype HDF5 provider. The source is linked from the blog.
  • @memura is also interested in participating in this work and we're opening this issue as a place to discuss what the design of the type provider would be, e.g. what would be the experience of using typical HDF5 data files and what would it mean for the provider to be high performance, complete etc.
  • There is a python pandas reader for HDF5 that may act as inspiration. For example, Deedle may want to include HDF5 access. (Coincidentally, the other data formats being read by Pandas mentioned on that page may also be of interest for FSharp.Data and Deedle over time)
  • There is an HDF5 interface to R
@dsyme dsyme changed the title Charting out an HDF5 type provider Towards an HDF5 type provider in FSharp.Data.Toolbox? Nov 6, 2015
@jimmyheaddon

I spoke to Barbara Jones from the HDFGroup yesterday about updates to the C# library, and got this reply:

We do not have funding to support HDF5DotNet C#, and do not currently have any plans to update it. I do know that there are users who are building and using HDF5DotNet successfully with our current release, HDF5-1.8.15-patch1.

@dsyme
Contributor Author

dsyme commented Nov 6, 2015

I spent a while talking with @memura to prototype what the programming model supported by the type provider might look like by jotting out what would be allowed. It's a variation of the approach used in Robert Nielsen's prototype:

type SomeHDFFile = HDF5File<"SomeTemplate.H5"> 

// Get the input file
let file1 = SomeHDFFile.Contents

// Use the input file as a template
let file2 = SomeHDFFile.Load("SomeActualData.h5")

// Load many files using the input file as a template
let files = [ for file in Directory.GetFiles("*.h5") -> SomeHDFFile.Load(file) ]

// Getting a data set using a static name
file1.``/group1`` 
file1.``/group1/group2`` 
file1.``/group1/group2/dataset1`` 
file1.``/group1/group2/dataset2`` 
file1.``/group1/group3/dataset1`` 

// Alternatively we could use this but it may not be better except 
// when there are vast numbers of datasets (> 10,000) in the file 
//
// file1.group1.group2.matrix1 

// Note: The data type and rank of each dataset in the template is known to the type 
// provider and used in the provided types for each dataset - see below.
//
// Note: The dimension sizes of data set in the template are also statically known and 
// can be given in a comment.

// Replacing the contents of a data set using a static name
file1.``/group1/group3/dataset1`` <- values

// Writing data sets using a dynamic name
file1.WriteDataSet("/group1/group2/matrix2", anyDataSet)

// Getting a data set using a dynamic name
file1.GetDataset("/group1/group2/matrix1") 

// All data set objects would support an appropriate set of
// slicing, dicing operations given its dimensionality
file1.``/group1/group2/matrix1``.Rows 

// on-demand reading
file1.``/group1/group2/matrix1``  // this would not read data
file1.``/group1/group2/matrix1``.[0,0]  // this would read data  

file1.``/group1/group2/matrix1``.[3..4,5..6]  // this would not read data
file1.``/group1/group2/matrix1``.[3..4,5..6].[0,0]  // this would read data  

// slicing 
file1.``/group1/group2/matrix1``.[3..4,*]
file1.``/group1/group2/matrix1``.[3..4,*]
file1.``/group1/group2/matrix1``.[3,*]    
file1.``/group1/group2/matrix1``.[3,*]
file1.``/group1/group2/matrix1``.[3..4,5..6]
file1.``/group1/group2/matrix1``.[3..4,5..6] <- matrix

// There is a large space of matrix types.  Some of these can be baked, 
// some can be provided over a base type.

file1.``/group1/group2/matrix1``    // type is approx  HDF5.MatrixInt_Size_Precision_Offset_Pad_ByteOrder_Signedness
file1.``/group1/group2/matrix2``  // type is approx HDF5.MatrixFloat_Size_Precision_blah_blah
file1.``/group1/group2/matrix3``  // type is approx HDF5.MatrixChar_ASCII
file1.``/group1/group2/matrix4``  // type is approx HDF5.MatrixChar_UTF8
file1.``/group1/group2/matrix5``  // type is approx HDF5.MatrixBits_Size_Precision_Offset_Pad_Bytes
file1.``/group1/group2/matrix6``  // type is approx HDF5.MatrixOpaque_Size_Precision_Offset_Pad_Bytes_Tag
file1.``/group1/group2/matrix7``  // type is approx HDF5.MatrixEnumeration_Elements_Values
file1.``/group1/group2/matrix8``  // type is approx HDF5.MatrixReference  // to object or region within the HDF5 file
file1.``/group1/group2/matrix9``  // type is approx HDF5.MatrixArray_Dimensions_Sizes_BaseType  
file1.``/group1/group2/matrix10``  // type is approx HDF5.MatrixArray_1D_Variable 
file1.``/group1/group2/matrix11``  // type is approx HDF5.MatrixCompound_member1_type1____memberN_typeN

// Get a selection of data sets, weakly typed
file1.GetDatasets("/group1/*/*")    // type is : seq<string * HDF5.Dataset>

// Get all the groups, weakly typed
file1.Groups    // type is seq<HDF5.Group>

// Get all the datasets, weakly typed
file1.Datasets      // type is seq<HDF5.Dataset>
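
The on-demand reading and slicing behaviour described in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: `LazyMatrix` and `readCell` are hypothetical names, an in-memory array stands in for the HDF5 file, and a real provider would read hyperslabs rather than single cells.

```fsharp
// Hypothetical sketch: LazyMatrix and readCell are illustrative names only.

/// A rectangular view over a 2D dataset. Creating or narrowing a view is free;
/// only element access calls the underlying reader.
type LazyMatrix<'T>(readCell: int -> int -> 'T, rowOffset: int, colOffset: int, rows: int, cols: int) =
    member this.Rows = rows
    member this.Cols = cols

    /// Element access: the only operation that touches the data source.
    member this.Item
        with get (r: int, c: int) =
            if r < 0 || r >= rows || c < 0 || c >= cols then
                invalidArg "index" "index outside the view"
            readCell (rowOffset + r) (colOffset + c)

    /// Supports m.[r1..r2, c1..c2] and m.[r1..r2, *]; returns a narrower view and reads nothing.
    member this.GetSlice(r1: int option, r2: int option, c1: int option, c2: int option) =
        let rowStart = defaultArg r1 0
        let rowEnd = defaultArg r2 (rows - 1)
        let colStart = defaultArg c1 0
        let colEnd = defaultArg c2 (cols - 1)
        LazyMatrix<'T>(readCell, rowOffset + rowStart, colOffset + colStart,
                       rowEnd - rowStart + 1, colEnd - colStart + 1)

// An in-memory array stands in for the file; `reads` counts element reads.
let reads = ref 0
let source = Array2D.init 10 10 (fun r c -> r * 10 + c)
let countingRead r c =
    reads.Value <- reads.Value + 1
    source.[r, c]

let matrix = LazyMatrix<int>(countingRead, 0, 0, 10, 10)
let view = matrix.[3..4, 5..6]   // narrowing the view performs no read
let readsBefore = reads.Value
let first = view.[0, 0]          // a single read, of source.[3, 5]
```

The mixed form `m.[3, *]` would need an additional `GetSlice` overload taking a fixed row index; it is left out here to keep the sketch short.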

@jmp75

jmp75 commented Nov 8, 2015

Thanks for opening the discussion. I'll register my interest in using and contributing to it. There is likely to be an overlap with my "day job" over the coming months.

I also suggest considering a netCDF data provider as a related line of work. I'd think they are bound to benefit from shared software design, at least partially.

Another work I know of and may be of interest in the roundup of prior art is SDS: Scientific DataSet library and tools, output of a Microsoft Research project if I remember right.

@waynetanner

I'm definitely interested in an HDF5 type provider. I may be able to provide help with testing. I'd also really like to see a netCDF provider as I have a lot of legacy code and files using netCDF.

@memura

memura commented Nov 28, 2015

Let's get this project underway. Can someone help set up a skeleton project so that we can build a toolbox along the lines of the SAS provider in FSharp.Data.Toolbox?

Also, in addition to the sources mentioned above by @dsyme, there is the Python project h5py, which handily lists the interface to hdf5 and complements the list in Rodhern's code.

I love the efficiency of hdf5 and the fact that so many different systems can read them (in addition to the ones mentioned above, they can be read in Matlab and Mathematica too). That said, the C/C++ API from HDFGroup is pretty low level and it can sometimes be tricky to understand some of the subtleties of its usage; I have been bitten a few times by not using the correct parameter in certain calls.

The breadth of possibility with hdf5 is immense, so our challenge will be to come up with the most useful and efficient type provider we can. I think we should start with numerical data and string arrays, and focus on the hdf5 dataset (we can look at hdf5 attributes later). Also, h5 files are self-describing, so we can determine the structure of the data from them. My experience with type providers is minimal, but we should be able to exploit this descriptive information somehow. Compound datatypes are a bit trickier but could possibly be returned as tuples or records.
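
To make the idea of exploiting the self-describing structure a little more concrete, here is a minimal sketch of how datatype information read from a file might be modelled and mapped to F# types, with compound types rendered as record shapes. The `H5Type` union and `fsharpTypeName` are hypothetical names covering only a small subset of HDF5 datatype classes; nothing here calls a real HDF5 library.

```fsharp
/// Illustrative model of a few HDF5 datatype classes (all names are assumptions).
type H5Type =
    | Integer of bits: int * isSigned: bool
    | Float of bits: int
    | FixedString of length: int
    | Compound of fields: (string * H5Type) list

/// Suggest an F# type name for a dataset element of the given HDF5 type.
let rec fsharpTypeName (t: H5Type) : string =
    match t with
    | Integer (8, true) -> "sbyte"
    | Integer (8, false) -> "byte"
    | Integer (16, true) -> "int16"
    | Integer (16, false) -> "uint16"
    | Integer (32, true) -> "int"
    | Integer (32, false) -> "uint32"
    | Integer (64, true) -> "int64"
    | Integer (64, false) -> "uint64"
    | Integer (bits, _) -> failwithf "unsupported integer width %d" bits
    | Float 32 -> "float32"
    | Float 64 -> "float"
    | Float bits -> failwithf "unsupported float width %d" bits
    | FixedString _ -> "string"
    | Compound fields ->
        // Render a compound type as a record shape, recursing into member types.
        fields
        |> List.map (fun (name, ty) -> sprintf "%s: %s" name (fsharpTypeName ty))
        |> String.concat "; "
        |> sprintf "{ %s }"

// A compound element with a 64-bit float and a signed 32-bit integer member.
let sample = Compound [ "time", Float 64; "id", Integer (32, true) ]
let rendered = fsharpTypeName sample
```

A type provider would presumably generate provided types rather than strings, but the same case analysis over the datatype class would sit underneath.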

In addition to returning all the data in a dataset, it would be useful to obtain slabs and select by elements, or masks, and I am quite keen on being able to provide various sub-dimensions of the data as a sequence (or IEnumerable) perhaps with a BlockingCollection type of approach that reads data in slabs for efficiency both of memory and cpu usage.
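
As a rough sketch of the slab-wise sequence idea, the following exposes a 1D dataset as a lazy `seq` of slabs. It is the simpler pull-based variant, not the `BlockingCollection` approach mentioned above. `readInSlabs` and `readSlab` are hypothetical names, and an in-memory array stands in for the actual hyperslab read.

```fsharp
/// Stream a 1D dataset of `total` elements in slabs of at most `slabSize` elements.
/// `readSlab offset count` is a hypothetical stand-in for a hyperslab read.
let readInSlabs (slabSize: int) (total: int) (readSlab: int -> int -> 'T[]) : seq<'T[]> =
    if slabSize <= 0 then invalidArg "slabSize" "slab size must be positive"
    seq {
        for offset in 0 .. slabSize .. total - 1 do
            // The last slab may be shorter than slabSize.
            yield readSlab offset (min slabSize (total - offset))
    }

// An in-memory array stands in for the dataset; slabReads counts slab reads.
let data = [| 0 .. 9 |]
let slabReads = ref 0
let fakeRead offset count =
    slabReads.Value <- slabReads.Value + 1
    Array.sub data offset count

let slabs = readInSlabs 4 data.Length fakeRead

// Flatten the slabs and take five elements; only the slabs needed are read.
let firstFive = slabs |> Seq.concat |> Seq.take 5 |> Seq.toList
```

Because the sequence is lazy, enumerating it again would repeat the reads; a caller that needs the data more than once would want to cache the slabs.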

A lot to think about ...

Let's do it!

@degloff

degloff commented Jan 28, 2016

Is the project already under way?

@bradjonesca

FYI: HDF5 and .NET: One step back, two steps forward https://hdfgroup.org/wp/2016/01/hdf5-net-one-step-back-two-steps-forward/

@degloff

degloff commented Jan 28, 2016

Hi Brad
Thanks, that is very interesting.
Daniel


@waynetanner

A thin wrapper as described above would be an extremely useful jumping off point to build high level interfaces to HDF5.

@degloff

degloff commented Jan 28, 2016

I agree. Notably, HdfDotNet works fine in general but fails badly in F# Interactive,
with errors produced in the HDF core lib, such as "data type cannot be copied"
and other strange things.

HdfDotNet is OK from a design point of view, but not available for the
latest version. The performance is also OK.

We use quite large data sets and have to use chunking and other advanced
techniques. We also access the files from both Windows and Linux, and combine
this with PlotLy to visualize the data.

Having all in the FSI would give us a very productive toolset for
experimentation.


@memura

memura commented Jan 29, 2016

I have forked the project to start the HDF5 type provider, but as @waynetanner has pointed out, the first step is to produce a lightweight wrapper for it in F# ... I have been thinking about this and it is probably best to be able to generate signatures like ...

[< DllImport(HDF5x64DLL, EntryPoint="H5Lget_name_by_idx", CharSet=CharSet.Ansi, CallingConvention=CallingConvention.StdCall) >]
extern int H5LGetNameByIdx_x64(int locid, [< MarshalAs(UnmanagedType.LPStr) >] StringBuilder groupName, int index_field, int iter_order, uint64 n, [<In>][<Out>] byte[] namebuffer, int namebufferlength, int lapl_id)

... directly from the HDF5 header files ...

H5_DLL ssize_t H5Lget_name_by_idx(hid_t loc_id, const char *group_name,
    H5_index_t idx_type, H5_iter_order_t order, hsize_t n,
    char *name /*out*/, size_t size, hid_t lapl_id);

... as this would enable us to keep up with changes in the HDF5 C codebase as new versions are released. We would have to write some code to parse these out of the headers.
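
A very rough sketch of what such header parsing might look like is below. It handles only simple one-declaration prototypes of the `H5_DLL` form shown above. The names `mapType`, `parsePrototype` and `toExtern`, the type-mapping table, the DLL name `hdf5.dll`, and the choice of `Cdecl` are all assumptions for illustration; in particular `hid_t` is mapped to `int`, which matches HDF5 1.8 but not later releases where it is 64-bit.

```fsharp
open System.Text.RegularExpressions

/// Map a C type from the header to an F# P/Invoke type (illustrative table only).
let mapType (cType: string) : string =
    match Regex.Replace(cType, @"\s+", " ").Trim().Replace(" *", "*") with
    | "hid_t" -> "int"                          // assumption: HDF5 1.8 width
    | "ssize_t" -> "nativeint"
    | "size_t" -> "unativeint"
    | "hsize_t" -> "uint64"
    | "const char*" -> "string"
    | "char*" -> "System.Text.StringBuilder"    // output buffer
    | "H5_index_t" | "H5_iter_order_t" -> "int" // C enums
    | other -> failwithf "no mapping for C type '%s'" other

type Prototype =
    { ReturnType: string
      Name: string
      Parameters: (string * string) list }      // (mapped type, parameter name)

/// Parse a single "H5_DLL ret name(args);" declaration.
let parsePrototype (decl: string) : Prototype =
    let noComments = Regex.Replace(decl, @"/\*.*?\*/", "")
    let m = Regex.Match(noComments, @"H5_DLL\s+(\w+)\s+(\w+)\s*\(([^)]*)\)\s*;")
    if not m.Success then failwithf "not a recognised prototype: %s" decl
    let parameters =
        m.Groups.[3].Value.Split(',')
        |> Array.map (fun p ->
            // The last identifier is the parameter name; everything before it is the type.
            let pm = Regex.Match(p.Trim(), @"^(.*?)(\w+)$", RegexOptions.Singleline)
            mapType pm.Groups.[1].Value, pm.Groups.[2].Value)
        |> Array.toList
    { ReturnType = mapType m.Groups.[1].Value
      Name = m.Groups.[2].Value
      Parameters = parameters }

/// Render the prototype as F# source text for an extern declaration.
let toExtern (dll: string) (p: Prototype) : string =
    let args =
        p.Parameters
        |> List.map (fun (t, n) -> sprintf "%s %s" t n)
        |> String.concat ", "
    sprintf "[<DllImport(\"%s\", CallingConvention = CallingConvention.Cdecl)>]\nextern %s %s(%s)"
        dll p.ReturnType p.Name args

let header = """H5_DLL ssize_t H5Lget_name_by_idx(hid_t loc_id, const char *group_name,
    H5_index_t idx_type, H5_iter_order_t order, hsize_t n,
    char *name /*out*/, size_t size, hid_t lapl_id);"""

let parsed = parsePrototype header
let generated = toExtern "hdf5.dll" parsed
```

Regular expressions like these would likely break on function-pointer parameters, macros and conditional compilation, which is part of why a proper C parser is the more complete (and more laborious) route.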

The enumerations, types and constants are a little trickier because they appear to be much more irregular and so difficult to parse, but are probably less changeable. Any tips on how to do this efficiently would be really helpful.

Robert Nielsen's code is a great place to start.

@degloff

degloff commented Jan 31, 2016

Note the following activity

https://hdfgroup.org/wp/2016/01/hdf5-net-one-step-back-two-steps-forward/

I sent a mail to Gerd Heber from the HDF Group who runs the project. He
gave me access to the repo, quite some interop bindings are already there,
some are missing, like H5D. Eventually this might be the best basis as it
should be lightweight and up to date.

... directly from the HDF5 header files: we use that approach for other
libs, such as LLVM and various NVIDIA libs (for our .NET GPU compiler). Our
experience is that it is quite labor intensive, but of course a complete
solution. Unfortunately we cannot share our C parsers developed in F# as
they are commercial.

I used Robert Nielsen's code as a basis for a slightly refactored version and
I added some generic readers, just for rapid prototyping. I separated the
hdf5 lib and the type provider as I often don't want to have the type
provider but only direct programmatic access to hdf5 files.

I could put the code somewhere in a git repo or share it. Nothing really
special, more a rapid hack to serve my project needs, but eventually
helpful as a start.

Let me know.

D.


@waynetanner

I would definitely argue that the basic HDF5 wrapper library should be usable separate from the type provider. The vast majority of the HDF5 files I've dealt with have known formats and "discovery" isn't really necessary. For use in something like the F# Data Toolbox on the other hand, the type provider would be amazingly useful.

@degloff

degloff commented Jan 31, 2016

Fully agree, totally on the same page.

Also, using a type provider is not always convenient (during development, it
requires restarting VS too often ;-) )


@panesofglass

Has anyone pursued this further? I was thinking of doing the same and looking for prior work on which to build. Someone pointed me to Robert Nielsen's type provider, and from there I found this thread. I'd like to help.

@panesofglass

I wonder whether http://www.hdfql.com/ might be a better path? It seems to offer a much simpler high-level access into the HDF5 API, and they already have a C# wrapper. They've mentioned on Twitter that it could be simple to add an F# wrapper.
