Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring of encode and decode to be zero-copy #57

Closed
wants to merge 19 commits into from
Closed

Refactoring of encode and decode to be zero-copy #57

wants to merge 19 commits into from

Conversation

mattiasdrp
Copy link
Contributor

No description provided.

@samoht
Copy link
Member

samoht commented Apr 23, 2021

We should make sure whatever we do in repr is compatible with the approach explained here: https://github.com/ocaml/RFCs/blob/master/rfcs/modular_io.md

@mattiasdrp
Copy link
Contributor Author

Encode and Decode have been implemented (decode not completely). Right now we need to make sure that these are the types we want and the direction we want to take.

A channel is potentially a stream without a fixed length so the unboxed version may be trickier to implement.

@mattiasdrp mattiasdrp changed the title Please read the rfc.org file Refactoring of encode and decode to be zero-copy May 4, 2021
@samoht
Copy link
Member

samoht commented May 4, 2021

Before going further I think we should see what kind of impact this API will have on its users, eg. index and irmin.

For instance, for both of these use-cases the data to read and write will come through pread and pwrite, which require a byte buffer to be provided. So I think the main question to answer is where that buffer is allocated. I don't think your functor proposal solves this problem (but I might be missing something).

Also, is decode a bottleneck right now? mirage/index#280 seems to only talk about encode so can we focus on this one first? The first step is to identify how we can modify index to stop allocating intermediate buffers and directly write data into the file buffer: if we'd want to do this, what API should repr provide? (e.g. write the user code first, before writing the actual implementation in repr and focus on the index/irmin use-case with a data-driven benchmark approach).

@pascutto
Copy link

pascutto commented May 4, 2021

Also, is decode a bottleneck right now? mirage/index#280 seems to only talk about encode so can we focus on this one first?

Actually that's my mistake, I believe decoding is just as (if not more) important as encoding during merges, as every data key needs to be decoded, but not all of them need to be encoded. I'll mention that back in mirage/index#280.

@craigfe
Copy link
Member

craigfe commented May 5, 2021

To relate a discussion had with @mattiasdrp offline: the approach currently taken here isn't feasible as it forks the universe of typereps according to the channel types. (Repr.t values built for different functor instantiations would be incompatible, when the overall goal is to have a "canonical" type for typereps that can then be consumed uniformly.) This was motivated by wanting to make Custom extensible, but fortunately we don't need to solve this problem for Index / Irmin / Tezos as they use simple types -- we can just raise Failure in these cases for now, as type_random.ml does. (This extensibility issue is something that could be considered as part of a separate change.)

One path forward is to get a minimal working diff for an externally-allocated generic codec in Index (that simply fails in the corner cases), in order to demonstrate viable user code in Index and Irmin. As @samoht suggests, this is the main source of uncertainty (and fortunately these use-cases have simplifying assumptions that we can exploit for a proof of concept).

bench/main.ml Outdated
@@ -1,6 +1,28 @@
open Bechamel
open Toolkit
module T = Repr

module IO = struct
Copy link
Member

@samoht samoht May 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure to understand why we need to abstract over IO here. I dont think it's needed for the irmin/index use-case. What's wrong with something like this:

type 'a encode_bin = 'a -> bytes -> off:int -> len:int -> int
type 'a decode_bin = bytes -> off:int -> len:int -> ('a * int)

(these are tentative signatures - the idea is to match the signatures of pread and pwrite as explained in my previous comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you would get rid of the functor and enforce the type of the output and input channel to Repr's users if I understand correctly?

If someone wanted to write directly in a different output they'd have to allocate a Byte, encode values in it and write its content in their output, right?

I don't dislike this solution because it makes everything simpler from an implementation point of view but it makes it less generic over the IO the user is using (I'd say it's a one-copy solution instead of zero-copy)

Copy link
Member

@samoht samoht May 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you would get rid of the functor and enforce the type of the output and input channel to Repr's users if I understand correctly?

yes, the only users are irmin and index, we don't need a generic solution. That library was called irmin-type a few months ago, it was just pulled out to be shared between index and irmin. It's a non goal to cover all the generic usage of a dynamic type library.

I don't dislike this solution because it makes everything simpler from an implementation point of view but it makes it less generic over the IO the user is using (I'd say it's a one-copy solution instead of zero-copy)

which is already (at least) one copy less than it is today. Let's see how it performs on index/irmin before trying to build the perfect solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm fine with that! Will do immediately :-)

Need to make sure that everything is actually ok
@icristescu
Copy link
Contributor

As discussed offline with @mattiasdrp we should focus on changing encode for now and see how that propagates to irmin/index.

The API changes to decode consists of changing string with bytes (which is not changing much), but there are extra allocations done by decode (for instance https://github.com/mirage/repr/blob/main/src/repr/type_binary.ml#L261) that we can try to remove in a second step (a different PR?) as they are independent of the encode modifications.

This should be cancelled when compiling with flambda but due to the
fact that it is not the default compiler it is better right now to
manually externalise them to avoid extra allocations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants