Refactoring of encode and decode to be zero-copy #57

mattiasdrp · 2021-04-23T12:16:19Z

No description provided.

samoht · 2021-04-23T13:29:42Z

We should make sure whatever we do in repr is compatible with the approach explained here: https://github.com/ocaml/RFCs/blob/master/rfcs/modular_io.md

mattiasdrp · 2021-05-04T14:28:51Z

Encode and Decode have been implemented (decode not completely). Right now we need to make sure that these are the types we want and the direction we want to take.

A channel is potentially a stream without a fixed length so the unboxed version may be trickier to implement.

samoht · 2021-05-04T17:35:17Z

Before going further, I think we should see what kind of impact this API will have on its users, e.g. index and irmin.

For instance, for both of these use-cases the data to read and write will come through pread and pwrite, which require a byte buffer to be provided. So I think the main question to answer is where that buffer is allocated. I don't think your functor proposal solves this problem (but I might be missing something).

Also, is decode a bottleneck right now? mirage/index#280 seems to only talk about encode so can we focus on this one first? The first step is to identify how we can modify index to stop allocating intermediate buffers and write data directly into the file buffer: if we wanted to do this, what API should repr provide? (E.g. write the user code first, before writing the actual implementation in repr, and focus on the index/irmin use-case with a data-driven benchmark approach.)
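To make the "write the user code first" idea concrete, here is a rough sketch of what Index-side code could look like if repr exposed an encoder that writes into a caller-owned buffer. None of this is the actual Index or repr API: `file_buffer`, `create`, `encode_entry` and `add_entry` are hypothetical names, and the entry is simplified to a pair of 32-bit integers.

```ocaml
(* Hypothetical stand-in for the write buffer that Index keeps in front of
   its file; in real code its contents would later be flushed with pwrite. *)
type file_buffer = { buf : bytes; mutable pos : int }

let create size = { buf = Bytes.create size; pos = 0 }

(* Assumed encoder shape: write the value at [off] in a caller-provided
   buffer and return the offset just past the written bytes. *)
let encode_entry (key, value) buf ~off =
  Bytes.set_int32_be buf off key;
  Bytes.set_int32_be buf (off + 4) value;
  off + 8

(* User code: the entry is serialised straight into the file buffer,
   with no intermediate string or bytes allocated per entry. *)
let add_entry fb entry =
  if fb.pos + 8 > Bytes.length fb.buf then failwith "file buffer full";
  fb.pos <- encode_entry entry fb.buf ~off:fb.pos

let () =
  let fb = create 64 in
  add_entry fb (1l, 42l);
  add_entry fb (2l, 43l);
  assert (fb.pos = 16)
```

The point of the sketch is only where the buffer lives: it is allocated once by the caller, and the encoder never allocates one of its own.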

pascutto · 2021-05-04T18:46:35Z

> Also, is decode a bottleneck right now? mirage/index#280 seems to only talk about encode so can we focus on this one first?

Actually that's my mistake: I believe decoding is at least as important as encoding during merges, since every data key needs to be decoded, but not all of them need to be encoded. I'll mention that back in mirage/index#280.

craigfe · 2021-05-05T16:22:14Z

To relay an offline discussion with @mattiasdrp: the approach currently taken here isn't feasible, as it forks the universe of typereps according to the channel types. (Repr.t values built for different functor instantiations would be incompatible, whereas the overall goal is to have a "canonical" type for typereps that can then be consumed uniformly.) The functor was motivated by wanting to make Custom extensible, but fortunately we don't need to solve this problem for Index / Irmin / Tezos, as they use simple types: we can just raise Failure in these cases for now, as type_random.ml does. (This extensibility issue could be considered as part of a separate change.)
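As a rough illustration of "one canonical typerep, fail on Custom", here is a toy sketch. The GADT below is a made-up, much smaller stand-in for repr's real type, and this `encode_bin` is a hypothetical function written for the example, not the library's implementation.

```ocaml
(* A toy, hypothetical typerep: one canonical type, independent of any
   channel or functor argument. This is not repr's real definition. *)
type _ t =
  | Int32 : int32 t
  | Pair : 'a t * 'b t -> ('a * 'b) t
  | Custom : string -> 'a t (* user-defined codec, not supported here *)

(* Encode into a caller-allocated buffer; returns the next free offset. *)
let rec encode_bin : type a. a t -> a -> bytes -> off:int -> int =
 fun ty v buf ~off ->
  match ty with
  | Int32 ->
      Bytes.set_int32_be buf off v;
      off + 4
  | Pair (ta, tb) ->
      let a, b = v in
      let off = encode_bin ta a buf ~off in
      encode_bin tb b buf ~off
  | Custom name ->
      (* Corner case: simply fail for now. *)
      failwith ("encode_bin: custom type " ^ name ^ " not supported")

let () =
  let buf = Bytes.create 8 in
  let next = encode_bin (Pair (Int32, Int32)) (1l, 2l) buf ~off:0 in
  assert (next = 8)
```

Because `t` does not mention any IO type, values built by different users stay compatible; only the unsupported case is rejected at run time.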

One path forward is to get a minimal working diff for an externally-allocated generic codec in Index (that simply fails in the corner cases), in order to demonstrate viable user code in Index and Irmin. As @samoht suggests, this is the main source of uncertainty (and fortunately these use-cases have simplifying assumptions that we can exploit for a proof of concept).

samoht · 2021-05-11T14:23:23Z

bench/main.ml

@@ -1,6 +1,28 @@
 open Bechamel
 open Toolkit
 module T = Repr
+
+module IO = struct


I am not sure I understand why we need to abstract over IO here. I don't think it's needed for the irmin/index use-case. What's wrong with something like this:

type 'a encode_bin = 'a -> bytes -> off:int -> len:int -> int
type 'a decode_bin = bytes -> off:int -> len:int -> ('a * int)

(these are tentative signatures - the idea is to match the signatures of pread and pwrite as explained in my previous comment)
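For illustration, a codec for `int32` written against these tentative signatures might look like the sketch below. The meaning of `len` and of the returned `int` is not spelled out in the comment, so the convention used here (space available from `off`, number of bytes written or consumed) is an assumption, and `encode_int32` / `decode_int32` are hypothetical names.

```ocaml
type 'a encode_bin = 'a -> bytes -> off:int -> len:int -> int
type 'a decode_bin = bytes -> off:int -> len:int -> 'a * int

(* Assumed convention: [len] is the space available from [off], and the
   returned int is the number of bytes written (resp. consumed). *)
let encode_int32 : int32 encode_bin =
 fun v buf ~off ~len ->
  if len < 4 then invalid_arg "encode_int32: buffer too small";
  Bytes.set_int32_be buf off v;
  4

let decode_int32 : int32 decode_bin =
 fun buf ~off ~len ->
  if len < 4 then invalid_arg "decode_int32: not enough bytes";
  (Bytes.get_int32_be buf off, 4)

let () =
  let buf = Bytes.create 16 in
  let written = encode_int32 0xCAFEl buf ~off:8 ~len:8 in
  let v, read = decode_int32 buf ~off:8 ~len:written in
  assert (v = 0xCAFEl && read = written)
```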

mattiasdrp · 2021-05-11

So you would get rid of the functor and enforce the type of the output and input channel to Repr's users if I understand correctly?

If someone wanted to write directly to a different output, they would have to allocate a Bytes buffer, encode values into it, and then write its contents to their output, right?

I don't dislike this solution because it makes everything simpler from an implementation point of view but it makes it less generic over the IO the user is using (I'd say it's a one-copy solution instead of zero-copy)
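A small sketch of what that "one-copy" path could look like for a user with a different sink. The encoder is a hypothetical stand-in with the bytes-based shape under discussion, and `write_all` is a made-up helper; the sink here is a `Buffer.t`, but the same pattern would apply to an `out_channel`.

```ocaml
(* Hypothetical encoder with the bytes-based shape discussed above:
   writes at [off] and returns the number of bytes written. *)
let encode_int32 v buf ~off =
  Bytes.set_int32_be buf off v;
  4

(* A user targeting another sink reuses one scratch buffer and pays
   one copy per value when moving the bytes to the sink. *)
let write_all (out : Buffer.t) (values : int32 list) =
  let scratch = Bytes.create 4 in
  List.iter
    (fun v ->
      let n = encode_int32 v scratch ~off:0 in
      Buffer.add_subbytes out scratch 0 n)
    values
```

The scratch buffer is allocated once per call to `write_all`, not once per value, so the extra cost is the copy into the sink rather than an allocation.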

samoht · 2021-05-11

> So you would get rid of the functor and enforce the type of the output and input channel to Repr's users if I understand correctly?

Yes, the only users are irmin and index, so we don't need a generic solution. This library was called irmin-type a few months ago; it was only pulled out to be shared between index and irmin. Covering every generic use of a dynamic type library is a non-goal.

> I don't dislike this solution because it makes everything simpler from an implementation point of view but it makes it less generic over the IO the user is using (I'd say it's a one-copy solution instead of zero-copy)

That is already (at least) one copy fewer than today. Let's see how it performs on index/irmin before trying to build the perfect solution.

mattiasdrp · 2021-05-11

Ok, I'm fine with that! Will do immediately :-)


icristescu · 2021-05-14T12:50:43Z

As discussed offline with @mattiasdrp we should focus on changing encode for now and see how that propagates to irmin/index.

The API changes to decode consist of replacing string with bytes (which does not change much), but decode also performs extra allocations (for instance https://github.com/mirage/repr/blob/main/src/repr/type_binary.ml#L261) that we can try to remove in a second step (a different PR?), as they are independent of the encode modifications.


bench/dune

src/repr/type.ml

src/repr/type_binary.ml

test/repr/main.ml

samoht mentioned this pull request Apr 28, 2021

encode_bin instead of encode mirage/index#287

Closed

mattiasdrp changed the title ~~Please read the rfc.org file~~ Refactoring of encode and decode to be zero-copy May 4, 2021

mattiasdrp added 11 commits May 10, 2021 16:54

please read the rfc.org file

524dcb3

First draft for encode

aafe4e2

Update rfc.org

e7cc061

Inverted parameters to fit the original version

6302085

(incomplete) Decoding

8101448

Unfinished implementation with typing problem

26d5156

Include file

0228d90

Simplification

63fafca

Work remaining on bench

4043eda

bring back encoding in custom and other mandatory functions

cb2841c

Problem with boxed strings that use custom encode_bin

58449f5

mattiasdrp mentioned this pull request May 11, 2021

Add support for extensible type attributes #60

Merged

samoht reviewed May 11, 2021


Tests and benchs seem ok

90188e7

Need to make sure that everything is actually ok

mattiasdrp added 3 commits May 14, 2021 15:20

Reverted decode changes

d90ea48

Cleaning

a8f9e3c

Local functions that capture external values are bad for performance

cf9716d

This should be cancelled when compiling with flambda but due to the fact that it is not the default compiler it is better right now to manually externalise them to avoid extra allocations
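A minimal sketch of the kind of transformation this commit message describes, not the actual diff: a local helper that captures values from its enclosing function is lifted to the top level and given those values as arguments. The function names are hypothetical.

```ocaml
(* Before: [add] is a local function that captures [buf] and [off], so a
   closure may be allocated on every call unless the compiler inlines it. *)
let encode_pair_local (a, b) buf off =
  let add i v = Bytes.set_int32_be buf (off + (4 * i)) v in
  add 0 a;
  add 1 b;
  off + 8

(* After: the helper is lifted to the top level and receives everything it
   needs as arguments, so there is no environment to capture. *)
let add buf off i v = Bytes.set_int32_be buf (off + (4 * i)) v

let encode_pair_lifted (a, b) buf off =
  add buf off 0 a;
  add buf off 1 b;
  off + 8
```

Both versions write the same bytes; the difference is only in whether a closure has to be built at run time.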

icristescu reviewed May 17, 2021


mattiasdrp added 4 commits May 17, 2021 10:59

Changes after @icristescu review

d6680bd

Fix

1bfca61

.opam > dune-project

33a6788

.opam updated

489fe98

mattiasdrp mentioned this pull request May 18, 2021

Modify Irmin to work with the refactoring of Repr's encoding mirage/irmin#1437

Closed

craigfe mentioned this pull request Jun 14, 2021

Extract primitive binary operations from generic derivation #68

Merged

icristescu closed this Mar 2, 2022
