Datashape scalability #13

Open
wesm opened this issue Mar 16, 2017 · 26 comments

@wesm

wesm commented Mar 16, 2017

I have a high-level design question about using text as the serialized representation of array metadata. In my opinion, it is not the best choice for a primary representation. Let me explain why.

A problem the big data community has run into is "large metadata": storing very wide or very complex tables in file formats like Parquet has led to performance problems, because manipulating the metadata itself (serialized as a Thrift message) is unwieldy -- parsing and serializing is expensive, and small projections that select only a few fields out of a large data set still incur the cost of processing the entire metadata.

In Apache Arrow, we have a more restricted type system than datashape, but we decided to use a "zero copy" or "no parse" technology, Flatbuffers (from Google), to represent metadata in a portable way. We chose Flatbuffers over alternatives (like Cap'n Proto, created by the primary author of Google's Protocol Buffers) because of community traction and cross-language support (it has first-class support for C++, C#, C, Go, Java, JS, PHP, and Python, with Rust and other languages in the works).

Using a technology like Flatbuffers for the serialized representation enables O(1) inspection of schemas, without any copying or parsing overhead, regardless of how big the schema is. Have you thought about using something like this instead?

@skrah
Member

skrah commented Mar 17, 2017

Thanks for the ideas! I would be curious about the performance aspect before addressing the other points. FlatBuffers has a benchmark that uses a very small struct:

https://google.github.io/flatbuffers/flatbuffers_benchmarks.html

libndtypes2 has a benchmark that uses a huge struct (make bench). The file is:

https://github.com/blaze/libndtypes2/blob/master/tools/bench.c

Parsing and deallocating the huge struct 1 million times takes 7.2s on my slowish T400 machine.
Using the small struct from the FlatBuffers benchmark in tools/bench.c, 1 million repetitions take 2.5s.

Unfortunately the FlatBuffers benchmarks don't build here, so I'm using their self-reported times (likely measured on a faster machine than the T400):

Encode (also 1 million times): 3.2s.
Dealloc (also 1 million times): 0.08s.

So on the surface creating and deallocating the types seems quite fast in libndtypes2.

@skrah
Member

skrah commented Mar 17, 2017

Correction, the benchmark used the old syntax and was broken (should be fixed now in master). The real numbers are:

Huge struct, 1 million times: 112s
Small struct, 1 million times: 5.4s

So it is 5.4s vs. 3.2s (the latter self-reported -- unknown machine, unknown exact benchmark).

@wesm
Author

wesm commented Mar 17, 2017

So these numbers are "time to generate a serialized string from an in-memory schema"? The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements -- the issues I'm describing would show up in use cases like "read the metadata for a given subset of 100 particular fields".

Memory efficiency is also a concern -- one of the benefits of Flatbuffers is that you can select what you want from e.g. a memory-mapped file without any unnecessary copying.

I can collaborate with you on some benchmarks for these use cases sometime in the next week or so if you'd like.
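For concreteness, here is a minimal sketch of the access pattern described above, using pyarrow's schema API purely as an illustration (the field names and the chosen subset are hypothetical):

import pyarrow as pa

# A very wide schema, standing in for the "large metadata" case.
schema = pa.schema([pa.field("x%d" % n, pa.int64()) for n in range(100000)])

# The projection use case: look up a small subset of fields by name.
# Ideally this should not require processing the other ~100,000 entries.
wanted = ["x10", "x5000", "x99999"]
subset = [schema.field(name) for name in wanted]
print(subset)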

@wesm
Author

wesm commented Mar 17, 2017

Sorry, reading this again:

So it is 5.4 vs 3.2

These numbers aren't comparable because the tasks aren't the same. To do a proper comparison we will have to write Flatbuffers IDL that encodes identical data to what's being parsed in libndtypes2. I can help with this.

@teoliphant
Member

Thanks for the reference and connection. I want to make sure we are on the same page: datashape is only about describing data in memory or on disk, not a serialization strategy.

Are you suggesting that we use Flatbuffers to store the datashape metadata rather than the current datashape string? A standard for encoding datashape metadata that avoids the parsing step entirely sounds great.

@wesm
Author

wesm commented Mar 23, 2017

That's right. When you use zero-copy tools up and down the stack, the on disk representation and the in-memory one are the same. So it's suboptimal for metadata to take longer to "load" for larger schemas when general zero-copy "serialization" tools like this exist.

@teoliphant
Member

Great suggestion! I just wanted to make sure I understood and Stefan understood.

@skrah
Member

skrah commented Mar 23, 2017

I'm still trying to figure this out -- think about this potential datashape use case, e.g. typing a Numba gufunc. This is the existing syntax without datashape:

import numpy as np
from numba import guvectorize, float64

@guvectorize([(float64[:,:], float64[:,:], float64[:,:])], '(m,n),(n,p)->(m,p)')
def numba_matmul(x, y, res):
    col = np.empty(y.shape[0], dtype=np.float64)
    for j in range(y.shape[1]):
        # copy column j of y into a contiguous buffer
        for k in range(y.shape[0]):
            col[k] = y[k, j]
        for i in range(x.shape[0]):
            # dot product of row i of x with column j of y
            s = 0.0
            for k in range(x.shape[1]):
                s += x[i, k] * col[k]
            res[i, j] = s

With datashape it could look like this:

@guvectorize('(M * N * float64, N * P * float64) -> M * P * float64')
def numba_matmul(x, y, res):
    ...

The reasonable thing to do here would be to parse the datashape string once and keep the datashape tree alive in memory. The tree can be walked directly to find elements, guide gufuncs to the proper location and so forth.
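A rough sketch of that "parse once, keep the tree alive" idea, assuming the ndtypes Python bindings that appear later in this thread (the caching helper here is hypothetical):

from functools import lru_cache
from ndtypes import ndt

@lru_cache(maxsize=None)
def parse_signature(sig):
    # Parse the datashape string once; repeated calls with the same
    # signature reuse the in-memory type tree.
    return ndt(sig)

# Assuming the Python constructor accepts function signatures like the C parser does:
t = parse_signature("(M * N * float64, N * P * float64) -> M * P * float64")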

Can flatbuffers store ndarrays at all? How would the symbolic dimensions used above be represented?

@skrah
Member

skrah commented Mar 23, 2017

The easiest way to see how complex the trees get is to print the AST:

make print_ast
echo "(M * N * float64, N * P * float64) -> M * P * float64" | ./print_ast -

@wesm
Author

wesm commented Mar 23, 2017

The reasonable thing to do here would be to parse the datashape string once and keep the datashape tree alive in memory. The tree can be walked directly to find elements, guide gufuncs to the proper location and so forth.

I think you're arguing based on a misconception. If you want to convert the Flatbuffer to some other data structure (like the C structs you have defined in this project), you can certainly do that. But it is a zero-copy data structure that you can open and examine without any parsing overhead.

Want to work on a prototype together?

@wesm
Author

wesm commented Mar 23, 2017

Here's an example type that can store ndarray metadata:

table FixedDimension {
  size: long;
}

table SymbolicDimension {
  name: string;
}

union Dimension {
  FixedDimension,
  SymbolicDimension
}

table Tensor {
  value_type: Type;
  shape: [Dimension];
}

(I'm not certain whether you can have an array of union values, you may need to wrap them in a non-union container)

@skrah
Member

skrah commented Mar 24, 2017

Thanks for the example! So the suggestion is to replace all data structures defined in

https://github.com/blaze/libndtypes2/blob/master/ndtypes.h

with the flatbuffer schema and use flatbuffers for everything, including type tree traversal for gufuncs and pattern matching of types (this needs to be fast, surely there's some overhead in flatbuffers)?

@wesm
Author

wesm commented Mar 24, 2017

It's definitely a better primary serialized representation than a string that you have to parse. It also gives you the ability to read the metadata in all of the languages that Flatbuffers supports. You could use another tool like Protocol Buffers or Thrift for the same purpose, but as discussed the parsing overhead will grow burdensome with large schemas.

As far as performance, we should write some microbenchmarks to understand how many nanoseconds certain operations take. Do you have any code examples for what you're describing?

In C++ at least, the generated bindings are header only so everything gets inlined
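As a starting point, a minimal sketch of such a microbenchmark on the ndtypes side, assuming the Python bindings used later in this thread (the struct size is arbitrary):

import timeit
from ndtypes import ndt

# A wide struct type, standing in for a large schema.
datashape = "{%s}" % ", ".join("x%d: int64" % n for n in range(10000))

# Time how long it takes to parse the string into an in-memory type tree.
seconds = timeit.timeit(lambda: ndt(datashape), number=100)
print("parse x100: %.3f s" % seconds)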

@teoliphant
Member

Hey Stefan. If I understand correctly, Flatbuffers would not replace any code in ndtypes.h for now, nor would it replace the datashape strings we are using in numba or gumath. It could, however, be used as a representation of datashape that is passed between processes.

This looks like a very useful optimization, but I would not place it on the current critical path for you. A PR would be welcome, however.

@datnamer

Can't the type pattern matching be inlined by numba?

@skrah
Member

skrah commented Apr 5, 2018

@wesm "The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements."

I've implemented no-parse serializing/deserializing, here's a benchmark:

https://github.com/plures/ndtypes/blob/master/python/bench.py

As expected, the largest speedup is for a ragged array with 10000000 offsets. The speedup for a large struct is nice, but not spectacular:

Parse 10_000_000 var offsets:
4.600749731063843

Deserialize 10_000_000 var offsets:
0.029728412628173828

Parse large type (100_000 repetitions):
15.476191997528076

Deserialize large type (100_000 repetitions):
3.6464648246765137

I'd be interested in the Arrow numbers for a type with 10000000 offsets, but I cannot figure out how to serialize the offsets.

@skrah
Member

skrah commented Apr 5, 2018

I managed to pickle Arrow types, and here are the numbers (ndtypes master, pyarrow-0.9.0 wheel from pypi). I'm not sure if this is a fair comparison, but ndtypes does quite well.

from ndtypes import ndt
import pyarrow as pa
import pickle
import time


s = "{%s}" % ", ".join("x%s: int64" % n for n in range(100000))
t = ndt(s)

print("Ndtypes: pickle/unpickle huge struct:\n")
start = time.time()
s = pickle.dumps(t)
u = pickle.loads(s)
end = time.time()
print(end-start)
assert u == t



fields = [pa.field('x%s' % n, pa.int64()) for n in range(100000)]
t = pa.struct(fields)

print("\nArrow: pickle/unpickle huge struct:\n")
start = time.time()
s = pickle.dumps(t)
u = pickle.loads(s)
end = time.time()
print(end-start)
assert u == t

Output:

Ndtypes: pickle/unpickle huge struct:

0.07116484642028809

Arrow: pickle/unpickle huge struct:

2.2060387134552

@wesm
Author

wesm commented Apr 5, 2018

I don't think that's a useful comparison. It would be more comparable to measure against a Flatbuffers representation

@skrah
Member

skrah commented Apr 5, 2018

Why? I'm pickling/unpickling an in-memory ndt_t and an in-memory <class 'pyarrow.lib.StructType'>.

At least it looks that way to me.

@wesm
Author

wesm commented Apr 5, 2018

Pickling is not a primary serialization path; it is merely a convenience for Python users to be able to pickle the pyarrow container types.
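For reference, a minimal sketch of the Flatbuffers-based path through pyarrow's IPC schema message, assuming a recent pyarrow (the exact API may differ from the 0.9.0 wheel used above):

import pyarrow as pa

# Build the same wide struct as a schema of 100,000 int64 fields.
schema = pa.schema([pa.field("x%d" % n, pa.int64()) for n in range(100000)])

# Serialize to the Arrow IPC schema message, which is a Flatbuffer internally.
buf = schema.serialize()

# Reading opens the Flatbuffer in place rather than parsing field by field.
schema2 = pa.ipc.read_schema(buf)
assert schema.equals(schema2)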

@skrah
Member

skrah commented Apr 5, 2018

I'm trying to answer your question: "The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements."

I accept that this may not be a useful comparison, but could you please post the fast way to serialize/deserialize, so I can see whether this issue has been addressed or whether we still need to catch up with Flatbuffers?

@wesm
Author

wesm commented Apr 5, 2018

I'm off of open source this month -- I can possibly take a look next month

@skrah
Member

skrah commented Apr 5, 2018

Ah sorry, let's postpone this then!

@wesm
Author

wesm commented Apr 5, 2018

Just to go back to the original point: I was not trying to compare Apache Arrow with libndtypes. I was pointing out that you invented a structured data serialization scheme for the ndtypes metadata, and there are already libraries (Flatbuffers, Protocol Buffers, Apache Thrift) designed by companies like Facebook and Google that solve that problem with certain benefits like forward compatibility / schema evolution, and possibly also better performance (though this latter point will have to be analyzed further).

@teoliphant
Member

teoliphant commented Apr 5, 2018

Thanks for clarifying, Wes. How to leverage Flatbuffers remains a valid point that we should consider. I appreciate your pointing out the similarities you see. With so much happening in open source today, it is easy to miss out on projects that would help achieve the ultimate goal.

Flatbuffers certainly has nice features like versioning, which libndtypes doesn't really need yet. One question we will always have is whether it is really easier to depend on Flatbuffers, or simply to use libndtypes directly, when the goal is to speed up passing datashape between libraries that use it to describe typed containers.

I might be missing something, but I don't yet see the benefit of introducing a dependency on Flatbuffers.

@skrah
Member

skrah commented Apr 6, 2018

One non-intrusive way of supporting flatbuffers would be to convert libndtypes/serialize/serialize.c and libndtypes/serialize/deserialize.c to fbserialize.cpp and fbdeserialize.cpp and add --with-flatbuffers to ./configure.

So the C++ dependency would be optional.

Depending on the design of flatbuffers, it is hopefully as straightforward as converting write_int64 etc. to fbwrite_int64 and so on.
