Datashape scalability #13

Open
wesm opened this issue Mar 16, 2017 · 26 comments

@wesm

wesm commented Mar 16, 2017

I have a high-level design question about using text as the serialized representation of array metadata. In my opinion, it is not the best choice for a primary representation. Let me explain why.

A problem the big data community has run into is "large metadata": storing very wide or very complex tables in file formats like Parquet has led to performance problems, because manipulating the metadata itself (serialized as a Thrift message) is unwieldy -- parsing and serializing is expensive, and small projections that select only a few fields out of a large data set still incur the cost of processing the entire metadata.

In Apache Arrow, we have a more restricted type system than datashape, but we decided to use a "zero copy" or "no parse" technology, Flatbuffers (from Google), to represent metadata in a portable way. We chose Flatbuffers over alternatives (like Cap'n Proto, created by the primary author of Google's Protocol Buffers) because of community traction and cross-language support (it has first-class support for C++, C#, C, Go, Java, JS, PHP, and Python, with Rust and other languages in the works).

Using a technology like Flatbuffers for the serialized representation enables O(1) inspection of schemas, without any copying or parsing overhead, regardless of how big the schema is. Have you thought about using something like this instead?

@skrah
Member

skrah commented Mar 17, 2017

Thanks for the ideas! I would be curious about the performance aspect before addressing the other points. FlatBuffers has a benchmark that uses a very small struct:

https://google.github.io/flatbuffers/flatbuffers_benchmarks.html

libndtypes2 has a benchmark that uses a huge struct (make bench). The file is:

https://github.com/blaze/libndtypes2/blob/master/tools/bench.c

Parsing and deallocating the huge struct 1 million times takes 7.2s on my slowish T400 machine.
Using the small struct from the FlatBuffers benchmark in tools/bench.c, 1 million repetitions take 2.5s.

Unfortunately the FlatBuffers benchmarks don't build here, so I'm using their self-reported times (likely measured on a faster machine than the T400):

Encode (also 1 million times): 3.2s.
Dealloc (also 1 million times): 0.08s.

So on the surface creating and deallocating the types seems quite fast in libndtypes2.

@skrah
Member

skrah commented Mar 17, 2017

Correction, the benchmark used the old syntax and was broken (should be fixed now in master). The real numbers are:

Huge struct, 1 million times: 112s
Small struct, 1 million times: 5.4s

So it is 5.4s vs. 3.2s (the latter self-reported -- unknown machine, unknown exact benchmark).

@wesm
Author

wesm commented Mar 17, 2017

So these numbers are "time to generate a serialized string from an in-memory schema"? The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements -- the issues I'm describing would show up in use cases like "read the metadata for a given subset of 100 particular fields".

Memory efficiency is also a concern -- one of the benefits of Flatbuffers is that you can select what you want from e.g. a memory-mapped file without any unnecessary copying.

I can collaborate with you on some benchmarks for these use cases sometime in the next week or so if you'd like.
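For concreteness, here is a minimal sketch of the access pattern described above, using pyarrow's schema API purely as an illustration (the field names and the chosen subset are hypothetical):

import pyarrow as pa

# A very wide schema, standing in for the "large metadata" case.
schema = pa.schema([pa.field("x%d" % n, pa.int64()) for n in range(100000)])

# The projection use case: look up a small subset of fields by name.
# Ideally this should not require processing the other ~100,000 entries.
wanted = ["x10", "x5000", "x99999"]
subset = [schema.field(name) for name in wanted]
print(subset)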

@wesm
Author

wesm commented Mar 17, 2017

Sorry, reading this again:

So it is 5.4 vs 3.2

These numbers aren't comparable because the tasks aren't the same. To do a proper comparison we will have to write Flatbuffers IDL that encodes identical data to what's being parsed in libndtypes2. I can help with this.

@teoliphant
Member

Thanks for the reference and connection. I want to make sure we are on the same page: datashape is only about describing data in memory or on disk, not a serialization strategy.

Are you suggesting that we use Flatbuffers to store the datashape metadata rather than the current datashape string? A standard for encoding datashape metadata that avoids the parsing step entirely sounds great.

@wesm
Author

wesm commented Mar 23, 2017

That's right. When you use zero-copy tools up and down the stack, the on disk representation and the in-memory one are the same. So it's suboptimal for metadata to take longer to "load" for larger schemas when general zero-copy "serialization" tools like this exist.

@teoliphant
Member

Great suggestion! I just wanted to make sure I understood and Stefan understood.

@skrah
Member

skrah commented Mar 23, 2017

I'm still trying to figure this out -- think about this potential datashape use case, e.g. typing a Numba gufunc. This is the existing syntax without datashape:

import numpy as np
from numba import guvectorize, float64

@guvectorize([(float64[:,:], float64[:,:], float64[:,:])], '(m,n),(n,p)->(m,p)')
def numba_matmul(x, y, res):
    col = np.empty(y.shape[0], dtype=np.float64)
    for j in range(y.shape[1]):
        # copy column j of y into a contiguous buffer
        for k in range(y.shape[0]):
            col[k] = y[k, j]
        for i in range(x.shape[0]):
            # dot product of row i of x with column j of y
            s = 0.0
            for k in range(x.shape[1]):
                s += x[i, k] * col[k]
            res[i, j] = s

With datashape it could look like this:

@guvectorize('(M * N * float64, N * P * float64) -> M * P * float64')
def numba_matmul(x, y, res):
    ...

The reasonable thing to do here would be to parse the datashape string once and keep the datashape tree alive in memory. The tree can be walked directly to find elements, guide gufuncs to the proper location and so forth.
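A rough sketch of that "parse once, keep the tree alive" idea, assuming the ndtypes Python bindings that appear later in this thread (the caching helper here is hypothetical):

from functools import lru_cache
from ndtypes import ndt

@lru_cache(maxsize=None)
def parse_signature(sig):
    # Parse the datashape string once; repeated calls with the same
    # signature reuse the in-memory type tree.
    return ndt(sig)

# Assuming the Python constructor accepts function signatures like the C parser does:
t = parse_signature("(M * N * float64, N * P * float64) -> M * P * float64")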

Can flatbuffers store ndarrays at all? How would the symbolic dimensions used above be represented?

@skrah
Member

skrah commented Mar 23, 2017

The easiest way to see how complex the trees get is to print the AST:

make print_ast
echo "(M * N * float64, N * P * float64) -> M * P * float64" | ./print_ast -

@wesm
Author

wesm commented Mar 23, 2017

The reasonable thing to do here would be to parse the datashape string once and keep the datashape tree alive in memory. The tree can be walked directly to find elements, guide gufuncs to the proper location and so forth.

I think you're arguing based on a misconception. If you want to convert the Flatbuffer to some other data structure (like the C structs you have defined in this project), you can certainly do that. But it is a zero-copy data structure that you can open and examine without any parsing overhead.

Want to work on a prototype together?

@wesm
Author

wesm commented Mar 23, 2017

Here's an example type that can store ndarray metadata:

table FixedDimension {
  size: long;
}

table SymbolicDimension {
  name: string;
}

union Dimension {
  FixedDimension,
  SymbolicDimension
}

table Tensor {
  value_type: Type;
  shape: [Dimension];
}

(I'm not certain whether you can have an array of union values, you may need to wrap them in a non-union container)

@skrah
Member

skrah commented Mar 24, 2017

Thanks for the example! So the suggestion is to replace all data structures defined in

https://github.com/blaze/libndtypes2/blob/master/ndtypes.h

with the flatbuffer schema and use flatbuffers for everything, including type tree traversal for gufuncs and pattern matching of types (this needs to be fast, surely there's some overhead in flatbuffers)?

@wesm
Author

wesm commented Mar 24, 2017

It's definitely a better primary serialized representation than a string that you have to parse. It also gives you the ability to read the metadata in all of the languages that Flatbuffers supports. You could use another tool like Protocol Buffers or Thrift for the same purpose, but as discussed the parsing overhead will grow burdensome with large schemas.

As far as performance, we should write some microbenchmarks to understand how many nanoseconds certain operations take. Do you have any code examples for what you're describing?

In C++ at least, the generated bindings are header only so everything gets inlined
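As a starting point, a minimal sketch of such a microbenchmark on the ndtypes side, assuming the Python bindings used later in this thread (the struct size is arbitrary):

import timeit
from ndtypes import ndt

# A wide struct type, standing in for a large schema.
datashape = "{%s}" % ", ".join("x%d: int64" % n for n in range(10000))

# Time how long it takes to parse the string into an in-memory type tree.
seconds = timeit.timeit(lambda: ndt(datashape), number=100)
print("parse x100: %.3f s" % seconds)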

@teoliphant
Member

Hey Stefan. If I understand correctly, Flatbuffers would not replace any code in ndtypes.h for now, nor would it replace the datashape strings we are using in numba or gumath. It could, however, be used as a representation of datashape that is passed between processes.

This looks like a very useful optimization, but I would not place it on the current critical path for you. A PR would be welcome, however.

@datnamer

Can't the type pattern matching be inlined by numba?

@skrah
Member

skrah commented Apr 5, 2018

@wesm "The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements."

I've implemented no-parse serializing/deserializing, here's a benchmark:

https://github.com/plures/ndtypes/blob/master/python/bench.py

As expected, the largest speedup is for a ragged array with 10000000 offsets. The speedup for a large struct is nice, but not spectacular:

Parse 10_000_000 var offsets:
4.600749731063843

Deserialize 10_000_000 var offsets:
0.029728412628173828

Parse large type (100_000 repetitions):
15.476191997528076

Deserialize large type (100_000 repetitions):
3.6464648246765137

I'd be interested in the Arrow numbers for a type with 10000000 offsets, but I cannot figure out how to serialize the offsets.

@skrah
Member

skrah commented Apr 5, 2018

I managed to pickle Arrow types, and here are the numbers (ndtypes master, pyarrow-0.9.0 wheel from pypi). I'm not sure if this is a fair comparison, but ndtypes does quite well.

from ndtypes import ndt
import pyarrow as pa
import pickle
import time


s = "{%s}" % ", ".join("x%s: int64" % n for n in range(100000))
t = ndt(s)

print("Ndtypes: pickle/unpickle huge struct:\n")
start = time.time()
s = pickle.dumps(t)
u = pickle.loads(s)
end = time.time()
print(end-start)
assert u == t



fields = [pa.field('x%s' % n, pa.int64()) for n in range(100000)]
t = pa.struct(fields)

print("\nArrow: pickle/unpickle huge struct:\n")
start = time.time()
s = pickle.dumps(t)
u = pickle.loads(s)
end = time.time()
print(end-start)
assert u == t

Output:

Ndtypes: pickle/unpickle huge struct:

0.07116484642028809

Arrow: pickle/unpickle huge struct:

2.2060387134552

@wesm
Author

wesm commented Apr 5, 2018

I don't think that's a useful comparison. It would be more comparable to measure against a Flatbuffers representation

@skrah
Member

skrah commented Apr 5, 2018

Why? I'm pickling/unpickling an in-memory ndt_t and an in-memory <class 'pyarrow.lib.StructType'>.

At least it looks that way to me.

@wesm
Author

wesm commented Apr 5, 2018

Pickling is not a primary serialization path; it is merely a convenience for Python users to be able to pickle the pyarrow container types.
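For reference, a minimal sketch of the Flatbuffers-based path through pyarrow's IPC schema message, assuming a recent pyarrow (the exact API may differ from the 0.9.0 wheel used above):

import pyarrow as pa

# Build the same wide struct as a schema of 100,000 int64 fields.
schema = pa.schema([pa.field("x%d" % n, pa.int64()) for n in range(100000)])

# Serialize to the Arrow IPC schema message, which is a Flatbuffer internally.
buf = schema.serialize()

# Reading opens the Flatbuffer in place rather than parsing field by field.
schema2 = pa.ipc.read_schema(buf)
assert schema.equals(schema2)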

@skrah
Member

skrah commented Apr 5, 2018

I'm trying to answer your question: "The numbers I'm interested in are times to parse (deserialize) and emit (serialize) large schemas, with 1 million or more elements."

I accept that this may not be a useful comparison, but could you please post the fast way to serialize/deserialize, so I can see whether this issue has been addressed or whether we still need to catch up with Flatbuffers?

@wesm
Author

wesm commented Apr 5, 2018

I'm off of open source this month -- I can possibly take a look next month

@skrah
Member

skrah commented Apr 5, 2018

Ah sorry, let's postpone this then!

@wesm
Author

wesm commented Apr 5, 2018

Just to go back to the original point: I was not trying to compare Apache Arrow with libndtypes. I was pointing out that you invented a structured data serialization scheme for the ndtypes metadata, and there are already libraries (Flatbuffers, Protocol Buffers, Apache Thrift) designed by companies like Facebook and Google that solve that problem with certain benefits like forward compatibility / schema evolution, and possibly also better performance (though this latter point will have to be analyzed further).

@teoliphant
Member

teoliphant commented Apr 5, 2018

Thanks for clarifying, Wes. How to leverage Flatbuffers remains a valid point that we should consider. I appreciate your pointing out the similarities you see. With so much happening in open source today, it is easy to miss out on projects that would help achieve the ultimate goal.

Flatbuffers certainly has nice features like versioning, which libndtypes doesn't really need yet. One question we will always have is whether it is really easier to depend on Flatbuffers, or simply to use libndtypes directly, when the goal is to speed up passing datashape between libraries that use it to describe typed containers.

I might be missing something, but I don't yet see the benefit of introducing a dependency on Flatbuffers.

@skrah
Member

skrah commented Apr 6, 2018

One non-intrusive way of supporting flatbuffers would be to convert libndtypes/serialize/serialize.c and libndtypes/serialize/deserialize.c to fbserialize.cpp and fbdeserialize.cpp and add --with-flatbuffers to ./configure.

So the C++ dependency would be optional.

Depending on the design of flatbuffers, it is hopefully as straightforward as converting write_int64 etc. to fbwrite_int64 and so on.
