
[Bug][Python] Sparse array read with result_order is slow #2687

Open
bkmartinjr opened this issue Jun 6, 2024 · 1 comment
@bkmartinjr

SOMASparseNdArray.read with result_order="row-major" is unexpectedly slow: it takes roughly 2x as long as calling read() without a result order and then sorting the result with PyArrow's sort_by method.

I would naively expect the TileDB-SOMA implementation to be faster, since it is multi-threaded and the Arrow sort is single-threaded, or at worst similar in speed.

Example, running on an EC2 instance in the same region as the S3 bucket:

  • the first two (12 and 13) are an unsorted read, followed by an Arrow Table sort - approx 2:40 wall time
  • the latter two (14 and 15) are read(result_order='row-major') - approx 5:00 to 5:50 wall time

In [10]: import tiledbsoma as soma

In [11]: E = soma.open("s3://tiledb-bruce/tmp_data/soma/ef220f25-dc26-40d9-98de-7e137d2e1803", context=soma.SOMATileDBContext(tiledb_config={'vfs.s3.region':'us-west-2'}))

In [12]: %time E.ms["RNA"].X["data"].read().tables().concat().sort_by([('soma_dim_0','ascending'),('soma_dim_1','ascending')])
CPU times: user 1min 27s, sys: 1min 40s, total: 3min 8s
Wall time: 2min 41s
Out[12]: 
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,109451,109451,109451,109451,109451]]
soma_dim_1: [[2,3,4,8,9,...,59229,59230,59231,59232,59234]]
soma_data: [[7,1,2,30,3,...,4,1,7,1,1]]

In [13]: %time E.ms["RNA"].X["data"].read().tables().concat().sort_by([('soma_dim_0','ascending'),('soma_dim_1','ascending')])
CPU times: user 1min 26s, sys: 1min 34s, total: 3min 1s
Wall time: 2min 37s
Out[13]: 
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,109451,109451,109451,109451,109451]]
soma_dim_1: [[2,3,4,8,9,...,59229,59230,59231,59232,59234]]
soma_data: [[7,1,2,30,3,...,4,1,7,1,1]]

In [14]: %time E.ms["RNA"].X["data"].read(result_order='row-major').tables().concat()
CPU times: user 9min 53s, sys: 9min 14s, total: 19min 7s
Wall time: 5min 47s
Out[14]: 
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,255,255,255,255,255],[256,256,256,256,256,...,511,511,511,511,511],...,[109312,109312,109312,109312,109312,...,109451,109451,109451,109451,109451],[]]
soma_dim_1: [[2,3,4,8,9,...,59230,59231,59232,59233,59234],[3,6,8,10,11,...,59229,59230,59231,59233,59234],...,[34,47,52,86,91,...,59229,59230,59231,59232,59234],[]]
soma_data: [[7,1,2,30,3,...,11,24,1,2,1],[1,10,9,1,3,...,14,16,22,8,3],...,[2,3,4,1,1,...,4,1,7,1,1],[]]

In [15]: %time E.ms["RNA"].X["data"].read(result_order='row-major').tables().concat()
CPU times: user 9min 59s, sys: 8min 42s, total: 18min 41s
Wall time: 5min 4s
Out[15]: 
pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[0,0,0,0,0,...,255,255,255,255,255],[256,256,256,256,256,...,511,511,511,511,511],...,[109312,109312,109312,109312,109312,...,109451,109451,109451,109451,109451],[]]
soma_dim_1: [[2,3,4,8,9,...,59230,59231,59232,59233,59234],[3,6,8,10,11,...,59229,59230,59231,59233,59234],...,[34,47,52,86,91,...,59229,59230,59231,59232,59234],[]]
soma_data: [[7,1,2,30,3,...,11,24,1,2,1],[1,10,9,1,3,...,14,16,22,8,3],...,[2,3,4,1,1,...,4,1,7,1,1],[]]
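
Judging by the truncated outputs above, the two approaches appear to return the same cells in the same order and differ only in chunking (one chunk after sort_by, many chunks from the ordered read). A minimal sketch of how that equivalence could be checked on small data, in plain Python with no TileDB or Arrow; the helper names (flatten, is_row_major, sort_row_major) and the toy values are hypothetical, not part of tiledbsoma or PyArrow:

```python
from itertools import chain


def flatten(chunks):
    """Concatenate a list of chunks (lists) into one flat list."""
    return list(chain.from_iterable(chunks))


def is_row_major(dim0, dim1):
    """Return True if the (dim0, dim1) pairs are in non-decreasing
    lexicographic order, i.e. sorted by dim0 and then by dim1."""
    coords = list(zip(dim0, dim1))
    return all(a <= b for a, b in zip(coords, coords[1:]))


def sort_row_major(dim0, dim1, data):
    """Sort COO triples by (dim0, dim1), mirroring the two-key
    ascending sort applied to the unsorted read in the issue."""
    triples = sorted(zip(dim0, dim1, data), key=lambda t: (t[0], t[1]))
    if not triples:
        return [], [], []
    d0, d1, values = zip(*triples)
    return list(d0), list(d1), list(values)


# Toy stand-in for an unsorted read that arrives in two chunks.
unsorted_dim0 = flatten([[1, 0], [0, 1]])
unsorted_dim1 = flatten([[2, 3], [1, 0]])
unsorted_data = flatten([[5.0, 7.0], [1.0, 2.0]])

sorted_dim0, sorted_dim1, sorted_data = sort_row_major(
    unsorted_dim0, unsorted_dim1, unsorted_data
)
print(is_row_major(unsorted_dim0, unsorted_dim1))
print(is_row_major(sorted_dim0, sorted_dim1))
```

On real results, the same comparison would be made column by column after flattening the chunks of each table; this sketch only illustrates the ordering being compared, not the cause of the slowdown.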

Versions (please complete the following information):

tiledbsoma.__version__              1.11.4
TileDB-Py version                   0.29.0
TileDB core version (tiledb)        2.23.0
TileDB core version (libtiledbsoma) 2.23.0
python version                      3.11.9.final.0
OS version                          Linux 6.8.0-1009-aws
@johnkerl johnkerl self-assigned this Jun 6, 2024
@johnkerl johnkerl changed the title from "[Bug][Python] sparse array read with result_order is slow" to "[Bug][Python] Sparse array read with result_order is slow" Jul 2, 2024
@johnkerl

[sc-51538]
