Performance against dask read_sql_table #263

argenisleon · 2020-05-23T06:12:10Z

Hi,

Is there any benchmark against pandas or dask? I am thinking about using turbodbc in https://github.com/ironmussa/optimus to move data from databases to cudf and dask-cudf.

Any idea?

xhochy · 2020-05-23T06:26:52Z

Have a look at @MathMagique's presentation, starting at 20:00: http://2017.de.pycon.org/schedule/talks/turbodbc-turbocharged-database-access-for-data-scientists/

The PEP-249 performance should be roughly similar to the pandas.read_sql_table performance; the talk shows what the performance differences are. Going to cudf with the Turbodbc Arrow Adapter might be even more efficient, as you should be able to avoid the roundtrip through pandas, since cudf also uses Arrow as its memory layout.
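A minimal sketch of what that Arrow path could look like, assuming turbodbc was built with Arrow support and that cudf is installed. The helper names `fetch_arrow_table` and `load_into_cudf`, and the `dsn` and `query` arguments, are made up for illustration; `fetchallarrow()` is the turbodbc cursor method that returns a `pyarrow.Table`, and `cudf.DataFrame.from_arrow` builds a GPU DataFrame from it.

```python
def fetch_arrow_table(connection, query):
    """Run `query` and return the whole result set as a pyarrow.Table.

    `connection` is expected to be a turbodbc connection; fetchallarrow()
    is only available when turbodbc has Arrow support.
    """
    cursor = connection.cursor()
    cursor.execute(query)
    return cursor.fetchallarrow()


def load_into_cudf(dsn, query):
    """Hypothetical helper: database -> Arrow -> cudf, without going through pandas."""
    # Imported lazily so the module can be loaded without a GPU or ODBC driver.
    import cudf
    import turbodbc

    connection = turbodbc.connect(dsn=dsn)
    try:
        table = fetch_arrow_table(connection, query)
    finally:
        connection.close()
    return cudf.DataFrame.from_arrow(table)
```

Something like `load_into_cudf("my_dsn", "SELECT * FROM my_table")` would then be the entry point, where the DSN and table name are placeholders. Whether this actually beats the pandas route for a given database would still need to be measured.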

dhirschfeld · 2020-05-23T07:46:19Z

I did a comparison against mssql/sqlalchemy, fetching 1e6 records from a SQL Server database and got a 6x speedup with turbodbc:

...and that includes the cost of converting to pandas. Plans are to try and avoid that overhead with fletcher.

xhochy · 2020-05-23T07:56:42Z

Thanks for doing this @dhirschfeld !

argenisleon · 2020-05-23T16:10:07Z

Amazing talk @MathMagique , and thanks for the info @xhochy @dhirschfeld.

Internally, dask uses the table index to parallelize the data reading. Any idea how this could play with turbodbc? Could there be any gain in using dask for this?
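To make the question concrete, here is a rough sketch of the index-based splitting, assuming an integer index column with known minimum and maximum values. The function name `partition_queries` and the table and column names are hypothetical, and this only mimics the idea of range partitioning; it is not dask's actual implementation.

```python
def partition_queries(table, index_col, lower, upper, npartitions):
    """Build one range query per partition over the closed range [lower, upper]."""
    if npartitions < 1:
        raise ValueError("npartitions must be at least 1")
    if upper < lower:
        raise ValueError("upper must not be smaller than lower")

    # Evenly spaced integer boundaries, including both end points.
    span = upper - lower
    bounds = [lower + span * i // npartitions for i in range(npartitions + 1)]

    queries = []
    for i in range(npartitions):
        lo, hi = bounds[i], bounds[i + 1]
        # Only the last partition includes its upper bound, so no row is read twice.
        op = "<=" if i == npartitions - 1 else "<"
        queries.append(
            f"SELECT * FROM {table} WHERE {index_col} >= {lo} AND {index_col} {op} {hi}"
        )
    return queries


if __name__ == "__main__":
    for query in partition_queries("events", "id", 0, 100, 4):
        print(query)
```

Each query could then be run in its own task, for example wrapped in `dask.delayed`, with every task opening its own turbodbc connection and fetching its slice. Whether that gains anything over a single turbodbc fetch is exactly the open question.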
