Commit

Documentation: Improve section about batch operations with pandas
Specifically, outline _two_ concrete considerations for determining
the optimal chunk size.
amotl committed May 11, 2023
1 parent dbf9293 commit 55928e9
Showing 1 changed file with 21 additions and 7 deletions.
docs/by-example/sqlalchemy/dataframe.rst
@@ -70,14 +70,26 @@ multiple batches, using a defined chunk size.

You will observe that the optimal chunk size highly depends on the shape of
your data, specifically the width of each record, i.e. the number of columns
-and their individual sizes. You will need to determine a good chunk size by
-running corresponding experiments on your own behalf. For that purpose, you
-can use the `insert_pandas.py`_ program as a blueprint.
+and their individual sizes, which ultimately determine the total size of
+each batch/chunk.

-It is a good idea to start your explorations with a chunk size of 5000, and
+Two considerations are important when determining the optimal chunk size
+for a given dataset. First, when working with data larger than the main
+memory available on your machine, each chunk should be small enough to fit
+into memory, but large enough to minimize the overhead of a single insert
+operation. Second, because each batch is submitted via HTTP, you should be
+aware of the request size limits of your HTTP infrastructure.
+
+You will need to determine a good chunk size by running your own
+experiments. For that purpose, you can use the `insert_pandas.py`_ program
+as a blueprint.
+
+It is a good idea to start your explorations with a chunk size of 5_000, and
then see if performance improves when you increase or decrease that figure.
-Chunk sizes of 20000 may also be applicable, but make sure to take the limits
-of your HTTP infrastructure into consideration.
+Users have reported that 10_000 is their optimal setting, but if you have,
+for example, just three columns, you may also experiment with `leveling up to
+200_000`_, because `the chunksize should not be too small`_. If it is too
+small, the I/O overhead will outweigh the benefit of batching.

In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
data means in the context of `DataFrame computing`_, let us refer you to `a
@@ -93,14 +105,16 @@ multiple batches, using a defined chunk size.
.. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
.. _DataFrame computing: https://realpython.com/pandas-dataframe/
-.. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
+.. _insert_pandas.py: https://github.com/crate/cratedb-examples/blob/main/by-language/python/insert_pandas.py
+.. _leveling up to 200_000: https://acepor.github.io/2017/08/03/using-chunksize/
.. _NumPy: https://en.wikipedia.org/wiki/NumPy
.. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database
.. _SQL: https://en.wikipedia.org/wiki/SQL
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
+.. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
.. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
.. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
.. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data
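
To complement the chunk size discussion above, here is a minimal, hypothetical
sketch of such a batched insert. It assumes a locally running CrateDB instance
reachable via ``crate://localhost:4200``, the ``insert_bulk`` helper from
``crate.client.sqlalchemy.support`` as shipped with recent versions of the
``crate`` Python package, and an illustrative table name ``testdrive``; adjust
these details, and the chunk size, to your own environment and dataset.

.. code-block:: python

    import pandas as pd
    import sqlalchemy as sa

    # Helper that turns each pandas batch into a CrateDB bulk operation
    # (assumed to be available in your version of the `crate` package).
    from crate.client.sqlalchemy.support import insert_bulk

    # Synthetic example data; substitute your own, possibly much larger, DataFrame.
    df = pd.DataFrame({"id": range(100_000), "value": 42.42, "label": "example"})

    # Rough estimate of one chunk's in-memory footprint, as a proxy for judging
    # whether the corresponding HTTP request body stays within your limits.
    chunksize = 5_000
    bytes_per_row = df.memory_usage(index=False, deep=True).sum() / len(df)
    print(f"Approximate chunk payload: {bytes_per_row * chunksize / 1024:.0f} KiB")

    # Connect to CrateDB and submit the DataFrame in batches of `chunksize` records.
    engine = sa.create_engine("crate://localhost:4200")
    df.to_sql(
        name="testdrive",
        con=engine,
        if_exists="replace",
        index=False,
        chunksize=chunksize,
        method=insert_bulk,
    )

Tune ``chunksize`` up or down as described above, and measure the wall-clock time
per run to find the sweet spot for your data shape and HTTP setup.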
