Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow approx_size for DataFrames #48

Open
DrChainsaw opened this issue Mar 4, 2021 · 0 comments
Open

Very slow approx_size for DataFrames #48

DrChainsaw opened this issue Mar 4, 2021 · 0 comments

Comments

@DrChainsaw
Copy link
Contributor

When benchmarking parallel application which uses Dagger, it seems like MemPool.approx_size is the bottleneck due to it falling back to Base.summarysize.

Here is a quick MWE:

julia>  using BenchmarkTools, DataFrames, MemPool

julia> df = DataFrame(a=1:1000_000, b=randn(1000_000), c=repeat([:aa], 1000_000));

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  61.03 MiB
  allocs estimate:  1999540
  --------------
  minimum time:     110.895 ms (4.59% GC)
  median time:      119.604 ms (2.47% GC)
  mean time:        122.978 ms (2.83% GC)
  maximum time:     146.009 ms (1.46% GC)
  --------------
  samples:          41
  evals/sample:     1

Here is a sketch of an alternative implementation which is much faster:

julia> function MemPool.approx_size(df::DataFrame)
       dsize = mapreduce(MemPool.approx_size, +, eachcol(df))
       namesize = mapreduce(MemPool.approx_size, +, names(df))
       return dsize + namesize
       end

julia> @benchmark MemPool.approx_size($df)
BenchmarkTools.Trial: 
  memory estimate:  704 bytes
  allocs estimate:  13
  --------------
  minimum time:     535.700 μs (0.00% GC)
  median time:      636.800 μs (0.00% GC)
  mean time:        664.967 μs (0.00% GC)
  maximum time:     1.525 ms (0.00% GC)
  --------------
  samples:          7499
  evals/sample:     1

The above implementation is not 100% correct, but I hope it shows that there is some potential for improvement.

Don't know if there is some interface which can be used to avoid the dependency, e.g. Tables.jl.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant