Very slow @astable macro outside a function #363

Open
mbataillou opened this issue Sep 2, 2023 · 4 comments

mbataillou commented Sep 2, 2023

Here is the experiment.

Given the data frame and the functions f0 and f1 below:

using DataFrames, DataFramesMeta, StatsBase
df = DataFrame(a=1:10_000)  # I know the df is small but big enough to show the issue
f0(df::DataFrame) = begin
	@chain df begin
		@rtransform(:b = :a * 10)
		@rtransform(:c = mean(:b))
		@rtransform(:d = :b - :c)
		@select(:a, :d)
	end
end
f1(df::DataFrame) = begin
	@chain df begin
		@rtransform @astable begin
			b = :a * 10
			c = mean(b)
			:d = b - c
		end
	end
end

We get a performance improvement with f1, which is what one would expect, given that it does not need to create the columns b and c.

@time f0(df)
0.001146 seconds (728 allocations: 898.516 KiB)
@time f1(df)
0.000503 seconds (161 allocations: 243.609 KiB)

However, if one uses this code outside a function (see below), it becomes roughly 40 times slower than the multi-step version, making it unusable for larger datasets.

@time @chain df begin
	@rtransform @astable begin
		b = :a * 10
		c = mean(b)
		:d = b - c
	end
end
->  2.331518 seconds (335.93 k allocations: 13.028 MiB, 4.69% compilation time)

@time @chain df begin
	@rtransform(:b = :a * 10)
	@rtransform(:c = mean(:b))
	@rtransform(:d = :b - :c)
	@select(:a, :d)
end
->  0.056910 seconds (34.81 k allocations: 3.137 MiB, 95.06% compilation time)

Thanks for the great work :)

@pdeffebach
Collaborator

Thank you for your bug report!

It is true that @astable will be slower outside of a function. The reason is that DataFramesMeta.jl creates anonymous functions, and calling them requires specialization on their input and output types. For @astable, the output type is complicated, because the column names and value types are stored in the type information of a NamedTuple.
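
To illustrate the point about types, here is a minimal sketch in plain Julia. It does not show DataFramesMeta.jl internals, and the names `nt`, `ex`, `f_a`, `f_b`, and `make` are made up for the example. It shows that a NamedTuple carries its field names in its type, and that re-evaluating an anonymous-function expression at top level (as happens when a cell or REPL line is re-run) produces a fresh function type each time, whereas an anonymous function inside a named function keeps one type across calls.

```julia
# A NamedTuple stores its field names and value types in its type,
# so every distinct combination is a distinct type to specialize on.
nt = (b = 10, d = 2.5)
@assert typeof(nt) == NamedTuple{(:b, :d), Tuple{Int, Float64}}

# Hypothetical stand-in for an anonymous function generated by a macro.
ex = :(x -> (d = x * 10,))

# Evaluating the same expression twice at top level yields two
# different function types, so compiled code cannot be reused.
f_a = eval(ex)
f_b = eval(ex)
@assert typeof(f_a) != typeof(f_b)

# Inside a named function, the anonymous function's type is created
# once, when `make` is defined, and is shared by every call.
make() = x -> (d = x * 10,)
@assert typeof(make()) == typeof(make())
```

This is consistent with the timings in this thread, where almost all of the top-level run time is reported as compilation time, while the same code wrapped in f1 runs quickly on repeated calls.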

That said, I can't replicate your issue on version 1.9. I get roughly comparable timings.

julia> @time @chain df begin
               @rtransform @astable begin
                       b = :a * 10
                       c = mean(b)
                       :d = b - c
               end
       end;
  0.052417 seconds (64.17 k allocations: 3.484 MiB, 97.60% compilation time)

julia> @time @chain df begin
               @rtransform(:b = :a * 10)
               @rtransform(:c = mean(:b))
               @rtransform(:d = :b - :c)
               @select(:a, :d)
       end;
  0.045937 seconds (45.52 k allocations: 3.869 MiB, 96.93% compilation time)

Can you double check your measurements and give me your version info?

@mbataillou
Author

So I double checked the results and found something potentially interesting!

In the REPL I get timings similar to yours, but the timings above came from running inside a Pluto notebook.
To triple check, I ran the code above in a Pluto notebook again, and there I get the 40x slowdown again.

It seems the issue is using @astable outside a function in a Pluto notebook.

Here is my version info

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 12 × Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 6 on 12 virtual cores
Environment:
  JULIA_DEPOT_PATH = /Users/xxx/.julia:/Applications/Julia-1.9.app/Contents/Resources/julia/local/share/julia:/Applications/Julia-1.9.app/Contents/Resources/julia/share/julia
  JULIA_LOAD_PATH = @:@v#.#:@stdlib
  JULIA_REVISE_WORKER_ONLY = 1

@pdeffebach
Collaborator

That's pretty frustrating. Can you open an issue with Pluto.jl (or on whatever tracker they use)? Then I can cross-link it with this issue.

@pdeffebach
Collaborator

Bumping this. @mbataillou, did you ever file an issue with Pluto.jl?
