cheb2leg is slow #97

Open
dlfivefifty opened this issue Dec 27, 2019 · 4 comments

@dlfivefifty
Member

This seems very slow:

julia> @time cheb2leg(randn(1_000));
  0.014712 seconds (9 allocations: 16.438 KiB)

julia> @time cheb2leg(randn(10_000));
  0.491001 seconds (11 allocations: 156.969 KiB)

julia> @time cheb2leg(randn(100_000));
  8.688441 seconds (11 allocations: 1.527 MiB)
@dlfivefifty
Member Author

I guess it's all in the plan:

julia> @time p = plan_cheb2leg(randn(100_000));
  8.466608 seconds (7 allocations: 781.531 KiB)

julia> @time p * randn(100_000);
  0.160250 seconds (172 allocations: 1.536 MiB)

I remember the Toeplitz-dot-Hankel approach being much faster.
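For reference, the transform itself is just a change of basis from Chebyshev-T to Legendre coefficients, and for small n it can be reproduced independently of FastTransforms. A minimal O(n²) sketch in Python/NumPy (the monomial route is numerically unsafe beyond roughly n = 30, so this is a reference for what the fast transforms compute, not a fast algorithm):

```python
import numpy as np
from numpy.polynomial import chebyshev as C, legendre as L

def cheb2leg_dense(c):
    """Chebyshev-T coefficients -> Legendre coefficients, via monomials.

    O(n^2) and only numerically trustworthy for small n (roughly n < 30);
    a dense reference for the linear map the fast transforms realize.
    """
    return L.poly2leg(C.cheb2poly(c))

# Sanity check: both expansions evaluate to the same polynomial.
rng = np.random.default_rng(0)
c = rng.standard_normal(10)
l = cheb2leg_dense(c)
x = np.linspace(-1, 1, 101)
assert np.allclose(C.chebval(x, c), L.legval(x, l))
```

Both the Toeplitz-dot-Hankel approach and the current plans are fast, stable realizations of exactly this linear map.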

@MikaelSlevinsky
Member

Indeed, the plans are slower than the Toeplitz-dot-Hankel approach but the execution times are much faster. Comparing with Figure 5 from the Webb, Townsend, Olver paper, I see:

| n | ToH (s) | plan_cheb2leg (s) | p * c (s) |
| --- | --- | --- | --- |
| 1,000 | 0.05 | 0.013 | 0.00043 |
| 10,000 | 0.2 | 0.37 | 0.0067 |
| 100,000 | 1.3 | 8.4 | 0.091 |

I think that for multiple transforms reusing the same plan, this is the better alternative.

There is one (possibly surprising) reason the plans are slower than before: in C, the Float64 plans are first created in long double precision before being converted to double precision data structures to squeeze out the trailing bits more accurately. Similarly, in Float32, they are first created in double precision.
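A toy illustration of the effect (in Python, with float32/float64 standing in for double/long double; the Λ(k) = Γ(k+1/2)/Γ(k+1) recurrence below is the kind of ingredient such plans are built from, not the actual FastTransforms code):

```python
import math
import numpy as np

def lam_recurrence(n, dtype):
    """Lambda(k) = Gamma(k+1/2)/Gamma(k+1) via the recurrence
    Lambda(k) = Lambda(k-1) * (k - 1/2) / k, carried out in `dtype`."""
    lam = np.empty(n + 1, dtype=dtype)
    lam[0] = dtype(math.sqrt(math.pi))  # Lambda(0) = Gamma(1/2)
    for k in range(1, n + 1):
        lam[k] = lam[k - 1] * dtype(k - 0.5) / dtype(k)
    return lam

n = 1000
# Accurate reference from log-gamma in double precision.
ref = np.array([math.exp(math.lgamma(k + 0.5) - math.lgamma(k + 1))
                for k in range(n + 1)])

direct = lam_recurrence(n, np.float32)                       # all in float32
promoted = lam_recurrence(n, np.float64).astype(np.float32)  # float64, then round

err_direct = np.max(np.abs(direct - ref) / ref)
err_promoted = np.max(np.abs(promoted - ref) / ref)
assert err_promoted < err_direct  # promoting recovers the trailing bits
```

Running the recurrence one precision up and rounding down leaves each entry essentially correctly rounded, whereas the low-precision recurrence drifts by several ulps over a thousand steps.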

If one accepted 12–14 digits instead of 14–16, then the purely 64-bit plans would be fine for 64-bit execution. The plan time at 100_000 is then just a bit less than the current Float32 plan's:

julia> @time p = plan_cheb2leg(randn(Float32, 100_000));
  2.153270 seconds (5 allocations: 208 bytes)

Notice that the 1D plans are now also multithreaded, so:

julia> c = randn(Float64, 100_000);

julia> @time p = plan_cheb2leg(c);
  8.197012 seconds (5 allocations: 208 bytes)

julia> @time p*c;
  0.092681 seconds (8 allocations: 781.844 KiB)

julia> x = randn(Float64, 100_000, 20);

julia> @time p*x;
  0.514451 seconds (8 allocations: 15.259 MiB)

With four logical cores, the twenty-column execution is only about 5.5 times the single-column time, not twenty times.
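The same amortization is visible even in a serial dense model (Python/NumPy, with a random triangular matrix standing in for the plan): one matrix-matrix product over all columns gives the same results as column-by-column application, up to rounding, while letting BLAS batch the work.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ncols = 500, 20
M = np.triu(rng.standard_normal((n, n)))   # stand-in for a dense plan matrix
X = rng.standard_normal((n, ncols))        # twenty coefficient vectors

Y_block = M @ X                            # one gemm over all columns
Y_cols = np.column_stack([M @ X[:, j] for j in range(ncols)])  # 20 gemvs

# Same results up to rounding; the blocked product is the one that amortizes
# memory traffic (and, in FastTransforms, threads) across columns.
assert np.allclose(Y_block, Y_cols)
```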

@dlfivefifty
Member Author

OK, so we should use Toeplitz-dot-Hankel for cheb2leg, since there the plan is only used once, but keep the current approach for plan_cheb2leg, on the assumption that if you are planning you'll be applying the transform many times.

@MikaelSlevinsky
Member

Better would be to implement the Alpert–Rokhlin scheme in C for cases where the connection coefficients and bounds on off-diagonal block ranks are all known. See #75 (comment)
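For context on why that scheme fits: well-separated off-diagonal blocks of the Chebyshev-to-Legendre connection matrix are numerically low-rank, which is exactly what an Alpert–Rokhlin-type scheme exploits. A quick check in Python/NumPy (building the matrix by Gauss–Legendre quadrature; block location and tolerance are arbitrary illustrative choices):

```python
import numpy as np
from numpy.polynomial import chebyshev as C, legendre as L

n = 512
# Connection matrix M with leg_coeffs = M @ cheb_coeffs.  Entry (j, k) is the
# j-th Legendre coefficient of T_k, by Gauss-Legendre quadrature:
#   M[j, k] = (j + 1/2) * integral_{-1}^{1} T_k(x) P_j(x) dx
x, w = L.leggauss(n)            # n-point rule: exact for degree <= 2n - 1
T = C.chebvander(x, n - 1)      # column k holds T_k at the nodes
P = L.legvander(x, n - 1)       # column j holds P_j at the nodes
M = (np.arange(n) + 0.5)[:, None] * (P.T @ (w[:, None] * T))

# M[j, k] vanishes unless k >= j with k - j even, so look at the even-even
# entries of a well-separated off-diagonal block.
B = M[0:128:2, 256:512:2]       # 64 x 128 block, far from the diagonal
s = np.linalg.svd(B, compute_uv=False)
numrank = int(np.sum(s > 1e-8 * s[0]))
assert numrank < min(B.shape) // 2   # numerically low-rank
```

Because such blocks compress, a hierarchical scheme can apply the whole matrix much faster than the dense O(n²) product.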
