cheb2leg is slow #97

Open
dlfivefifty opened this issue Dec 27, 2019 · 4 comments

@dlfivefifty
Member

This seems very slow:

julia> @time cheb2leg(randn(1_000));
  0.014712 seconds (9 allocations: 16.438 KiB)

julia> @time cheb2leg(randn(10_000));
  0.491001 seconds (11 allocations: 156.969 KiB)

julia> @time cheb2leg(randn(100_000));
  8.688441 seconds (11 allocations: 1.527 MiB)
@dlfivefifty
Member Author

I guess it's all in the plan:

julia> @time p = plan_cheb2leg(randn(100_000));
  8.466608 seconds (7 allocations: 781.531 KiB)

julia> @time p * randn(100_000);
  0.160250 seconds (172 allocations: 1.536 MiB)

I remember the Toeplitz-dot-Hankel approach being much faster.
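For reference, the transform itself is just a change of basis from Chebyshev-T to Legendre coefficients, and for small n it can be reproduced independently of FastTransforms. A minimal O(n²) sketch in Python/NumPy (the monomial route is numerically unsafe beyond roughly n = 30, so this is a reference for what the fast transforms compute, not a fast algorithm):

```python
import numpy as np
from numpy.polynomial import chebyshev as C, legendre as L

def cheb2leg_dense(c):
    """Chebyshev-T coefficients -> Legendre coefficients, via monomials.

    O(n^2) and only numerically trustworthy for small n (roughly n < 30);
    a dense reference for the linear map the fast transforms realize.
    """
    return L.poly2leg(C.cheb2poly(c))

# Sanity check: both expansions evaluate to the same polynomial.
rng = np.random.default_rng(0)
c = rng.standard_normal(10)
l = cheb2leg_dense(c)
x = np.linspace(-1, 1, 101)
assert np.allclose(C.chebval(x, c), L.legval(x, l))
```

Both the Toeplitz-dot-Hankel approach and the current plans are fast, stable realizations of exactly this linear map.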

@MikaelSlevinsky
Member

Indeed, the plans are slower than the Toeplitz-dot-Hankel approach but the execution times are much faster. Comparing with Figure 5 from the Webb, Townsend, Olver paper, I see:

| n | ToH (s) | plan_cheb2leg (s) | p * c (s) |
| --- | --- | --- | --- |
| 1,000 | 0.05 | 0.013 | 0.00043 |
| 10,000 | 0.2 | 0.37 | 0.0067 |
| 100,000 | 1.3 | 8.4 | 0.091 |

I think that for multiple transforms reusing the same plan, this is the better alternative.

There is one (possibly surprising) reason the plans are slower than before: in C, the Float64 plans are first created in long double precision before being converted to double precision data structures to squeeze out the trailing bits more accurately. Similarly, in Float32, they are first created in double precision.
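A toy illustration of the effect (in Python, with float32/float64 standing in for double/long double; the Λ(k) = Γ(k+1/2)/Γ(k+1) recurrence below is the kind of ingredient such plans are built from, not the actual FastTransforms code):

```python
import math
import numpy as np

def lam_recurrence(n, dtype):
    """Lambda(k) = Gamma(k+1/2)/Gamma(k+1) via the recurrence
    Lambda(k) = Lambda(k-1) * (k - 1/2) / k, carried out in `dtype`."""
    lam = np.empty(n + 1, dtype=dtype)
    lam[0] = dtype(math.sqrt(math.pi))  # Lambda(0) = Gamma(1/2)
    for k in range(1, n + 1):
        lam[k] = lam[k - 1] * dtype(k - 0.5) / dtype(k)
    return lam

n = 1000
# Accurate reference from log-gamma in double precision.
ref = np.array([math.exp(math.lgamma(k + 0.5) - math.lgamma(k + 1))
                for k in range(n + 1)])

direct = lam_recurrence(n, np.float32)                       # all in float32
promoted = lam_recurrence(n, np.float64).astype(np.float32)  # float64, then round

err_direct = np.max(np.abs(direct - ref) / ref)
err_promoted = np.max(np.abs(promoted - ref) / ref)
assert err_promoted < err_direct  # promoting recovers the trailing bits
```

Running the recurrence one precision up and rounding down leaves each entry essentially correctly rounded, whereas the low-precision recurrence drifts by several ulps over a thousand steps.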

If one accepted 12–14 digits instead of 14–16, then the purely 64-bit plans would be fine for 64-bit execution. The plan time at 100_000 is then just a bit less than the current Float32 plan's:

julia> @time p = plan_cheb2leg(randn(Float32, 100_000));
  2.153270 seconds (5 allocations: 208 bytes)

Notice that the 1D plans are now also multithreaded, so:

julia> c = randn(Float64, 100_000);

julia> @time p = plan_cheb2leg(c);
  8.197012 seconds (5 allocations: 208 bytes)

julia> @time p*c;
  0.092681 seconds (8 allocations: 781.844 KiB)

julia> x = randn(Float64, 100_000, 20);

julia> @time p*x;
  0.514451 seconds (8 allocations: 15.259 MiB)

With four logical cores, the twenty-column execution is only about 5.5 times the single-column time, not twenty times.
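The same amortization is visible even in a serial dense model (Python/NumPy, with a random triangular matrix standing in for the plan): one matrix-matrix product over all columns gives the same results as column-by-column application, up to rounding, while letting BLAS batch the work.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ncols = 500, 20
M = np.triu(rng.standard_normal((n, n)))   # stand-in for a dense plan matrix
X = rng.standard_normal((n, ncols))        # twenty coefficient vectors

Y_block = M @ X                            # one gemm over all columns
Y_cols = np.column_stack([M @ X[:, j] for j in range(ncols)])  # 20 gemvs

# Same results up to rounding; the blocked product is the one that amortizes
# memory traffic (and, in FastTransforms, threads) across columns.
assert np.allclose(Y_block, Y_cols)
```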

@dlfivefifty
Member Author

OK, so we should use Toeplitz-dot-Hankel for cheb2leg, since there the plan is only used once, but keep the current approach for plan_cheb2leg, on the assumption that if you are planning you'll be applying the transform many times.

@MikaelSlevinsky
Member

Better would be to implement the Alpert–Rokhlin scheme in C for cases where the connection coefficients and bounds on off-diagonal block ranks are all known. See #75 (comment)
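For context on why that scheme fits: well-separated off-diagonal blocks of the Chebyshev-to-Legendre connection matrix are numerically low-rank, which is exactly what an Alpert–Rokhlin-type scheme exploits. A quick check in Python/NumPy (building the matrix by Gauss–Legendre quadrature; block location and tolerance are arbitrary illustrative choices):

```python
import numpy as np
from numpy.polynomial import chebyshev as C, legendre as L

n = 512
# Connection matrix M with leg_coeffs = M @ cheb_coeffs.  Entry (j, k) is the
# j-th Legendre coefficient of T_k, by Gauss-Legendre quadrature:
#   M[j, k] = (j + 1/2) * integral_{-1}^{1} T_k(x) P_j(x) dx
x, w = L.leggauss(n)            # n-point rule: exact for degree <= 2n - 1
T = C.chebvander(x, n - 1)      # column k holds T_k at the nodes
P = L.legvander(x, n - 1)       # column j holds P_j at the nodes
M = (np.arange(n) + 0.5)[:, None] * (P.T @ (w[:, None] * T))

# M[j, k] vanishes unless k >= j with k - j even, so look at the even-even
# entries of a well-separated off-diagonal block.
B = M[0:128:2, 256:512:2]       # 64 x 128 block, far from the diagonal
s = np.linalg.svd(B, compute_uv=False)
numrank = int(np.sum(s > 1e-8 * s[0]))
assert numrank < min(B.shape) // 2   # numerically low-rank
```

Because such blocks compress, a hierarchical scheme can apply the whole matrix much faster than the dense O(n²) product.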
