[FEAT] Add sparse non-negative OLS and WLS via QP for MinTraceSparse #319

Open · wants to merge 3 commits into main

Conversation

@christophertitchen (Contributor)

This PR improves the current sparse non-negative reconciliation in MinTraceSparse by expanding upon the heuristic from my previous PR, #284, to raise the quality of the non-negative reconciled forecasts for large numbers of time series. It converts the OLS and WLS estimators into quadratic programming (QP) problems with non-negativity constraints, similar to the approach in MinTrace, but, most importantly, avoids evaluating the dense $S^{T}W^{-1}S$, thereby maintaining the sparsity of the problem.

The implementation in this PR is robust and performant, allowing users dealing with large numbers of time series—which is extremely common in industrial contexts—to reliably generate high-quality non-negative OLS and WLS reconciled forecasts in a fraction of the time.

| Dataset | $\textup{dim}\left( P \right)$ | $K$ | $T$ | $h$ | Method | MinTrace | MinTraceSparse | Reduction |
|---|---|---|---|---|---|---|---|---|
| M5 | $\small{\left( 30\,490 \times 42\,840 \right)}$ | $12$ | $1\,969$ | $28$ | ols | $\gt 2\,\text{hr}$ | $4.94\,\text{s}$ | $\gt 99.9\,\%$ |
| M5 | $\small{\left( 30\,490 \times 42\,840 \right)}$ | $12$ | $1\,969$ | $28$ | wls_struct | $\gt 2\,\text{hr}$ | $1.24\,\text{s}$ | $\gt 99.9\,\%$ |
| Tourism | $\small{\left( 304 \times 555 \right)}$ | $8$ | $228$ | $12$ | ols | $119\,\text{ms}$ | $28.4\,\text{ms}$ | $76.1\,\%$ |
| Tourism | $\small{\left( 304 \times 555 \right)}$ | $8$ | $228$ | $12$ | wls_struct | $109\,\text{ms}$ | $17.1\,\text{ms}$ | $84.3\,\%$ |
| Tourism | $\small{\left( 304 \times 555 \right)}$ | $8$ | $228$ | $12$ | wls_var | $144\,\text{ms}$ | $49.0\,\text{ms}$ | $66.0\,\%$ |

It adds two new QP solvers as dependencies to the package—OSQP and Clarabel.rs—which use augmented Lagrangian and interior point algorithms, respectively. These two solvers are the most performant for these sorts of problems in my experience, and perform well in benchmarks. The new behaviour is triggered by default upon instantiation of MinTraceSparse when `nonnegative=True` is passed, but can be toggled off by setting `qp=False`, which reverts to the method detailed in #284.
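
For context, a minimal usage sketch (the `qp` keyword is the new argument from this PR; the data frames are assumed to be prepared as usual for `hierarchicalforecast`):

```python
from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import MinTraceSparse

# QP-based non-negative reconciliation, the new default when nonnegative=True.
# Setting qp=False reverts to the heuristic from #284.
reconcilers = [MinTraceSparse(method="wls_struct", nonnegative=True, qp=True)]
hrec = HierarchicalReconciliation(reconcilers=reconcilers)

# Y_hat_df, Y_df, S, and tags are assumed to be prepared beforehand.
Y_rec_df = hrec.reconcile(Y_hat_df=Y_hat_df, Y_df=Y_df, S=S, tags=tags)
```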

I will add the GLS estimators, mint_cov and mint_shrink, in a future PR, as we can exploit the symmetry of $W$ by storing it as an upper-triangular matrix, maintaining a reasonable degree of sparsity and decreasing the time taken to invert it.

Warm Start

via modified #284

As the OSQP solver supports manually warm starting the primal and dual variables, we can modify the linear operator for $P$ used in #284 to project directly onto the non-negative orthant, using GCROT(m, k) to quickly solve the linear system and generate the reconciled non-negative forecasts, $x_{0}$, with which to warm start the primal variables. We could use a Jacobi preconditioner, as the diagonal of $A$ in the $Ax=b$ linear system is given by the row-wise sums of $S^{T}W^{-1}$, but most aggregation structures are probably not well suited to diagonal preconditioning.

Generating $x_{0}$ incurs a small performance penalty, but this is offset by a decrease in the number of iterations needed to solve the primary problem below, as well as an increased chance of solving it at all in my experience. $x_{0}$ is also used as a baseline against which the post-processed QP primal solutions are compared, so it acts as an important safeguard: regardless of how the solvers fare, we return non-negative reconciled forecasts with an objective value at least as good as that of $x_{0}$.
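
A simplified sketch of this warm start, assuming a sparse summing matrix `S` of shape $n \times n_{b}$, a diagonal $W^{-1}$ stored as a 1-D array `w_inv`, and base forecasts `y_hat`; note the actual PR folds the projection into the linear operator itself, whereas here we simply clip the GCROT(m, k) solution, and `warm_start` is an illustrative name:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gcrotmk

def warm_start(S, w_inv, y_hat):
    n, n_b = S.shape
    # Matrix-free operator for A = S^T W^{-1} S, keeping the problem sparse.
    matvec = lambda v: S.T @ (w_inv * (S @ v))
    A = LinearOperator((n_b, n_b), matvec=matvec)
    b = S.T @ (w_inv * y_hat)
    x_b, info = gcrotmk(A, b)          # GCROT(m, k) Krylov solve
    x_b = np.clip(x_b, 0.0, None)      # project onto the non-negative orthant
    return S @ x_b                     # coherent non-negative x_0
```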

Primary Approach

via OSQP

$\begin{split} \underset{x \in \mathbb{R}^{n}}{\text{minimise}} \quad & \frac{1}{2} x^{T}W^{-1}x - \left( W^{-1}\hat{y}_{h} \right)^{T}x \end{split}$

$\begin{split} \text{subject to} \quad & 0_{n} \leq \begin{pmatrix} I_{n_{a}} & -A \\ 0_{n_{b}, n_{a}} & I_{n_{b}} \end{pmatrix}x \leq \begin{pmatrix} 0_{n_{a}} \\ \infty_{n_{b}} \end{pmatrix} \end{split}$

where $A$ is the "aggregation matrix", which describes the relationship between the bottom level (leaf node) time series and the aggregated (parent node) time series, $W^{-1}$ is the precision matrix for the problem, and $\hat{y}_{h}$ are the base forecasts for a particular horizon step, $h$.

This formulation is the most efficient, as $\left(I_{n_{a}} \quad -A\right)$ has the set of all reconciled forecasts in its null space, so the graph structure is encoded in just $n_{a}$ equality constraints, with the remaining $n_{b}$ inequality constraints restricting $x$ to the convex, non-negative feasible region of the solution space.
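
A hedged sketch of how this problem might be set up with OSQP's Python interface; `solve_primary`, `A_agg` (the aggregation matrix, $n_{a} \times n_{b}$), and `w_inv` are illustrative names and assumptions, not the PR's actual code:

```python
import numpy as np
import scipy.sparse as sp
import osqp

def solve_primary(w_inv, y_hat, A_agg, n_a, n_b, x_0=None):
    n = n_a + n_b
    W_inv = sp.diags(w_inv, format="csc")
    # Stack the n_a aggregation (equality) rows over the n_b non-negativity rows.
    C = sp.vstack(
        [
            sp.hstack([sp.identity(n_a), -A_agg]),
            sp.hstack([sp.csr_matrix((n_b, n_a)), sp.identity(n_b)]),
        ],
        format="csc",
    )
    l = np.zeros(n)
    u = np.concatenate([np.zeros(n_a), np.full(n_b, np.inf)])
    prob = osqp.OSQP()
    # P = W^{-1}, q = -W^{-1} y_hat, as in the objective above.
    prob.setup(P=W_inv, q=-(W_inv @ y_hat), A=C, l=l, u=u, verbose=False)
    if x_0 is not None:
        prob.warm_start(x=x_0)  # warm start the primal variables with x_0
    res = prob.solve()
    # res.info.status is "solved" or "solved inaccurate" on success.
    return res.x
```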

We try this approach first for each $h$. If the solver solves the problem, even with reduced accuracy, the primal solutions are post-processed by clipping to $[0, \infty)$ and aggregating with $S$, then compared to $x_{0}$ to ensure they are objectively better, and the best non-negative reconciled forecasts are returned for each $h$. If OSQP exceeds the maximum number of iterations for a given $h$ without finding a solution, or finds the problem to be infeasible, the secondary approach below is employed by default as a fallback; if fallback is set to False, $x_{0}$ is returned instead.

Secondary Approach

via Clarabel

$\begin{split} \underset{x \in \mathbb{R}^{n_{b}}, \; z \in \mathbb{R}^{n}}{\text{minimise}} \quad & \frac{1}{2} \begin{pmatrix} x \\ z \end{pmatrix}^{T} \begin{pmatrix} 0_{n_{b}, n_{b}} & 0_{n_{b}, n} \\ 0_{n, n_{b}} & W^{-1} \end{pmatrix} \begin{pmatrix} x \\ z \end{pmatrix} \end{split}$

$\begin{split} \text{subject to} \quad & Sx - \hat{y}_{h} = z \\ & \begin{pmatrix} S & -I_{n} \\ -I_{n_{b}} & 0_{n_{b}, n} \end{pmatrix} \begin{pmatrix} x \\ z \end{pmatrix} + s = \begin{pmatrix} \hat{y}_{h} \\ 0_{n_{b}} \end{pmatrix} \\ & s \in \left\{ 0 \right\}^{n} \times \left\{ x \in \mathbb{R}^{n_{b}} : x_{i} \ge 0, \; \forall i = 1, \cdots, n_{b} \right\} \end{split}$

where $S$ is the "summing matrix", $z$ is an auxiliary variable, and $s$ is a slack variable belonging to the Cartesian product of two convex cones.

This formulation is an alternative way of framing the problem in the Wickramasuriya, Turlach, and Hyndman paper that still maintains sparsity. The secondary approach increases the number of decision variables from $n$ to $n + n_{b}$ and the number of non-zeros ("nnz") in the linear constraints matrix by $2n_{b}$ relative to the primary approach, so it is a tad slower in most cases, but it tends to be more successful in my experience. Pairing it with a different solver, in this case Clarabel, also maximises our chances of finding optimal non-negative reconciled forecasts.
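
A sketch of this conic formulation using Clarabel's Python API, under the same illustrative naming assumptions as above (`solve_secondary`, `w_inv`, and the helper layout are mine, not the PR's code):

```python
import numpy as np
import scipy.sparse as sp
import clarabel

def solve_secondary(w_inv, y_hat, S, n, n_b):
    # Decision variable is (x, z) with x in R^{n_b} and z in R^{n}; only the
    # z block of the quadratic cost is non-zero.
    P = sp.block_diag([sp.csc_matrix((n_b, n_b)), sp.diags(w_inv)], format="csc")
    q = np.zeros(n_b + n)
    # Zero cone rows encode S x - z = y_hat; non-negative cone rows encode x >= 0.
    A = sp.vstack(
        [
            sp.hstack([S, -sp.identity(n)]),
            sp.hstack([-sp.identity(n_b), sp.csr_matrix((n_b, n))]),
        ],
        format="csc",
    )
    b = np.concatenate([y_hat, np.zeros(n_b)])
    cones = [clarabel.ZeroConeT(n), clarabel.NonnegativeConeT(n_b)]
    settings = clarabel.DefaultSettings()
    settings.verbose = False
    solver = clarabel.DefaultSolver(P, q, A, b, cones, settings)
    sol = solver.solve()
    return np.asarray(sol.x)[:n_b]  # bottom-level forecasts x
```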

Post-process

As mentioned above, if either OSQP or Clarabel returns a solution, it is post-processed by clipping to $[0, \infty)$ and aggregating with $S$, then compared to $x_{0}$ to ensure it is objectively better.
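
In code, the safeguard might look like the following hypothetical helper, where `x_b` is the bottom-level slice of a primal solution and `w_inv` is the diagonal of $W^{-1}$ as before:

```python
import numpy as np

# Clip the primal solution, re-aggregate with S, and keep the candidate only
# if its objective value is at least as good as the heuristic baseline x_0.
def postprocess(x_b, S, w_inv, y_hat, x_0):
    x = S @ np.clip(x_b, 0.0, None)  # coherent, non-negative candidate
    obj = lambda v: 0.5 * (v @ (w_inv * v)) - (w_inv * y_hat) @ v
    return x if obj(x) <= obj(x_0) else x_0
```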

@elephaint (Contributor) commented Dec 19, 2024

Thanks, this is great work!

Some minor comments so far:

  • You can fix the failing linting check by clearing the notebook and making sure the notebooks are cleaned too. See contributing.md in the main repo folder.
  • I'm not too happy about adding two more solvers as dependencies, but if it makes sense performance-wise we could go for it. I do think I'd rather keep one QP solver instead of two, if possible. So we may want to get rid of quadprog then.
  • I think it's important to have a test showing the behavior of the old method compared to the two new methods, both in terms of accuracy and speed, for example a test that reproduces the table you show above for M5 and Tourism. If we add a lot of additional complexity and dependencies, we just want to make sure we have validated the benefits in accuracy and efficiency.
  • Maybe we should opt for choosing the method of preference over doing everything, i.e. choose 'heuristic', 'qp_osqp', 'qp_clarabel', or 'combined' as a string choice in the instantiation of MinTraceSparse. The default is now 'combined', i.e. it performs a lot of work by doing everything, right? I also think the default should depend on how clear the benefit is based on the tests, e.g. if there is only a very minor benefit in speed and accuracy, keep 'heuristic' as the default and don't add the new solvers as package dependencies, but let users install them themselves when choosing these options.
  • How do we know x_0 is a good warm start? Do we arrive at the same solution, albeit with more iterations, if we start from a different (random) initialization?

@christophertitchen
Copy link
Contributor Author

> Thanks, this is great work!
>
> Some minor comments so far:
>
>   • You can fix the failing linting check by clearing the notebook and making sure the notebooks are cleaned too. See contributing.md in the main repo folder.
>   • I'm not too happy about adding two more solvers as dependencies, but if it makes sense performance-wise we could go for it. I do think I'd rather keep one QP solver instead of two, if possible. So we may want to get rid of quadprog then.
>   • I think it's important to have a test showing the behavior of the old method compared to the two new methods, both in terms of accuracy and speed, for example a test that reproduces the table you show above for M5 and Tourism. If we add a lot of additional complexity and dependencies, we just want to make sure we have validated the benefits in accuracy and efficiency.
>   • Maybe we should opt for choosing the method of preference over doing everything, i.e. choose 'heuristic', 'qp_osqp', 'qp_clarabel', or 'combined' as a string choice in the instantiation of MinTraceSparse. The default is now 'combined', i.e. it performs a lot of work by doing everything, right? I also think the default should depend on how clear the benefit is based on the tests, e.g. if there is only a very minor benefit in speed and accuracy, keep 'heuristic' as the default and don't add the new solvers as package dependencies, but let users install them themselves when choosing these options.
>   • How do we know x_0 is a good warm start? Do we arrive at the same solution, albeit with more iterations, if we start from a different (random) initialization?

Hi, Olivier! I hope you're well.

The linting issue is strange; thanks for pointing that out! I thought I ran nbdev_clean before staging, but I'll double-check my environment before my next commit.

Yup, I also thought that three is close to excessive. The good news is that both are light on dependencies: each depends only on numpy and scipy, which don't conflict with ours, and OSQP has one additional dependency, qdldl.

Would you rather scale it down to one sparse QP solver? It's straightforward to express the secondary approach in a problem format that OSQP will accept, as the only change is reformulating the linear constraints, or vice versa for Clarabel. In my experience, the trade-off between the two in this context is that OSQP is slightly faster than Clarabel but slightly less reliable, particularly for the secondary approach unless you set a higher accuracy, hence the incorporation of both solvers.

Unfortunately, we have to keep quadprog for now until MinTraceSparse has all of the functionality of MinTrace, as neither OSQP nor Clarabel supports dense matrices.

My design decision for the PR just comes down to my workflow in my job and what I think is most valuable in a large business context. When evaluating tens or hundreds of millions of forecasts, it's important to have a process that's robust and performant. At that scale, it's inconsequential whether, for example, the QP solvers struggled with a particular reconciliation problem, or which, if either, of the two approaches was successful, as long as the best-estimate coherent non-negative forecasts are returned for a given $\hat{y}_{h}$ and method, which is ultimately the output that's useful in a business context. Hence the fallback to the secondary approach with Clarabel to try to obtain a high-quality solution if the primary approach with OSQP is unsuccessful, along with the important use of the heuristic solution as a baseline and a safeguard, as both are fast to compute.

I understand if you think this methodology isn't suitable for most users of the package. If you'd like, I can easily simplify it and scale it way back to something like MinTrace, where it attempts to solve the problem (we can just use the primary approach) and raises an exception if unsuccessful, rather than warning the user and returning the best estimate, such as the heuristic solution, for them to use and evaluate regardless.

I find that the QP solutions almost always improve upon the heuristic solution when successful, which can be measured by using the heuristic solution as a baseline and calculating metrics like the MSSE or RMSSE. I can put something together that quantitatively shows the differences between the QP approaches and the heuristic approach, and measures the alignment of both with the dense approach in MinTrace at different tolerances.

This is purely anecdotal, but using $x_{0}$ as a warm start helps either to decrease the number of iterations needed to arrive at the same solution, or to make a problem that is supposedly dual infeasible solvable, at least with reduced accuracy, within the maximum number of iterations. In the first case, the overhead of generating $x_{0}$ isn't usually worth the reduced number of iterations in my opinion, but given that the latter case can also occur, along with the use of $x_{0}$ as a baseline and a safeguard, it's something I find useful.

Let me know what you think and I can do the tests and make the changes. 😄
