lec03.Rmd

---
title: "Lecture 3"
author: "DJM"
date: "9 October 2018"
output:
  pdf_document: default
  slidy_presentation:
    css: http://mypage.iu.edu/~dajmcdon/teaching/djmRslidy.css
    font_adjustment: 0
bibliography: booth-refs.bib
---

\newcommand{\cdist}{\rightsquigarrow}
\newcommand{\cprob}{\xrightarrow{P}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\renewcommand{\P}{\mathbb{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\argmin}[1]{\underset{#1}{\textrm{argmin}}}
\newcommand{\F}{\mathcal{F}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\indicator}{\mathbb{1}}
\renewcommand{\bar}{\overline}
\renewcommand{\hat}{\widehat}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\X}{\mathbb{X}}
\newcommand{\Set}[1]{\left\{#1\right\}}
\newcommand{\dom}[1]{\textrm{dom}\left(#1\right)}
\newcommand{\st}{\;\;\textrm{ s.t. }\;\;}

## Why learn convex optimization?

- Many software packages have built-in optimizers
- This is fine in certain cases (I have a function! Throw it in there.)
- Will work "fine" some of the time.
- Immensely suboptimal most of the time

## Questions to answer

- Is it convex?
- Can you take analytic derivatives?
- Can you solve them for zero?
- Is the problem small, low dimensional?
- Do you need to solve it only once?
- Can you find the analytic Hessian?

Built-in optimizers are good if most of these answers are yes, or if they are all no.

In other cases, it's better to write custom code.

Knowing a little about optimization can reveal properties of the solutions.

Most packages do many things

```{r}
head(optim)
```

## Classes of algorithms

- General purpose (Simulated Annealing, Genetic Algorithms)
- First order 
- Second order
- Constrained/unconstrained?
- Linear/non-linear?
- Convex/Non-convex?

We're going to talk about first-order methods for convex problems.

# Convex sets and functions

## Definitions

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE, autodep=TRUE, message=FALSE, warning = FALSE,
                      fig.align='center')
library(tidyverse)
theme_set(theme_minimal(base_family="Times"))
green = '#00AF64'
blue = '#0B61A4'
red = '#FF4900'
orange = '#FF9200'
colvec = c(green,blue,red,orange)
```

Thanks: Much of this material is borrowed/copied from Ryan Tibshirani. 

Set $C$ is *convex* iff
$\forall x,c \in C, \forall t \in [0; 1]\ \ t x + (1-t) y \in C.$

So $C$ is convex iff for any two points in $C$ their segment is also
entirely in $C$.

*Convex combination* of set of points
$x_1, \ldots, x_k \in \mathbb{R}^n$ is
$$\Set{\sum\limits_{i=1}^{k} \Theta_i x_i : \sum\limits_{i=1}^{k} \Theta_i=1,\ \forall i\ \Theta_i \in [0; 1]}.$$

*Convex hull* of any $C \in \mathbb{R}^n$, denoted $conv(C)$ is a union
of all convex combinations of different elements of $C$.

## Some examples

-   Empty set, point, line, segment.

-   Norm ball: $\Set{x : \left\lVert x \right\rVert < r}$.

-   Hyperplane $\Set{x : {a}^\top x = b}$, Affine space
    $\Set{x : A x = b}$.

-   Hyperspace: $\Set{x : {a}^\top x \leq b}$, Polyhedron
    $\Set{x : A x \leq b}$.

-   Cone such that if $x_1, x_2 \in C$ then
    $t_1 x_1 + t_2 x_2 \in C \ \forall t_1, t_2 \geq 0$.
    
## Cones

Set $C$ is a *cone* iff
$\forall t \geq 0,\ x \in C \implies t^\top x \in C$.

Type of cones:

-   Norm cone: $\Set{(x,t) : \left\lVert x \right\rVert \leq t}$.

-   Normal cone for some $C$ and $x \in C$:
    $N_C(x) = \Set{g : {g}^\top x \geq {g}^\top y \ \forall y \in C}$.

-   Positive semidefinite cone $S_+^n = \Set{x \in S^n : x \succeq 0}$,
    $S^n$ is Hilbert space.
    
## Key properties of convex sets

-   Separating hyperplane. $A, B$ are convex, nonempty, disjoint. Then
    $\exists a,b:\ A \subseteq \Set{x : {a}^\top x \leq b}, B \subseteq \Set{x : {a}^\top \geq b}$.

-   Supporting hyperplane. $C$ nonempty, convex, $x_0 \in boundary(C)$.
    Then
    $\exists a:\ C \subseteq \Set{x : {a}^\top x \leq {a}^\top x_0}$.

## Operations preserving convexity

-   Intersection.

-   Scaling, translation. $C$ is convex $\implies a C + b$ is convex.

-   Affine image and preimage. $f(x) = A x + b$, $C$ is convex
    $\implies f(C), f^{-1}(C)$ are convex.

-   Lots more (See [@BoydVandenberghe2004], chapter $2$).

$A_1, \ldots, A_k, B \in \mathbb{S}^n$ -- symmetric matrices. Then
$C = \Set{x \in \mathbb{R}^k : \sum\limits_{i=1}^{k} x_i A_i \preceq B}$.

$f: \mathbb{R}^k \to \mathbb{S}^n$,
$f(x) = B - \sum\limits_{i=1}^{k} x_i A_i$. $C = f^{-1}(S_+^n)$ --
affine preimage of convex cone.


## Convex functions

Function $f: \mathbb{R}^n\to \mathbb{R}$ is *convex* iff
$\dom{f} \subseteq \mathbb{R}^n$ is convex and
$$\forall x,y \in \dom{f}, t \in [0;1] \ \ \ f(t x + (1-t) y) \leq t f(x) + (1-t) f(y)$$

Other definitions:

-   $f$ is *concave* iff $-f$ is convex.

-   $f$ is *strictly convex* iff $\forall t \in (0; 1)$ the inequality
    in definition is strict.

-   $f$ is *strongly convex* with parameter $\tau$ iff
    $f(x) - \frac \tau 2 \left\lVert x \right\rVert_2^2$ is convex.

## Examples

-   $f(x) = \frac 1 x$ is strictly convex, but not strongly.

-   Univariate functions:

    -   $e^{ax}$ is convex $\forall a \in \mathbb{R}$ over $\mathbb{R}$.

    -   $x^a$ convex given $a \geq 1$ or $a \leq 0$ over $\mathbb{R}_+$.

    -   $\log x$ is concave over $\mathbb{R}_+$.

-   Affine ${a}^\top x + b$ is both convex and concave.

-   Quadratic $\frac 1 2 {x}^\top Q x + {b}^\top x + c$ is
    convex if $Q \succeq 0$.

-   $\left\lVert u - A x \right\rVert_2^2$ convex since
    ${A}^\top A \succeq 0$.

-   Norms: all vector norms and most matrix norms are convex.

-   Indicator function is convex. $C$ is a convex set, then
    $I_C(x) = \begin{cases}
                    0,\ x \in C \\
                    \infty, otherwise
                  \end{cases}$.

-   Support function is convex $\forall C$.
    $I_C^*(x) = \max\limits_{y \in C} {x}^\top y$.

## Key properties

-   $f$ is convex iff its epigraph is convex, where
    $epi(f) = \Set{ (x,t) \in \dom{f} \times \R : f(x) \leq t}$.

-   $f$ is convex $\implies$ all its sublevel sets are convex.
    $C_t = \Set{x \in \dom{f} : f(x) \leq t}$. The converse is false.

-   Assume $f$ is differentiable. Then $f$ is convex iff $\dom f$ is
    convex and
    $\forall x,y \in \dom{f}\ \ f(y) \geq f(x) + \nabla {f(x)}^\top (y-x)$.
    Essentially, it means that $f$'s graph is above any tangent plain.

-   Assume $f$ is twice differentiable. $f$ is convex iff $\dom{f}$ is
    convex and $\forall x \in \dom{f}\ \ \nabla^2 f(x) \succeq 0$.
    
## Operations preserving function convexity

-   Nonnegative linear combination.

-   Pointwise maximum. $\forall s \in S\ \ f_s$ is convex
    $\implies f(x) = \max\limits_S f_s(x)$ is also convex.

-   Partial minimum. $g(x,y)$ convex over variables $x, y$; $C$ convex.
    Then $f(x) = \min\limits_{y \in C} g(x,y)$ is also convex. E.g.,
    $f(x) = \max\limits_{y \in C} \left\lVert x-y \right\rVert$ or
    $f(x) = \min\limits_{y \in C} \left\lVert x-y \right\rVert$.
    

# Terminology

## Optimization problem

A convex optimization problem (program) 
\begin{equation}
    \begin{aligned}
    \min\limits_{x \in D} &\quad f(x) \\
    \text{subject to} &\quad g_i(x) \leq 0 & \forall i \in [1:m] \\
                    &\quad A x = b 
    \end{aligned}
\end{equation}
where $f, g_i$ are convex and $D = \dom{f} \cap \dom{g_i}$.

## Terms

\begin{equation}
    \begin{aligned}
    \min\limits_{x \in D} &\quad f(x) \\
    \text{subject to} &\quad g_i(x) \leq 0 & \forall i \in [1:m] \\
                    &\quad A x = b 
    \end{aligned}
\end{equation}

-   $f$ -- criteria or objective function.
-   $g_i$ -- inequality constraints.
-   $x$ is a *feasible point* if it satisfies the conditions, namely
    $x \in D$, $g_i(x) \leq 0$, and $A x = b$.
-   $\min f$ over feasible points points -- *optimal value* $f^*$.
-   If $x$ is feasible and $f(x) = f^*$ then $x$ is an *optimum*
    (solution, minimizer).
-   Feasible $x$ is a *local optimum* if $\exists R > 0$ such that
    $\forall y \in B_R(x)\ \ f(x) \leq f(y)$.
-   If $x$ is feasible and $f(x) \leq f^* + \varepsilon$ then $x$ is
    *$\varepsilon$-suboptimal*.
-   If $x$ is feasible and $g_k(x) = 0$ then $g_k$ is *active* at $x$
    (otherwise inactive).

## Properties

-   Solution set $X_{opt}$ is convex.
-   If $f$ is strictly convex then the solution is unique.
-   For convex optimization problems all local optima are global.
-   The set of feasible points is convex.

## Example: Lasso. 

$\min\limits_\beta \left\lVert y - X \beta \right\rVert_2^2$
subject to $\left\lVert \beta \right\rVert_1 \leq s$.

-   $g_1(\beta) = \left\lVert \beta \right\rVert_1 - s$ -- convex, no
    equality constraints.

-   $X$ is $n \times p$ matrix

    -   If $n \geq p$ and $X$ is full rank then
        $\nabla^2 f(\cdot) = 2 {X}^\mathsf{T} X$ is positive definite
        matrix. The function is strictly convex, therefore the solution
        is unique.

    -   If $p > n$ then $\exists \beta \neq 0$ such that $X \beta = 0$
        $\implies$ multiple solutions.
        

## First Order Condition 

-   convex problem with differentiable $f$

-   a feasible $x$ is optimal iff $\nabla f(x)^{T}( x - y) \ge 0$,
    $\forall$ feasible $y$

-   if unconstrained, the condition reduces to $\nabla f(x)  = 0$

$$\min\limits_x \frac{1}{2}x^{T}Q x + b^{T}x + c, \quad\quad Q \succeq 0$$
- FOC: $\nabla f(x) = Q^{T}x + b = 0$\
- if $Q \succ 0$ $\rightarrow$ $x^{*} = -Q^{-1}b$\
- if $Q$ singular, $b \notin Col[Q] \rightarrow$ no solution\
- if $Q$ singular, $b \in Col[Q] \rightarrow$ $x^{*} = -Q^{*}b + z$ with $z \in null[Q]$


# Useful operations

## Partial optimization

Recall: $h(x) = \min\limits_{y\in C}f(x,y)$ is convex if $f$ is convex, and $C$ is
convex.


\begin{equation}
\begin{aligned}
\min\limits_{x_{1},x_{2}} &\quad f(x_{1},x_{2}) &&\min\limits_{x_{1}}&\quad \tilde{f}(x_{1})\\
\textrm{s.t.}&\quad g_{1}(x_{1}) \le 0 &\Longleftrightarrow &\textrm{s.t.}&\quad 
g_{1}(x_{1}) &\le 0\\
&\quad g_2(x_2) \le 0
\end{aligned}
\end{equation}

where $\tilde{f}(x_1) = \min\Set{f(x_1,x_2): g_2(x_2)\le 0}$.

- The right problem is convex if the left is (and vice versa)

## Transformations

- We can use a monotone increasing transformation
$h: \mathbb{R} \rightarrow \mathbb{R}$:\
$$\quad \min\limits_{x\in C} f(x) \Rightarrow \min\limits_{x\in C} h(f(x))$$\
- We can use a change of variable transformation
$\phi : \mathbb{R}^{n} \Rightarrow \mathbb{R}^{m}$ :\
$$\quad \min\limits_{x \in C} f(x) \Leftrightarrow \min\limits_{\phi(y) \in C} f(\phi(y))$$

__Example:__ Geometric Program

$$\min\limits_{x \in C} f(x) = \sum\limits_{k=1}^{p}\gamma_{k}x_{1}^{a_{k_1}}x_{2}^{a_{k_{2}}} ..x_{n}^{a_{k_{n}}}\quad\quad
\textrm{(posynomial)}$$

- $C$ : involves (convex) inequalities in some form and equalities are affine.\
- We can change above non-convex problem to the following convex problem
by letting $y_{i} = \log x_{i}$ 

## Eliminate equality constraints

\begin{equation}
\begin{aligned}
\min\limits_{x}&\quad f(x) \\
\textrm{s.t.} &\quad g_{i}(x) \le  0\\ &\quad Ax = b
\end{aligned}
\end{equation}

- $x$ feasible means $\exists M : col[M]= null[A]$ and $x_0 \st Ax_{0} = b$ 
- Let $x=My + x_{0}$

Then the following is an equivalent problem:

\begin{equation}
\begin{aligned}
\min\limits_{y}&\quad f(My+x_0) \\
\textrm{s.t.} &\quad g_{i}(My+x_0) \le  0
\end{aligned}
\end{equation}


## Introduce slack variables

\begin{equation}
\begin{aligned}
\min\limits_{x}&\quad f(x) \\
\textrm{s.t.} &\quad g_{i}(x) \leq  0\\ &\quad Ax = b
\end{aligned}
\end{equation}

- Can add equality constraints:

\begin{equation}
\begin{aligned}
\min\limits_{x,s}&\quad f(x) \\
\textrm{s.t.} &\quad g_{i}(x) + s_i =  0\\
&\quad s_i \geq 0\\
&\quad Ax = b
\end{aligned}
\end{equation}

- But this is nonconvex unless $g_i$ are affine

__Relaxation__

We can relax nonaffine
constraints 
$$\min\limits_{x \in C} f(x) \Rightarrow \min\limits_{x \in \tilde{C} } f(x)$$
with $C \subset \tilde{C}$\

- In this case optimum of new problem is smaller or equal to the optimum
of the original problem.


# Standard problems 

## LP (Linear Programs) 

$$\min_{x} c^{T}x$$ with affine constraints

- Basis Pursuit (not an LP)
$\qquad \min\limits_{\beta} \| \beta \|_0$ s.t. $X \beta = y$\
- Above problem can be relaxed to :\
$\qquad \min\limits_{\beta} \| \beta \|_{1}$ s.t. $X \beta = y$.\
- This relaxation can be reformulated to a LP problem:\
$\qquad \min\limits_{\beta, z} 1^{T}z$ s.t.
$z \ge \beta, z \ge -\beta, X\beta = y$
- Dantzig selector\
$\qquad \min\limits_{\beta} \| \beta \|_{1}$ s.t.
$\|X^{\top}(y - X\beta)\|_{\infty} \le \lambda$

## QP (Quadratic program)

Lasso, ridge regression, OLS, Portfolio Optimization

## SDP (Semi-Definite program)

\begin{equation}
\begin{aligned}
\min\limits_{X \in S_{n}} &\quad \tr(C^{T}X)\\
\textrm{s.t.}&\quad \tr{A_{i}^{\top} X} = b_{i}\\
&\quad X \succeq 0
\end{aligned}
\end{equation}

## Conic program

\begin{equation}
\begin{aligned}
\min\limits_{x} &\quad c^\top x\\
\textrm{s.t.}&\quad Ax=b\\
&\quad D(x) + d \in K
\end{aligned}
\end{equation}


$D$ a linear mapping, $K$ a closed convex cone.

- Very similar to an LP

## Relations

$LP \subset QP \subset SOCP \subset SDP \subset CP(Conic Programming)$

# Duality

## Introduction

- Suppose we want to _Lower bound_ our convex program
- Find $B\leq \min_x f(x),\quad x\in C$.

__Example:__

\begin{equation}
\begin{aligned}
\min_{x,y} &\quad x + y\\
\textrm{s.t.}&\quad x+y \geq 2\\
&\quad x,y\geq 0
\end{aligned}
\end{equation}

- What is $B$?

## Harder

__Example:__

\begin{equation}
\begin{aligned}
\min_{x,y} &\quad x + 3y\\
\textrm{s.t.}&\quad x+y \geq 2\\
&\quad x,y\geq 0
\end{aligned}
\end{equation}

- What is $B$?

## Why?

__Example:__

\begin{equation}
\begin{aligned}
\min_{x,y} &\quad x + 3y\\
\textrm{s.t.}&\quad x+y \geq 2\\
&\quad x,y\geq 0
\end{aligned} \Longleftrightarrow
\begin{aligned}
\min_{x,y} &\quad (x + y)+ 2y\\
\textrm{s.t.}&\quad x+y \geq 2\\
&\quad x,y\geq 0
\end{aligned}
\end{equation}

- What is $B$?

## Harderer

__Example:__

\begin{equation}
\begin{aligned}
\min_{x,y} &\quad px + qy\\
\textrm{s.t.}&\quad x+y \geq 2\\
&\quad x,y\geq 0
\end{aligned}
\end{equation}

- What is $B$?

## Solution

\begin{equation}
\begin{aligned}
\min_{x,y} &\quad px + qy\\
\textrm{s.t.}&\quad x+y \geq 2\\
&\quad x,y\geq 0
\end{aligned} \Longleftrightarrow
\begin{aligned}
\min_{x,y} &\quad px + qy\\
\textrm{s.t.}&\quad ax+ay \geq 2a\\
&\quad bx, cy\geq 0\\
&\quad a,b,c \geq 0
\end{aligned}
\end{equation}

- Adding implies
$$ (a+b)x+(a+c)y\geq 2a$$ 
- Set $p=(a+b)$ and $q=(a+c)$ we get that $B=2a$

## Better

- We can make this lower bound bigger by maximizing:

\begin{equation}
\begin{aligned}
\max_{a,b,c} &\quad 2a\\
\textrm{s.t.}&\quad a+b = p\\
&\quad a+c=q\\
&\quad a,b,c \geq 0
\end{aligned}
\end{equation}

- This is the __Dual__ of the __Primal__ LP
\begin{equation}
\begin{aligned}
\min_{x,y} &\quad px + qy\\
\textrm{s.t.}&\quad x+y \geq 2\\
&\quad x,y\geq 0
\end{aligned}
\end{equation}

- Note that the number of Dual variables (3) is the number of Primal constraints

## General LP

\begin{equation}
\begin{aligned}
&\quad \textrm{(P)}\\
\min_x &\quad c^\top x\\
\textrm{s.t} &\quad Ax=b\\
&\quad Gx\leq h
\end{aligned}
\quad\quad
\begin{aligned}
&\quad \textrm{(D)}\\
\max_{u,v} &\quad -b^\top u - h^\top v  \\
\textrm{s.t} &\quad -A^\top u -G^\top v=c\\
&\quad v \geq 0
\end{aligned}
\end{equation}

__Exercise__

## Alternate derivation
\begin{equation}
\begin{aligned}
\min_x &\quad c^\top x\\
\textrm{s.t} &\quad Ax=b\\
&\quad Gx\leq h
\end{aligned}
\end{equation}

- Suppose that some $x$ is feasible.
- Then, for that $x$,
$$
c^\top x \geq c^\top x + u^\top (Ax-b) + v^\top (Gx-h) =: L(x,u,v).
$$
as long as $v\geq 0$ and $u$ is anything.
- We call $L(x,u,v)$ the __Lagrangian__.

Now, suppose $C$ is the feasible set, and $f^*$ is the optimum
$$
f^* \geq \min_{x\in C} L(x,u,v) \geq \min_x L(x,u,v) =: g(u,v)
$$

- We call $g(u,v)$ the __Lagrange Dual Function__
- Maximizing $g(u,v)$ is the Dual problem.

## Weak duality

Consider the generic (primal) convex program
\begin{equation}
\begin{aligned}
\min_x &\quad f(x)\\
\textrm{s.t} &\quad l_i(x)=0\\
&\quad h_i(x)\leq 0
\end{aligned}
\end{equation}

- For feasible $x$, $v\geq 0$
$$
f(x) \geq f(x) + u^\top h(x) + v^\top l(x) \geq \min_x L(x,u,v) = g(u,v).
$$
- Therefore,
$$
f^* \geq \max_{\forall u, v\geq 0} g(u,v) = g^*.
$$
- This is __weak duality__.
- Note that the Dual Program is always convex even if P is not (pointwise max of linear functions)

## Strong duality

$$
f^* = g^*
$$

- Sufficient conditions for strong duality: __Slater's conditions__
- If _P_ is convex and there exists $x$ such that for all $i$, $h_i(x) < 0$ (strictly feasible), then we have strong duality. (Extension: strict inequalities for $i$ when $h_i$ not affine.)
- Sufficient conditions for strong duality of an LP: strong duality if either _P_ or _D_ is feasible. (Dual of _D_ = _P_)

## Example

Dual for Support Vector Machine

\begin{equation}
\begin{aligned}
&\quad \textrm{(P)}\\
\min_{\xi,\beta,\beta_0} &\quad \frac{1}{2}\norm{\beta}_2^2 +C\indicator^\top\xi\\
\textrm{s.t} &\quad \xi_i \geq 0\\
&\quad y_i(x_i^\top \beta+\beta_0)\geq 1-\xi_i
\end{aligned}
\quad\quad\quad
\begin{aligned}
&\quad \textrm{(D)}\\
\max_{w} &\quad -\frac{1}{2}w^\top\tilde{X}^\top\tilde{X}w +\indicator^\top w\\
\textrm{s.t} &\quad 0\leq w \leq C\indicator\\
&\quad w^\top y = 0
\end{aligned}
\end{equation}
where $\tilde{X} = \textrm{diag}(y)X$.

- $w=0$ is Dual feasible. (Don't need strict, because affine)
- We have strong duality by Slater's conditions.

## KKT conditions

\begin{equation}
\begin{aligned}
\min_x &\quad f(x)\\
\textrm{s.t} &\quad l_i(x)=0\\
&\quad h_i(x)\leq 0
\end{aligned}
\end{equation}


1. Stationarity: $0 \in \partial(f(x)+v^\top h(x)+u^\top l(x))$: For some pair $(u,v)$, $x$ minimizes the Lagrangian.
2. Complementary slackness: $v_ih_i(x)=0 \,\,\, , \forall i$
3. Primal feasibility: $h_i(x) \le 0 \,\,,\,\, l_i(x)=0$
4. Dual feasibility: $v \ge 0$

__Theorem__: Solutions $x^*$ and $(u^*,v^*)$ Primal-Dual optimal and $f^*=g^*$, then they satisfy the KKT conditions.

__Theorem__: Solutions $x^*$ and $(u^*,v^*)$ that satisfy the KKT conditions are Primal-Dual optimal.


__Example (SVM cont.)__ (eliminated the $v$ from Dual problem)

1.  Stationarity: $w^\top y=0 \,\,,\,\, \beta=w^\top \tilde{X} \,\,,\,\, w=C1-v$
2.  CS: $v_i\zeta_i=0 \,\,,\,\, w_i(1-\zeta_i-y_i(x_i^\top \beta+\beta_0))=0$
3. Clear.
4. Clear.

## Constraints and Lagrangians

When are the two following forms equivalent?

constrained form (C): \[\begin{aligned}
        \min f(x)\\
        s.t. \,\, h(x) \le t
    \end{aligned}\]

Lagrangian form (L): $$\begin{aligned}
    \min f(x)+\lambda h(x)
    \end{aligned}$$

When C is strictly feasible, strong duality holds. So there exists
$\lambda$ such that for each $x$ that solves C those $x$ minimize L.

Now, if $x^*$ solves L, then KKT condition for C hold by taking
$t=h(x^*)$ and so $x^*$ is a solution of C.


# Algorithms

## First order methods

For simplicity, consider unconstrained optimization 
\begin{equation}
\min_x f(x)
\end{equation}
assume $f$ is convex and differentiable

__Gradient descent__

-   Choose $x^{(0)}$
-   Iterate $x^{(k)} = x^{(k-1)} - t_k\nabla f(x^{(k-1)})$
-   Stop sometime

__Why?__ 

- Taylor expansion
\begin{equation}\begin{aligned}
f(y) \approx f(x) + \nabla f(x)^{\top}(y-x) + \frac{1}{2}(y-x)^\top H (y-x)
\end{aligned}\end{equation}
- replace $H$ with $1/t$
- Choose $y=x^{+}$ to minimize the quadratic approximation 

## What t?

What to use for $t_k$? 

__Fixed__

- Only works if $t$ is exactly right 
- Usually does not work

__Sequence__ 

\begin{equation}\begin{aligned}
t_k \quad s.t.\sum_{k=1}^{\infty} t_k = \infty ,\quad
              \sum_{k=1}^{\infty} t^{2}_k < \infty
\end{aligned}\end{equation}

__Backtracking line search__

At each iteration choose best $t$

1. Set $0 <\beta < 1 , 0 < \alpha <\frac{1}{2}$
2. At each $k$, while
\begin{equation}f\left(x^{(k)} - t\nabla f(x^{(k)})\right) > f(x^{(k)}) - \alpha t \norm{ f(x^{(k)})}^{2}_{2}\end{equation}
set $t = \beta t$ (shrink t)
3. $x^{t+1} = x - t \nabla f(x_t)$ 


__Exact line search__

- Backtracking approximates this
- At each $k$, solve
\[
t = \arg\min_{s >= 0} f( x^{(k)} - s f(x^{(k-1)})) 
\] 
- Usually can't solve this.

## Convergence

If $\nabla f$ is Lipschitz, use fixed $t$

1. GD converges at rate $O(1/k)$
2. $\epsilon$-optimal in $O(1/\epsilon)$ iterations

If $f$ is strongly convex as well

1. GD converges at rate $O(c^k)$ ($0<c<1$).
2. $\epsilon$-optimal in $O(\log(1/\epsilon))$ iterations

We call this second case "linear convergence" because it's linear on the $\log$ scale.

## Stochastic Gradient Descent

- Workhorse in Neural Networks
- Suppose $$f(x) = \frac{1}{m} \sum_{i=1}^m f_i(x)$$
- GD would use 
$$x^{(k)} = x^{(k-1)} - \frac{t}{m} \sum_{i=1}^m \nabla f_i(x^{(k-1)})$$
- SGD uses
$$x^{(k)} = x^{(k-1)} - t \nabla f_{i_k}(x^{(k-1)})$$
$i_k\in\{1,\ldots,m\}$ is an index chosen at time $k$.

Standard choices for $i_k$

1. Draw $i_k\sim \textrm{Unif}(\{1,\ldots,m\})$.
2. Set $i_k=1,\ldots,m,1,\ldots,m,1,\ldots,m,\ldots$

Most frequently update "mini-batches". Why?

1. Convergence rates are slower for SGD than GD.
2. Batches improves this.
3. SGD is computationally more efficient in per-iteration cost, memory
4. Efficient when far from the optimum, less good close to the optimum (but statistical properties don't care when you're close)

## Subgradient descent

What happens when $f$ isn't differentiable?

A __subgradient__ of convex $f$ at a point $x$ is any $g$ such that
$$f(y) \geq f(x) + g^\top (y-x)\quad\quad \forall y$$

- Always exists
- $f$ differentiable $\Rightarrow g=\nabla f(x)$ (unique)
- Just plug in the subgradient in GD
- $f$ Lipschitz gives $O(1/\sqrt{k})$ convergence

## Example (LASSO)

$$\min_{\beta} \frac{1}{2}\norm{y - X\beta}_2^2 + \lambda\norm{\beta}$$

- Subdifferential (the set of subgradients)
$$\begin{aligned}
    g(\beta) &= -X^T(y - X\beta) + \lambda\partial\norm{\beta}_1 \\
    & =-X^\top(y - X\beta) + \lambda v \\
    v_i &=
    \begin{cases}
        \{1\}      & \quad \text{if } \beta_i > 0\\
        \{1\}      & \quad \text{if } \beta_i < 0\\
        [-1,1]     & \quad \text{if } \beta_i = 0\\
    \end{cases}
\end{aligned}$$

So $sign(\beta) \in \partial\norm{\beta}_1$

- From KKT (stationarity)
$$\begin{aligned}
    X^T(y - X\beta) &=\lambda v   \\
    X_i^{T}(y - X\beta) &= \lambda v_i \\
    \end{aligned}$$
- Therefore
$$\begin{aligned}
    |X_i^T(y - X\beta)|=\lambda  &\Rightarrow \beta_i \neq 0 \\
    |X_i^T(y - X\beta)| < \lambda &\Rightarrow \beta_i = 0
\end{aligned}$$ 
- Checking the KKT conditions underlies the LARS algorithm

## Proximal gradient descent 

Suppose $f$ is __decomposable__
$$f(x) = g(x) + h(x)$$

- We assume $g$ is convex and differentiable, but $h$ is convex only
- Approximate $g$ only (via Taylor with approximate Hessian)
$$\begin{aligned}
x^{+} &= \arg\min_z g(x) + \nabla g(x)^\top (z-x) + \frac{1}{2t}\norm{z-x}_2^2 + h(z)\\
    &= \arg\min_z \frac{1}{2t}\norm{z-(x - t\nabla g(x))}_2^2 + h(z) \\
\end{aligned}$$
- The first part keeps us close to the gradient update for $g$, but still minimize over $h$.
- Define
$$\textrm{prox}_t(x) := \arg\min_z \frac{1}{2t}\norm{x-z}_2^2 + h(z)$$
- The prox operator depends only on $h$, not $g$
- Easily solvable for many $h$
- Update becomes
$$x^{(k)} = \textrm{prox}_{t_k}\left(x^{(k-1)} - t_k\nabla g(x^{(k-1)})\right)$$

## Example (LASSO)

$$\min_\beta \underbrace{\frac{1}{2}\norm{y-X\beta}_2^2}_g + \underbrace{\lambda\norm{\beta}_1}_h$$

$$\begin{aligned}
\textrm{prox}_t(\beta) &= \arg\min_z \norm{z-\beta}_2^2 + \lambda\norm{z}_1\\
&=: S_{\lambda t}(\beta)\\
S_{\tau}(\beta) &= \textrm{sign}(\beta)(|\beta|-\tau)_+
\end{aligned}$$

So the update becomes:

1. $\beta \leftarrow \beta + t X^\top(y-X\beta)$
2. $\beta \leftarrow S_{\lambda t}(\beta)$

This is called "Iterative soft thresholding" (ISTA)

## Projected gradient descent

$$
\min_{x\in C} f(x)
$$

- Incorporate the constraint into the objective using the indicator function:
$$
I_C(x) = \begin{cases} 0 & x \in C\\ +\infty & x \not\in C \end{cases}.
$$
- Call the indicator $h$, and use proximal GD
- Now $\textrm{prox}_t(x) = \arg\min_{z\in C} \norm{z-x}_2^2$
- This is just the Euclidean projection onto $C$


So the update becomes:

1. Use GD on $f$
2. Project onto $C$

# Dual methods

## Dual (sub)gradient ascent/descent

Primal problem:
$$\min_x f(x) \st Ax=b.$$ 

Dual is
$$\max_u -f^*(-A^\top u)+b^\top u,$$ 
where $f^*$ is the __conjugate__ of $f$. 

- Let
$g(u):=f^*(-A^\top u)-b^\top u$. 
- Our goal is to minimize $g(u)$. 
- The subdifferential is given by
$$\partial g(u)= A\partial f^*(-A^\top u)-b^\top u,$$ but if
$x\in\arg\min_z L(z,u)$ then $\partial g(u)={Ax-b}$. 
- We may solve this as
follows:
    1. guess initial $u^{(0)}$ 
    2. choose $x^{(k)}\in\arg\min_x f(x)+(u^{(k-1)})^\top Ax$
    3. Update $u^{(k)} = u^{(k-1)}+t_k\left(Ax^{(k)}-b\right)$

Formally: if $f$ is strictly convex then

1.  conjugate $f^*$ is differentiable
2.  procedure is dual gradient ascent
3.  $x^{(k)}$ is the unique minimizer

We can choose $t_k$ as before and apply proximal methods (or
acceleration).

## Decomposable dual

 \[ \min_x \sum_{i=1}^m f_i(x_i) \st Ax=b \]
  standard minimization decomposes into $x^+\in\argmin_x\sum_{i=1}^m f_i(x_i) + u^\top Ax$, which is equivalent to solving separately for each $x_i$:
  \[ x^+_i\in\arg\min_{x_i} f_i(x_i)+u^\top A_i x_i. \]
So we can iterate:
$$
\begin{aligned}
x_i^{(k)}&\in\arg\min_{x_i}f(x_i)+(u^{(k-1)})^TA_ix_i\\ 
u^{(k)} &= u^{(k-1)}+t_k\left(\sum_{i=1}^m A_ i x_i^{(k)}-b \right)
\end{aligned}
$$

- Strong duality holds in this particular example since we have no inequality constraints.
- If the constraints are inequalities, i.e. $Ax\leq b$, we make a slight modification to $u^{(k)}$:
\[
u^{(k)} = \left(u^{(k-1)}+t_k\left(\sum_{i=1}^m A_i x_i^{(k)}-b \right)\right)_+
\]
- Updates can be parallelized.

## Augmented Lagrangian

- For for dual ascent to work ($\to g^*$), we need $f$ "nice" 
- To achieve strong duality (primal optimality) the program must also satisfy one of the conditions mentioned earlier (e.g. Slater's condition).


We can alter the problem to enforce strong convexity

- Use $\min_xf(x)+\frac{\rho}{2}||Ax-b||_2^2 \st Ax=b$.  
- The objective is strongly convex if $A$ has full column rank.  
- Dual gradient ascent then becomes
    $$\begin{aligned}
      x^{(k)}&=\arg\min_x f(x)+(u^{(k-1)})^\top Ax+\frac{\rho}{2}\norm{Ax-b}_2^2\\ 
    u^{(k)} &= u^{(k-1)}+\rho(Ax^{(k-1)}-b) 
    \end{aligned}$$
- Replacing the step size $t_k$ with $\rho$ gives better convergence properties than the original DGA.  
- But introducing the norm kills decomposability (if we had it), no parallelization
- $\rho$ balances primal feasibility with a small objective 
- larger $\rho$ deemphasizes objective but forces $x^{(k)}$ toward primal feasiblity

## Alternating Direction Method of Multipliers (ADMM)

\[\min_{x,z}f(x)+g(z) \st Ax+Bz=c\]
Add $\frac{\rho}{2}\norm{Ax+Bz-c}_2^2$ to the objective, penalizing infeasibility:
\[
L_\rho(x,z,u)=f(x)+g(z)+u^T(Ax+Bz-c) + \frac{\rho}{2}\norm{Ax+Bz-c}_2^2
\]

Iteratively update:
$$\begin{aligned}
  x^{(k)}&=\arg\min_xL_\rho(x,z^{(k-1)},u^{(k-1)})\\
z^{(k)}&=\arg\min_zL_\rho(x^{(k-1)},z,u^{(k-1)})\\
u^{(k)} &= u^{(k-1)}+\rho(Ax^{(k-1)}+Bx^{(k-1)}-b)\\
\end{aligned}$$


Properties of ADMM (some of which do not require $A$ and $B$ to be full
rank):

1.  $Ax^{(k)}+Bz^{(k)}-c\to 0$ as $k\to\infty$ (primal feasibility)
2.  $f^{(k)}+g^{(k)}\to f^*+g^*$ (primal optimality)
3.  $u^{(k)}\to u^*$ (dual solution)
4.  doesn't necessarily give $x^{(k)}\to x^*$ and $z^{(k)}\to z^*$

- The exact convergence rate is unknown, but empirically seems close to
$O(1/\epsilon)$.


## Example (LASSO)

$$\min_\beta \frac{1}{2}\norm{y-X\beta}_2^2 +\lambda\norm{\alpha}_1 \st \alpha=\beta$$
ADMM update (compare with ridge regression): $$\begin{aligned}
  \beta^{(k)}&=\left(X^TX+\rho I\right)^{-1}\left(X^Ty+\rho(\alpha^{(k-1)}-w^{(k-1)})\right)\\ 
  \alpha^{(k)}&=S_{\lambda/\rho}(\beta^{(k)}+w^{(k-1)})\\
  w^{(k)}&=w^{(k-1)}+\beta^{(k)}-\alpha^{(k)}\\\end{aligned}$$
  
- There's an "equivalence" between $\beta$ and $\alpha$. You use $\alpha$ as the solution.

Issues with ADMM:

-   How to choose $\rho$.
-   Different ADMM formulations of the problem may have different
    convergence properties.
    
## Coordinate descent

- Works well with LASSO.
- If $f(x)=g(x)+\sum_{i=1}^nh_i(x_i)$ where $g$ is convex and differentiable, $h$ merely convex
- Then:
1. Guess $x^{(0)}$.  
2. Update according to:
$$\begin{aligned}
x_1^{(k)}&\in \arg\min_{x_1} f(x_1,x_2^{(k-1)},...,x_n^{(k-1)})\\
x_2^{(k)}&\in \arg\min_{x_2} f(x_1^{(k)},x_2,...,x_n^{(k-1)})\; \text{ (minimize over whole vector or block)}\\
&\quad \vdots
\end{aligned}$$

## Example (LASSO)

This is what `GLMNET` does
$$\min_\beta \frac{1}{2}\norm{y-X\beta}_2^2 +\lambda\norm{\beta}_1 $$

- $\norm{\beta}_1 = \sum_{i=1}^p |\beta_i|$
- Iterate over $p$ 
$$\beta_i\leftarrow S_{\lambda/\norm{X_i}_2^2}\left(\frac{X_i^T(y-X_{-i}\beta_{-i})}{X_i^\top X_i}\right)$$
- Comes from taking derivative of objective wrt $\beta_i$

## References