Functional Analysis
Fixed-point theorems
The basic logic of fixed-point theorems: for an existence problem, construct a metric space and a mapping on it such that the existence problem is equivalent to the mapping having a fixed point. Once the mapping is shown to have a fixed point, the original existence problem is proved.
Uses of compactness
- Proving existence
- "Imitating" finite-dimensional Euclidean space inside infinite-dimensional spaces
Is a compact set meant to mimic the bounded closed sets of Euclidean space? Compact = relatively compact + closed.
Manifolds
- Concept: the generalization of curves and surfaces to higher-dimensional spaces; e.g., a surface in three-dimensional space is a two-dimensional manifold.
Support
- Concept: the subset of the domain on which a function is nonzero; the support of a probability distribution is the set on which its probability density is nonzero.
Mathematical Analysis
Lipschitz continuity
- If there exists a constant \(K\) such that any two points \(x_1,x_2\) in the domain satisfy \(|f(x_1)-f(x_2)|\le K|x_1-x_2|\), then \(f\) is called Lipschitz continuous. This property bounds the absolute value of the derivative of \(f\) by \(K\), i.e. it limits the maximum local variation of the function.
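A quick numeric sketch (assuming \(f=\sin\), which is 1-Lipschitz): lower-bound the Lipschitz constant by sampling difference quotients.

```python
import numpy as np

# Sketch: lower-bound the Lipschitz constant of f by sampling difference
# quotients |f(x1)-f(x2)| / |x1-x2|.  f = sin is 1-Lipschitz, so the
# estimate approaches (but never exceeds) K = 1.
rng = np.random.default_rng(0)
f = np.sin
x1, x2 = rng.uniform(-10, 10, size=(2, 100_000))
quotients = np.abs(f(x1) - f(x2)) / np.abs(x1 - x2)
print(f"estimated K >= {quotients.max():.4f}")  # close to 1
```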
Statistical learning theory
Introduction
Determine how well a model performs on unseen data
Preliminary
Markov Inequality
For a nonnegative random variable \(X\) and any \(a>0\), \(\mathbb{P}(X\ge a)\le\frac{\mathbb{E}[X]}{a}\); the tail bound decays in the order of \(\mathcal{O}(\frac{1}{\text{deviation}})\).
Chebyshev's Inequality
For a random variable \(X\) with finite variance and any \(a>0\), \(\mathbb{P}(|X-\mathbb{E}[X]|\ge a)\le\frac{\mathrm{Var}(X)}{a^2}\); the tail bound decays in the order of \(\mathcal{O}(\frac{1}{\text{deviation}^2})\).
Generic Chernoff's Bound
Hoeffding's Inequality
Hoeffding's lemma
Hoeffding's Inequality
Hoeffding's inequality is useful for bounding the probability that the average of bounded random variables deviates from its true expectation, i.e. the gap between an empirical value and the true expectation.
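A minimal simulation sketch (assuming i.i.d. Bernoulli(0.5) samples, so the variables lie in \([0,1]\)) comparing the empirical deviation probability with the two-sided Hoeffding bound \(2e^{-2nt^2}\):

```python
import numpy as np

# Sketch: compare the empirical probability that the mean of n
# Bernoulli(0.5) variables deviates from 0.5 by >= t with Hoeffding's
# two-sided bound 2*exp(-2*n*t**2).
rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 20_000
means = rng.binomial(1, 0.5, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)
print(f"empirical: {empirical:.4f}  Hoeffding bound: {hoeffding:.4f}")
```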
McDiarmid's Inequality
- A concentration inequality.
- This bound is useful because, once we prove that an algorithm is \(\beta\)-stable, this bounded-differences property holds for a specific function (the defect below), so McDiarmid's inequality applies.
PAC
Check here for PAC, VC, uniform bound and others.
PAC Learning (agnostic PAC learnable)--finite hypothesis class
- Consider a simple linear classifier with 2 weights \(\vec{w} = (w_1;w_2)\), stored as 32-bit floats. This implies that the hypothesis class is finite with \(|\mathcal{H}| = 2^{64}\).
- This theorem applies to finite hypothesis classes: for a hypothesis class \(\mathcal{H}\), it gives the sample complexity required to make the generalization gap smaller than \(\epsilon\) with probability at least \(1-\delta\) (see the numeric sketch after this list).
- Once a hypothesis class is PAC learnable, with high probability the training set is \(\epsilon\)-representative.
- Note: if \(\mathcal{H}\) is PAC learnable, the sample-complexity function \(m_{\mathcal{H}}\) satisfying the requirements in the definition of PAC learnability is not unique.
- Finite classes are PAC learnable, and also agnostic PAC learnable.
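A numeric sketch of the finite-class sample complexity (assuming the standard agnostic bound \(m\ge\frac{2\log(2|\mathcal{H}|/\delta)}{\epsilon^2}\); the exact constants vary by source) for the 2-weight, 32-bit example above:

```python
import math

# Sketch: sample complexity of a finite hypothesis class under the
# agnostic PAC bound m >= 2*log(2|H|/delta) / eps**2 (constants vary
# by source).  |H| = 2**64 for two 32-bit weights.
H_size = 2**64
eps, delta = 0.01, 0.05
m = 2 * math.log(2 * H_size / delta) / eps**2
print(f"samples needed: {math.ceil(m):,}")
```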
Uniform convergence
- Formalizes that, uniformly over all hypotheses in \(\mathcal{H}\), the empirical risk is close to the true risk. This guarantees that ERM works.
VC dimension
Measures the complexity of a hypothesis class by something other than its cardinality.
Shattering
VC Dimension
Informally, the VC dimension is the maximum number of distinct points such that, for every possible labeling of those points, some hypothesis in \(\mathcal{H}\) achieves zero classification error (see the shattering sketch after this list).
- A hypothesis class with infinite VC dimension is not PAC learnable.
- There exist hypothesis classes with uncountable cardinality but finite VC dimension.
- For every two hypothesis classes if \(\mathcal{H}_0 \subset \mathcal{H}\) then \(VCdim(\mathcal{H}_0) \leq VCdim(\mathcal{H})\).
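A small brute-force sketch of shattering (assuming the class of 1-D threshold classifiers \(h_a(x)=\mathbb{1}[x\ge a]\), whose VC dimension is 1; `shatters` is a helper written for this demo):

```python
import itertools
import numpy as np

# Sketch: check whether 1-D threshold classifiers h_a(x) = 1[x >= a]
# shatter a point set, trying candidate thresholds around the points.
def shatters(points):
    points = np.sort(np.asarray(points, dtype=float))
    cands = np.concatenate(([points[0] - 1],
                            (points[:-1] + points[1:]) / 2,
                            [points[-1] + 1]))
    achievable = {tuple((points >= a).astype(int)) for a in cands}
    needed = set(itertools.product([0, 1], repeat=len(points)))
    return needed <= achievable

print(shatters([0.0]))       # True  -> VCdim >= 1
print(shatters([0.0, 1.0]))  # False -> the labeling (1, 0) is unrealizable
```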
Occam's bound
The PAC bound can be treated as Occam's bound with a uniform prior.
Occam's bound will put a distribution over the countably infinite hypothesis class \(\mathcal{H}\) that is independent of dataset \(S\) we will receive. In doing so we will get bounds on the generalization gap that no longer depend on the size of the hypothesis class, \(|\mathcal{H}|\). These bounds now become variable depending on how we weigh each individual hypothesis \(h\), i.e. \(P(h)\).
Regularizers assign higher prior probability to \(\vec{w}\) near the origin and thus yield a tighter bound; the choice of prior does not influence the algorithm (the loss). Specifically, when an \(\ell_2\) regularization term is added to the learning algorithm, it corresponds to such a prior and adds strong convexity to the loss function.
PAC-Bayes bound (McAllester): with probability at least \(1-\delta\), for every posterior \(Q\), \(\mathbb{E}_{h\sim Q}R[h]\le \mathbb{E}_{h\sim Q}\hat{R}_S[h]+\sqrt{\frac{D(Q\|P)+\log\frac{n}{\delta}}{2(n-1)}}\), where \(D(Q\|P)\) is the KL divergence, which serves as a complexity measure.
With a good posterior that is close to the prior, the KL divergence becomes smaller and our bound tighter. Even so, the bound may be tight only for hypotheses we do not care about, e.g. tight with respect to the prior \(P\) on hypotheses.
Note that the posterior \(Q\) is what one obtains after putting the prior \(P\) on hypotheses and then seeing the data.
Why posterior?
Different choices of prior and posterior can be made, each resulting in a new bound, without touching the algorithm.
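A sketch of evaluating such a bound (assuming a spherical Gaussian prior \(P=\mathcal{N}(0,\sigma^2 I)\) and posterior \(Q=\mathcal{N}(w,\sigma^2 I)\), for which the KL divergence has the closed form \(\frac{\|w\|^2}{2\sigma^2}\); the numbers are illustrative):

```python
import numpy as np

# Sketch: a McAllester-style PAC-Bayes bound with Gaussian prior/posterior.
# KL( N(w, s^2 I) || N(0, s^2 I) ) = ||w||^2 / (2 s^2).
def pac_bayes_bound(emp_risk, w, sigma, n, delta=0.05):
    kl = np.dot(w, w) / (2 * sigma**2)
    return emp_risk + np.sqrt((kl + np.log(n / delta)) / (2 * (n - 1)))

w = np.array([0.3, -0.2])  # hypothetical learned weights
print(pac_bayes_bound(emp_risk=0.10, w=w, sigma=1.0, n=10_000))
# Weights nearer the origin (smaller KL) give a tighter bound.
```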
The dropout PAC-Bayes is a lower bound on the PAC-Bayes bound that becomes tight when the dropout factor is 0.
Stability, Generalization
PAC learning and Occam's bound work as algorithm-agnostic bounds.
Stability
Hint: a small change in the training data does not change the predictions much.
Definition: uniform stability
An algorithm \(\mathcal{A}\) is \(\beta\)-uniformly stable if, for every dataset \(S\) and every dataset \(S^{i,z}\) obtained by replacing the \(i\)-th element of \(S\) with \(z\), \(\sup_{z'}|\ell(h_S,z')-\ell(h_{S^{i,z}},z')|\le\beta\). An algorithm with this property produces hypotheses whose loss \(\ell\) is not drastically affected by perturbing the dataset in this manner.
ERM with regularization is \(\beta\)-stable.
- If we perturb the data by a single element, the hypotheses learned by \(\mathcal{A}\) become arbitrarily close for large \(n\).
SGD is stable
- It holds for a finite number of steps \(T\)
Defect:
\(D[h_S]=R[h_S]-\hat{R}_S[h_S]\)
Defect \(D[h_S]\) for a hypothesis \(h_S\) derived from an algorithm after seeing the dataset \(S\) is defined as the difference between the population risk and the empirical risk.
Its expectation is not zero.
The expectation value of defect can be bounded under certain conditions.
If \(\mathcal{A}\) is a \(\beta\)-uniformly stable algorithm, then \(-\beta\le \mathbb{E}[D[h_S]]\le\beta\).
Let \(\mathcal{A}\) be a \(\beta\)-uniformly stable learning algorithm with respect to a loss function \(\ell:\mathcal{Y\times Y}\rightarrow [0,M]\). The absolute difference of the defect calculated on a dataset \(S\) and on a perturbed version of the dataset \(S^{i,z}\) is bounded by \(|D[h_S]-D[h_{S^{i,z}}]|\le 2\beta +\frac{M}{n}\).
For a \(\beta\)-uniformly stable algorithm, the relationship between the empirical and the population risk is \(\mathbb{E}[R[h_S]]\le\mathbb{E}_S[\hat{R}_S[h_S]]+\beta\).
- It's a bound on the expectation value of the population risk. But this bound does not hold for all possible \(h_S\).
For a \(\beta\)-uniformly stable algorithm \(\mathcal{A}\) with respect to a loss function \(\ell:\mathcal{Y\times Y}\rightarrow [0,M]\) and a hypothesis \(h_S\) with \(|S|=n\). The relationship between the empirical and the population risk holds with probability \(1-\delta\): \(R[h_S]\le \hat{R}_S[h_S]+\beta +(n\beta+\frac{M}{2})\sqrt{\frac{2\log\frac{2}{\delta}}{n}}\).
- The last term comes from a concentration inequality (McDiarmid's).
- As \(n\) goes up, this bound becomes less tight unless \(\beta\) decays faster than \(\frac{1}{\sqrt{n}}\), since the last term scales like \(n\beta/\sqrt{n}\) (see the numeric sketch below).
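A quick numeric sketch of the high-probability bound above (assuming \(\beta=c/n\), a typical rate for regularized ERM; \(c\), \(M\) and \(\delta\) are illustrative values):

```python
import numpy as np

# Sketch: the beta-uniform-stability generalization gap
#   beta + (n*beta + M/2) * sqrt(2*log(2/delta)/n),
# evaluated with beta = c/n (typical for regularized ERM).
def stability_gap(n, c=1.0, M=1.0, delta=0.05):
    beta = c / n
    return beta + (n * beta + M / 2) * np.sqrt(2 * np.log(2 / delta) / n)

for n in [100, 1_000, 10_000, 100_000]:
    print(f"n={n:>7}: gap <= {stability_gap(n):.4f}")
# With beta ~ 1/n the gap shrinks like 1/sqrt(n); with constant beta the
# n*beta/sqrt(n) term would grow instead.
```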
Convex Optimization
Book Convex Optimization, Convex Optimization: Algorithms and Complexity
Terminology | Definition |
---|---|
Convex function: original | \(f(\theta x+(1-\theta)y)\le \theta f(x)+(1-\theta)f(y)\) |
Convex function: 1st order (once differentiable) | \(f(y)\ge f(x)+\nabla f(x)^T (y-x)\) |
Convex function: 2nd order (twice differentiable) | \(\nabla^2f(x)\succeq0\), the eigenvalues are nonnegative |
\(L\)-Lipschitz | \(\|f(x)-f(y)\|\le L\|x-y\|\) |
\(\beta\)-smooth: \(\beta\)-Lipschitz gradient | \(\|\nabla f(x)-\nabla f(y)\|\le \beta\|x-y\|\Rightarrow \nabla^2f(x)\preceq\beta\mathrm{I}\) |
\(\alpha\)-strongly convex (with Lipschitzness, this mostly constrains the domain) | \(f(x)-f(y)\le\nabla f(x)^T (x-y)-\frac{\alpha}{2}\|x-y\|^2\Rightarrow \nabla^2f(x)\succeq\alpha\mathrm{I}\) |
Optimizer | Condition | Convergence rate | Optimal step size | Sub-optimality gap | Bound on the gap |
---|---|---|---|---|---|
GD after \(T\) steps | \(L\)-Lipschitz convex | \(\mathcal{O}(\frac{1}{\sqrt{T}})\) | \(\gamma=\frac{\|x_1-x^*\|_2}{L\sqrt{T}}\) | \(f(\frac{1}{T}\sum\limits_{k=1}^{T}x_k)-f(x^*)\) | \(\le\frac{\|x_1-x^*\|L}{\sqrt{T}}\), the initial point matters |
GD | \(\beta\)-smooth + convex | \(\mathcal{O}(\frac{1}{T})\) | \(\gamma=\frac{1}{\beta}\), constant and independent of \(T\) | \(f(x_k)-f(x^*)\); note the averaging over iterates can be dropped | \(\le\frac{2\beta\|x_1-x^*\|^2}{k-1}\), where \(k\) is the step index; the bound depends on the initial point |
Projected subGD after \(T\) steps | \(\alpha\)-strongly convex + \(L\)-Lipschitz | \(\mathcal{O}(\frac{1}{T})\) | \(\gamma_k=\frac{2}{\alpha (k+1)}\), diminishing at every step | \(f(\sum\limits_{k=1}^T\frac{2k}{T(T+1)}x_k)-f(x^*)\) | \(\le\frac{2L^2}{\alpha (T+1)}\) |
GD | \(\lambda\)-strongly convex + \(\beta\)-smooth | \(\mathcal{O}(\exp{(-T)})\) | \(\gamma=\frac{2}{\lambda+\beta}\) | \(f(x_{t+1})-f(x^*)\); the averaging over iterates can be dropped | \(\le\frac{\beta}{2}\exp{(-\frac{4t}{\kappa+1})}\|x_1-x^*\|^2\), where \(\kappa=\frac{\beta}{\lambda}\) and \(t\) is the step index |
Polyak (heavy ball) | Quadratic loss | \(\mathcal{O}((\frac{\sqrt\kappa-1}{\sqrt\kappa+1})^t)\approx\exp(-\frac{C}{\sqrt\kappa})\), \(\kappa=\frac{h_{max}}{h_{min}}\) | \(\gamma^*=\frac{(1+\sqrt\mu)^2}{h_{max}}=\frac{(1-\sqrt\mu)^2}{h_{min}}\) | \(\left\|\left[\begin{matrix} x_{t+1}-x^*\\ x_t-x^*\end{matrix}\right]\right\|_2\) | \(\le\mathcal{O}(\rho(A)^T)=\mathcal{O}(\sqrt{\mu}^T)\), where \(\mu\) is the momentum parameter |
Nesterov NAG | \(\beta\)-smooth | \(\mathcal{O}(\frac{1}{T^2})\) | | \(f(y_t)-f(x^*)\) | |
Nesterov NAG | \(\alpha\)-strongly convex + \(\beta\)-smooth | \(\mathcal{O}(\exp(-\frac{T}{\sqrt{\kappa}}))\) | | \(f(y_t)-f(x^*)\) | \(\le\frac{\alpha+\beta}{2}\|x_1 -x^*\|^2\exp(-\frac{t-1}{\sqrt\kappa})\) |
SGD | \(L\)-Lipschitz via \(\|\tilde{g}(x)\|\le L\) with prob. 1 | \(\mathcal{O}(\frac{1}{\sqrt T})\); for an \(\epsilon\)-tolerance, \(T\ge\frac{B^2L^2}{\epsilon^2}\) | \(\gamma=\frac{B}{L\sqrt{T}}\) | \(\mathbb{E}[f(\bar{x})]-f(x^*)\), where \(x^*\in\arg\min_{x:\|x\|\le B}f(x)\) | \(\le\frac{BL}{\sqrt T}\) |
SGD | \(\alpha\)-strongly convex + \(\mathbb{E}\|\tilde{g}(x)\|_*^2\le B^2\) (a bounded second-moment condition) | \(\mathcal{O}(\frac{1}{T})\) | \(\gamma_s=\frac{2}{\alpha(s+1)}\) | \(f(\sum\limits_{s=1}^t\frac{2s}{t(t+1)}x_s)-f(x^*)\) | \(\le\frac{2B^2}{\alpha(t+1)}\) |
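A compact sketch (assuming a random strongly convex quadratic \(f(x)=\frac{1}{2}x^TAx-b^Tx\)) checking the linear rate of GD with \(\gamma=\frac{2}{\lambda+\beta}\) from the table:

```python
import numpy as np

# Sketch: GD with step 2/(alpha+beta) on a strongly convex quadratic
# f(x) = 0.5 x^T A x - b^T x; the error shrinks geometrically.
rng = np.random.default_rng(0)
n = 20
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)          # symmetric positive definite
b = rng.normal(size=n)
eigs = np.linalg.eigvalsh(A)
alpha, beta = eigs[0], eigs[-1]  # strong convexity / smoothness constants
x_star = np.linalg.solve(A, b)

x = np.zeros(n)
gamma = 2 / (alpha + beta)
for t in range(200):
    x = x - gamma * (A @ x - b)  # gradient step
print(f"kappa = {beta/alpha:.1f}, final error = {np.linalg.norm(x - x_star):.2e}")
```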
Estimate sequence
Definition
Properties
For any sequence \(\{\lambda_k\}\) satisfying \((2.2.2)\), we can derive the rate of convergence of the minimization process directly from the rate of convergence of the sequence \(\{\lambda_k\}\).
Note: below in estimate sequence, all \(L\) means \(L\)-smooth function
How to form an estimate sequence?
How to ensure \((2.2.2)\)?
Method 1: run scheme \((2.2.6)\); it generates a sequence \(\{x_k\}_{k=0}^\infty\) such that \(f(x_k)-f^*\le \lambda_k[f(x_0)-f^*+\frac{\gamma_0}{2}\|x_0-x^*\|^2]\), and it makes the sequence satisfy \((2.2.2)\).
Method 2: Using gradient step
Some cases
- If in the scheme (2.2.6) we choose \(\gamma_0\ge\mu\), then \(\lambda_k\le\min\{(1-\sqrt{\frac{\mu}{L}})^k,\frac{4L}{(2\sqrt{L}+k\sqrt{\gamma_0})^2}\}\).
- If in the scheme (2.2.6) we choose \(\gamma_0=L\), then this scheme generates a sequence \(\{x_k\}_{k=0}^{\infty}\) such that \(f(x_k)-f^*\le L\min \{(1-\sqrt{\frac{\mu}{L}})^k,\frac{4}{(k+2)^2}\}\|x_0-x^*\|^2\). This means it is optimal for the class \(\mathcal{S}_{\mu,L}^{1,1}(R^n)\) with \(\mu\ge0\).
- If in the scheme \((2.2.8)\) we choose \(\alpha_0\ge \sqrt{\frac{\mu}{L}}\), then this scheme generates a sequence \(\{x_k\}_{k=0}^{\infty}\) such that \(f(x_k)-f^*\le \min \{(1-\sqrt{\frac{\mu}{L}})^k,\frac{4L}{(2\sqrt{L}+k\sqrt{\gamma_0})^2}\}[f(x_0)-f^*+\frac{\gamma_0}{2}\|x_0-x^*\|^2]\), where \(\gamma_0=\frac{\alpha_0(\alpha_0L-\mu)}{1-\alpha_0}\). Here \(\alpha_0\ge \sqrt{\frac{\mu}{L}}\) is equivalent to \(\gamma_0\ge\mu\).
Convex sets
Affine sets
- Affine sets:
- Definition: A set \(C\subseteq R^n\) is affine if the line through any two distinct points in \(C\) lies in \(C\), i.e. for any \(x_1 ,x_2\in C\) and \(\theta\in R\), one has \(\theta x_1+(1-\theta)x_2\in C\). In other words, \(C\) contains every affine combination (coefficients summing to one) of any two of its points.
- By induction: if \(C\) is an affine set, \(x_1,\cdots,x_k\in C\) and \(\theta_1+\cdots+\theta_k=1\), then the point \(\theta_1 x_1+\cdots+\theta_kx_k\) also belongs to \(C\).
- Affine hull
- Definition: the set of all affine combinations of points in some set \(C\subseteq R^n\), denoted as \(\mathrm{aff}C\)
- It's the smallest affine set that contains \(C\)
- Affine dimension
- Definition: as the dimension of its affine hull.
- E.g.: \(\{x\in R^2|x_1 ^2+x_2^2=1\}\), the affine dimension is 2.
Convex sets
- Definition: If every point in the set can be seen by every other point, along an unobstructed straight path between them, where unobstructed means lying in the set.
- Every affine set is also convex.
- The convex hull, denoted \(\mathrm{conv}\,C\), is the set of all convex combinations of points in \(C\). It is always convex, and it is the smallest convex set that contains \(C\).
- More generally, suppose \(p: R^n \rightarrow R\) satisfies \(p(x)\ge0\) for all \(x\in C\) and \(\int_Cp(x)dx=1\), where \(C\subseteq R^n\) is convex, then \(\int_Cp(x)x dx \in C\), if the integral exists.
Convex functions
Convex functions
Definition
All affine functions are both convex and concave.
\(f\) is convex if and only if for all \(x\in \mathrm{dom}\,f\) and all \(v\), the function \(g(t)=f(x+tv)\) is convex on its domain \(\{t\mid x+tv\in\mathrm{dom}\,f\}\).
Extended-value extensions
First-order conditions
The inequality (3.2) states that for a convex function, the first-order Taylor approximation is in fact a global underestimator of the function. Conversely, if the first-order Taylor approximation of a function is always a global underestimator of the function, then the function is convex.
- The inequality (3.2) shows that if \(\nabla f(x) = 0\), then for all \(y \in \mathrm{dom}\,f\), \(f(y) \ge f(x)\), i.e., \(x\) is a global minimizer of the function \(f\).
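A small numeric sketch (assuming the convex function \(f(x)=x^4\)) of the global-underestimator property of the first-order Taylor approximation:

```python
import numpy as np

# Sketch: for convex f(x) = x**4, check f(y) >= f(x) + f'(x)*(y - x)
# at many random pairs -- the first-order global underestimator property.
rng = np.random.default_rng(0)
f = lambda x: x**4
df = lambda x: 4 * x**3
x, y = rng.uniform(-3, 3, size=(2, 100_000))
assert np.all(f(y) >= f(x) + df(x) * (y - x) - 1e-9)
print("first-order approximation underestimates f at every sampled pair")
```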
Second-order conditions
More constraints on convex function
\(\beta\)-smooth
- Definition: a continuously differentiable function \(f\) is \(\beta\)-smooth if its gradient \(\nabla f\) is \(\beta\)-Lipschitz, that is \(\|\nabla f(x)-\nabla f(y)\|\le\beta\|x-y\|\) (see the sketch after this list).
- If \(f\) is twice differentiable, this is equivalent to the eigenvalues of the Hessian being at most \(\beta\) at every point, i.e. \(\nabla^2f(x)\preceq\beta \mathrm{I}_n,\forall x\).
- Smoothness removes the averaging over iterates from the convergence guarantee.
- Extending the exponent in the \(\beta\)-smoothness condition to other powers gives the Hölder condition.
- The faster the gradient of your function can change, the larger the quadratic upper bound you have to use when exploring.
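A sketch (assuming \(f(x)=\frac{1}{2}x^TAx\), which is \(\beta\)-smooth with \(\beta=\lambda_{max}(A)\)) verifying the quadratic upper bound \(f(y)\le f(x)+\nabla f(x)^T(y-x)+\frac{\beta}{2}\|y-x\|^2\) implied by smoothness:

```python
import numpy as np

# Sketch: for f(x) = 0.5 x^T A x the gradient is A x, and f is
# beta-smooth with beta = lambda_max(A); check the quadratic upper bound.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T
beta = np.linalg.eigvalsh(A)[-1]
f = lambda x: 0.5 * x @ A @ x
for _ in range(1000):
    x, y = rng.normal(size=(2, 5))
    upper = f(x) + (A @ x) @ (y - x) + 0.5 * beta * (y - x) @ (y - x)
    assert f(y) <= upper + 1e-9
print("quadratic upper bound holds at all sampled pairs")
```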
\(\alpha\)-strong convexity
Strong convexity can significantly speed-up the convergence of first order methods.
Definition
We say that \(f:\mathcal{X}\rightarrow\mathbb{R}\) is \(\alpha\)-strongly convex if it satisfies the following improved subgradient inequality:
\(f(x)-f(y)\le\nabla f(x)^T(x-y)-\frac{\alpha}{2}\|x-y\|^2\). A large value of \(\alpha\) leads to a faster rate. For a twice-differentiable function \(f\), \(\alpha\)-strong convexity can also be interpreted as \(\nabla^2f(x)\succeq\alpha \mathrm{I}_n,\forall x\).
Strong convexity plus \(\beta\)-smoothness implies that gradient descent with a constant step size achieves a linear rate of convergence; precisely, the oracle complexity is \(O(\frac{\beta}{\alpha}\log(1/\varepsilon)), \beta\ge\alpha\). In some sense strong convexity is a dual assumption to smoothness, and in fact this can be made precise within the framework of Fenchel duality.
The condition number \(\kappa=\frac{\beta}{\alpha}\) can often be as large as the sample size. Reducing the number of steps from the sample size to \(\sqrt{\mathrm{sample\ size}}\) (basic gradient descent requires \(\mathcal{O}(\kappa\log(\frac{1}{\epsilon}))\) steps to reach \(\epsilon\)-accuracy, while Nesterov's Accelerated Gradient Descent attains the improved oracle complexity \(\mathcal{O}(\sqrt{\kappa}\log(\frac{1}{\epsilon}))\)) can be a huge deal, especially in large-scale applications.
Examples of Convex functions
- Norms
- Max function
- Quadratic-over-linear function \(\frac{x^2}{y}\)
- Log-sum-exp: \(\log(e^{x_1}+\cdots+e^{x_n})\), which is regarded as a differentiable approximation of the max function
- Geometric mean: \((\prod\limits_{i=1}^{n}x_i)^{1/n}\), concave
- Log-determinant: \(\log\det X\), concave
For proofs, check Chapter \(3.1.5\) of Convex Optimization.
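A numeric spot check (midpoint convexity on random pairs; the proof is in Chapter \(3.1.5\)) for log-sum-exp:

```python
import numpy as np

# Sketch: midpoint-convexity spot check for log-sum-exp,
# f(x) = log(sum_i exp(x_i)), a smooth approximation of max.
rng = np.random.default_rng(0)

def lse(x):
    m = x.max()  # max-shift for numerical stability
    return m + np.log(np.exp(x - m).sum())

for _ in range(1000):
    x, y = rng.normal(size=(2, 4))
    assert lse((x + y) / 2) <= (lse(x) + lse(y)) / 2 + 1e-12
print("midpoint convexity holds at all sampled pairs")
```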
Sublevel sets
Epigraph
A function is convex if and only if its epigraph is a convex set.
Jensen's inequality and extensions
Once a function is convex, Jensen's inequality \(f(\mathbb{E}X)\le\mathbb{E}f(X)\) follows;
the simplest version of it is \(f(\frac{x+y}{2})\le\frac{f(x)+f(y)}{2}\).
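A quick Monte Carlo sketch (assuming \(f=\exp\) and \(X\sim\mathcal{N}(0,1)\)) of \(f(\mathbb{E}X)\le\mathbb{E}f(X)\):

```python
import numpy as np

# Sketch: Jensen's inequality f(E[X]) <= E[f(X)] for convex f = exp and
# X ~ N(0, 1): exp(0) = 1 <= E[exp(X)] = exp(1/2) ~ 1.65.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
print(np.exp(x.mean()), np.exp(x).mean())  # ~1.00 vs ~1.65
```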
Holder's inequality
Operations that preserve convexity
Nonnegative weighted sums: \(f=w_1f_1+\cdots+w_mf_m\)
Composition with an affine mapping: \(g(x)=f(Ax+b)\). If \(f\) is convex, so is \(g\); if \(f\) is concave, so is \(g\).
Pointwise maximum and supremum: \(f(x)=\max\{f_1(x),f_2(x)\}\) and \(f(x)=\max\{f_1(x),\cdots,f_m(x)\}\)
Composition: \(f(x)=h(g(x))\)
Scalar composition
Vector composition \(f(x)=h(g(x))=h(g_1(x),\cdots,g_k(x))\)
Minimization
Typical Numerical optimization
Gradient descent
The basic principle behind gradient descent is to make a small step in the direction that minimizes the local first-order Taylor approximation of \(f\) (also known as the steepest descent direction). These methods obtain an oracle complexity independent of the dimension.
\(x_{t+1}=x_t-\eta\nabla f(x_t)\)
Take \(f(w)=\frac{1}{2}w^TAw-b^Tw,w\in\mathbb{R}^n\) as an example, and suppose \(A\) is symmetric and invertible; then \(A=Q\Lambda Q^T\) with \(\Lambda=\mathrm{diag}(\lambda_1,\cdots,\lambda_n)\), \(\lambda_1\le\lambda_2\le\cdots\le\lambda_{n-1}\le\lambda_n\).
All errors are not made equal. Indeed, there are different kinds of errors, \(n\) to be exact, one for each of the eigenvectors of \(A\).
\(f(w^k)-f(w^*)=\sum_i(1-\alpha\lambda_i)^{2k}\lambda_i[x_i^0]^2\), where \(x^0=Q^T(w^0-w^*)\) is the initial error expressed in the eigenbasis.
Denote the condition number \(\kappa=\frac{\lambda_n}{\lambda_1}\); the bigger \(\kappa\) is, the slower gradient descent will be, since the condition number directly measures pathological curvature.
The optimal step-size causes the first and last eigenvectors to converge at the same rate.
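A sketch (random SPD \(A\); the setup mirrors the quadratic above) showing that with the optimal step size \(\alpha=\frac{2}{\lambda_1+\lambda_n}\) the slowest and fastest eigenmodes contract at the same rate:

```python
import numpy as np

# Sketch: GD on f(w) = 0.5 w^T A w - b^T w, tracking the error in the
# eigenbasis.  With step 2/(l1+ln) the first and last eigenmodes both
# contract at rate |1 - alpha*lambda_i| = (ln-l1)/(ln+l1).
rng = np.random.default_rng(0)
n = 10
M = rng.normal(size=(n, n))
A = M @ M.T + 0.1 * np.eye(n)
b = rng.normal(size=n)
lam, Q = np.linalg.eigh(A)       # eigenvalues in ascending order
w_star = np.linalg.solve(A, b)
alpha = 2 / (lam[0] + lam[-1])   # optimal step size

w = np.zeros(n)
for k in range(50):
    w = w - alpha * (A @ w - b)
x = Q.T @ (w - w_star)           # per-eigenmode error
print(f"rates: {abs(1 - alpha*lam[0]):.4f} (slow) vs {abs(1 - alpha*lam[-1]):.4f} (fast)")
print(f"|x_1| = {abs(x[0]):.2e}, |x_n| = {abs(x[-1]):.2e}")
```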
Projected gradient descent
Subgradient
Projected subgradient descent
Gradient descent with momentum : Polyak's Momentum
Sometimes SGD fails because of pathological curvature of the objective (e.g. valleys, trenches).
Momentum modifies gradient descent by adding a short-term memory:
\(y_{t+1}=\beta y_t+\nabla f(x_t)\\x_{t+1}=x_{t}-\alpha y_{t+1}\).
When \(\beta=0\), it's gradient descent.
Momentum allows us to crank the step size up by a factor of two before diverging.
Optimize over \(\beta\): The critical value of \(\beta = (1 - \sqrt{\alpha \lambda_i})^2\) gives us a convergence rate (in eigenspace \(i\)) of \(1 - \sqrt{\alpha\lambda_i}\). A square root improvement over gradient descent, \(1-\alpha\lambda_i\)! Alas, this only applies to the error in the \(i^{th}\) eigenspace, with \(\alpha\) fixed.
Failure case: there exist strongly convex and smooth functions for which, with carefully chosen hyperparameters \(\alpha\) and \(\beta\) and initial condition \(x_0\), the heavy-ball method fails to converge.
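A minimal heavy-ball sketch (quadratic objective; the step-size/momentum pair below is the one that is optimal for quadratics, an assumption of this demo):

```python
import numpy as np

# Sketch: Polyak heavy-ball on a quadratic, with the momentum/step pair
# that is optimal for quadratics:
#   mom = ((sqrt(k)-1)/(sqrt(k)+1))**2,  step = (1+sqrt(mom))**2 / l_max.
rng = np.random.default_rng(0)
n = 10
M = rng.normal(size=(n, n))
A = M @ M.T + 0.1 * np.eye(n)
b = rng.normal(size=n)
lam = np.linalg.eigvalsh(A)
kappa = lam[-1] / lam[0]
mom = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))**2
step = (1 + np.sqrt(mom))**2 / lam[-1]
w_star = np.linalg.solve(A, b)

w, y = np.zeros(n), np.zeros(n)
for t in range(200):
    y = mom * y + (A @ w - b)    # short-term memory of past gradients
    w = w - step * y
print(f"kappa = {kappa:.1f}, error = {np.linalg.norm(w - w_star):.2e}")
```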
Nesterov’s Accelerated Gradient Descent
Iterations: starting at an arbitrary initial point \(x_1=y_1\)
\[y_{s+1}=x_s-\frac{1}{\beta}\nabla f(x_s),\\x_{s+1}=(1+\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1})y_{s+1}-\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}y_s\]
Let \(f\) be \(\beta\)-smooth and \(\alpha\)-strongly convex, then Nesterov's gradient descent satisfies for \(t\ge0\), \(f(y_t)-f(x^*)\le\frac{\alpha+\beta}{2}\|x_1-x^*\|^2\exp{(-\frac{t-1}{\sqrt{\kappa}})}\) .
Converges in \(\mathcal{O}(\frac{1}{T^2})\) in the smooth case and in \(\mathcal{O}(\exp(-\frac{T}{\sqrt{\kappa}}))\) in the smooth, strongly convex case. Unlike heavy ball, whose guarantee covers quadratic functions (but not piecewise-quadratic ones), NAG's convergence is guaranteed in general.
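A sketch of the strongly convex NAG iteration above on a quadratic (same setup as the earlier demos):

```python
import numpy as np

# Sketch: Nesterov's accelerated GD (strongly convex variant),
# implementing exactly the two-line iteration above.
rng = np.random.default_rng(0)
n = 10
M = rng.normal(size=(n, n))
A = M @ M.T + 0.1 * np.eye(n)
b = rng.normal(size=n)
lam = np.linalg.eigvalsh(A)
alpha, beta = lam[0], lam[-1]        # strong convexity / smoothness
kappa = beta / alpha
c = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
w_star = np.linalg.solve(A, b)

x = y = np.zeros(n)
for s in range(200):
    y_next = x - (A @ x - b) / beta  # gradient step from x
    x = (1 + c) * y_next - c * y     # momentum extrapolation
    y = y_next
print(f"error = {np.linalg.norm(y - w_star):.2e}")
```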
Stochastic gradient descent
Two cases: in one, \(\mathbb{E}_{\xi}\nabla_x\ell(x,\xi)\in\partial f(x)\), where \(\xi\) is freshly sampled; this oracle cannot be reproduced (queried again on the same sample). In the other, we directly minimize \(f(x)=\frac{1}{m}\sum\limits_{i=1}^{m}f_i(x)\), and the gradient is reported as \(\nabla f_I(x)\) with \(I\in[m]\); this oracle can be reproduced.
Non-smooth stochastic optimization
- Definition: there exists \(B>0\) such that \(\mathbb{E}\|\tilde{g}(x)\|_*^2\le B^2\) for all \(x\in\mathcal{X}\)
Smooth stochastic optimization and mini-batch SGD
Definition: there exists \(\sigma>0\) such that \(\mathbb{E}\|\tilde{g}(x)-\nabla f(x)\|_*^2\le \sigma^2\) for all \(x\in\mathcal{X}\).
Smoothness does not bring any acceleration for a general stochastic oracle, while in the exact oracle case it does.
Stochastic smooth optimization converges at rate \(\frac{1}{\sqrt{t}}\), while deterministic smooth optimization converges at rate \(\frac{1}{t}\).
Mini-batch SGD converges between \(\frac{1}{\sqrt{t}}\) and \(\frac{1}{t}\) (see the sketch below).
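A small sketch of mini-batch SGD (least-squares finite sum, the reproducible-oracle case; batch size and step size are illustrative):

```python
import numpy as np

# Sketch: mini-batch SGD on f(x) = (1/m) * sum_i 0.5*(a_i^T x - y_i)**2.
# Larger batches reduce gradient variance, moving the rate from the
# stochastic O(1/sqrt(t)) toward the deterministic O(1/t).
rng = np.random.default_rng(0)
m, d, batch, step = 1000, 5, 32, 0.1
A = rng.normal(size=(m, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=m)
x_star = np.linalg.lstsq(A, y, rcond=None)[0]

x = np.zeros(d)
for t in range(2000):
    idx = rng.choice(m, size=batch, replace=False)
    grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch  # mini-batch gradient
    x = x - step * grad
print(f"distance to optimum: {np.linalg.norm(x - x_star):.3f}")
```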
Relations
Generalization
The momentum SGD and Adaptive optimizers
YellowFin and the Art of Momentum Tuning
- Summary: hand-tuning a single learning rate and momentum makes SGD competitive with Adam. The proposed YellowFin (an automatic tuner for momentum and learning rate in SGD) can converge in fewer iterations than Adam on ResNets and LSTMs for image recognition, language modeling and constituency parsing.
- Adaptive optimizers: like Adam, AdaGrad, RmsProp
- For details of paper check here
Dimension-free convex optimization
Projected subgradient descent for Lipschitz functions
Theorem
Assume that \(\mathcal{X}\) is contained in a Euclidean ball centered at \(x_1\in \mathcal{X}\) of radius \(R\). Assume that \(f\) is such that for any \(x\in \mathcal{X}\) and any \(g\in\partial f(x)\) (assume \(\partial f(x)\ne \emptyset\)) one has \(\|g\|\le L\). (This implies that \(f\) is \(L\)-Lipschitz on \(\mathcal{X}\), that is \(|f(x)-f(y)|\le L\|x-y\|\).)
The projected subgradient descent method with \(\eta=\frac{R}{L\sqrt{t}}\) satisfies \(f(\frac{1}{t}\sum\limits_{s=1}^{t}x_s)-f(x^*)\le\frac{RL}{\sqrt{t}}\)
Proof
The rate is unimprovable from a black-box perspective.
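A sketch of projected subgradient descent matching the theorem (assuming \(f(x)=\|x-c\|_1\) with subgradient \(\mathrm{sign}(x-c)\), constrained to the Euclidean ball of radius \(R\); here \(L=\sqrt{n}\)):

```python
import numpy as np

# Sketch: projected subgradient descent on f(x) = ||x - c||_1 over the
# Euclidean ball of radius R, with the theorem's step eta = R/(L*sqrt(T)).
rng = np.random.default_rng(0)
n, R, T = 5, 1.0, 5000
c = 3 * rng.normal(size=n)      # the unconstrained optimum lies outside

def project(x):                 # Euclidean projection onto the ball
    norm = np.linalg.norm(x)
    return x if norm <= R else R * x / norm

L = np.sqrt(n)                  # Lipschitz constant of the l1 norm
eta = R / (L * np.sqrt(T))
x, avg = np.zeros(n), np.zeros(n)
for s in range(1, T + 1):
    g = np.sign(x - c)          # a subgradient of ||x - c||_1
    x = project(x - eta * g)
    avg += (x - avg) / s        # running average of the iterates
print(f"f(average iterate) = {np.abs(avg - c).sum():.4f}")
```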
Gradient descent for smooth functions
Theorems under unconstrained cases
In this section all \(f\) is a convex and \(\beta\)-smooth function on \(\mathbb{R}^n\).
Theorem
Let \(f\) be convex and \(\beta\)-smooth function on \(\mathbb{R}^n\). Then gradient descent with \(\eta=\frac{1}{\beta}\) satisfies \(f(x_t)-f(x^*)\le\frac{2\beta\|x_1-x^*\|^2}{t-1}\) .
For the proof check \(3.2\) in Convex Optimization: Algorithms and Complexity.
Gradient descent attains a much faster rate in \(\beta\)-smooth situation than in the non-smooth case of the previous section.
The Definition of smooth convex functions
The constrained cases
This time consider the projected gradient descent algorithm \(x_{t+1}=\prod_{\mathcal{X}}(x_t-\eta\nabla f(x_t))\)
- Lemma
Let \(x,y\in \mathcal{X},x^+=\prod_{\mathcal{X}}(x-\frac{1}{\beta}\nabla f(x))\) and \(g_{\mathcal{X}}(x)=\beta (x-x^+)\) , then the following holds true: \(f(x^+)-f(y)\le g_{\mathcal{X}}(x)^T(x-y)-\frac{1}{2\beta}\|g_{\mathcal{X}}(x)\|^2\).
- Theorem
Let \(f\) be convex and \(\beta\)-smooth function on \(\mathcal{X}\). Then projected gradient descent with \(\eta=\frac{1}{\beta}\) satisfies \(f(x_t)-f(x^*)\le\frac{3\beta\|x_1-x^*\|^2+f(x_1)-f(x^*)}{t}\) .
Strong convexity
Strongly convex and Lipschitz functions
- Theorem
Let \(f\) be \(L\)-Lipschitz and \(\alpha\)-strongly convex on \(\mathcal{X}\). Then projected gradient descent with \(\eta_s=\frac{2}{\alpha(s+1)}\) satisfies \(f(\sum\limits_{s=1}^{t}\frac{2s}{t(t+1)}x_s)-f(x^*)\le\frac{2L^2}{\alpha(t+1)}\) .
The combination of \(\alpha\)-strong convexity and \(L\)-Lipschitzness means the function has to be constrained to a bounded domain.
Strongly convex and smooth functions
Theorem
Let \(f\) be \(\beta\)-smooth and \(\alpha\)-strongly convex on \(\mathcal{X}\), then projected gradient descent with \(\eta=\frac{1}{\beta}\) satisfies for \(t\ge0\), \(\|x_{t+1}-x^*\|^2\le\exp(-\frac{t}{\kappa})\|x_1 -x^*\|^2\) .
The intuition for \(\alpha\) and \(\beta\): decreasing \(\beta\) lowers (tightens) the quadratic upper bound from smoothness, while increasing \(\alpha\) raises (tightens) the quadratic lower bound from strong convexity.
Lemma
Let \(f\) be \(\beta\)-smooth and \(\alpha\)-strongly convex on \(\mathbb{R}^n\), then for all \(x,y\in \mathbb{R}^n\), one has \((\nabla f(x)-\nabla f(y))^T(x-y)\ge\frac{\alpha\beta}{\alpha+\beta}\|x-y\|^2+\frac{1}{\beta+\alpha}\|\nabla f(x)-\nabla f(y)\|^2\) .
Theorem
Let \(f\) be \(\beta\)-smooth and \(\alpha\)-strongly convex on \(\mathbb{R}^n\), \(\kappa=\frac{\beta}{\alpha}\) as the condition number. Then gradient descent with \(\eta=\frac{2}{\beta+\alpha}\) satisfies \(f(x_{t+1})-f(x^*)\le\frac{\beta}{2}\exp(-\frac{4t}{\kappa+1})\|x_1-x^*\|^2\)
Lower bound -- black box
A black-box procedure is a mapping from "history" to the next query point, that is, it maps \((x_1 ,g_1,\cdots,x_t,g_t)\) (with \(g_s\in\partial f(x_s)\)) to \(x_{t+1}\). To simplify, make the following assumption on the black-box procedure: \(x_1=0\) and for any \(t\ge0\), \(x_{t+1}\) is in the linear span of \(g_1,\cdots,g_t\), that is \[x_{t+1}\in\mathrm{Span}(g_1 ,\cdots,g_t)\tag{3.15}\label{eq315}\]
Theorem
- Let \(t\le n\), \(L,R>0\). There exists a convex and \(L\)-Lipschitz function \(f\) such that for any black-box procedure satisfying \(\eqref{eq315}\), \(\min\limits_{1\le s\le t}f(x_s)-\min\limits_{x\in B_2(R)}f(x)\ge\frac{RL}{2(1+\sqrt{t})}\), where \(B_2(R)=\{x\in\mathbb{R}^n:\|x\|\le R\}\). There also exists an \(\alpha\)-strongly convex and \(L\)-Lipschitz function \(f\) such that for any black-box procedure satisfying \(\eqref{eq315}\), \(\min\limits_{1\le s\le t}f(x_s)-\min\limits_{x\in B_2(\frac{L}{2\alpha})}f(x)\ge\frac{L^2}{8\alpha t}\)
- Let \(t\le (n-1)/2,\beta>0\). There exists a \(\beta\)-smooth convex function \(f\) such that for any black-box procedure satisfying \(\eqref{eq315}\), \(\min\limits_{1\le s\le t}f(x_s)-f(x^*)\ge\frac{3\beta}{32}\frac{\|x_1-x^*\|^2}{(t+1)^2}\).
- Let \(\kappa\ge 1\). There exists a \(\beta\)-smooth and \(\alpha\)-strongly convex function \(f:\ell_2\rightarrow \mathbb{R}\) with \(\kappa=\frac{\beta}{\alpha}\) such that for any \(t\ge1\) and black-box procedure satisfying \(\eqref{eq315}\) one has \(f(x_t)-f(x^*)\ge\frac{\alpha}{2}(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1})^{2(t-1)}\|x_1-x^*\|^2\). Note that for large values of the condition number \(\kappa\) one has \((\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1})^{2(t-1)}\approx\exp(-\frac{4(t-1)}{\sqrt{\kappa}})\)
References
- Convex Optimization Notes 5: Subgradient Method (凸优化笔记5, in Chinese)
- Convex Optimization
- Convex Optimization: Algorithms and Complexity
- 'Understanding Analysis' by Stephen Abbott. It's a nice and light intro to analysis