Notes of Mathematics

2019-06-19

Functional Analysis

Fixed-Point Theorems

  1. The basic logic of a fixed-point theorem: for an existence problem, construct a metric space and a mapping on it such that the existence problem is equivalent to the mapping having a fixed point. Once the mapping is shown to have a fixed point, the original existence statement follows.

Uses of Compactness

  • Proving existence
  • "Mimicking" finite-dimensional Euclidean space in infinite-dimensional spaces

Are compact sets meant to mimic the bounded closed sets of Euclidean space? Compactness = relative compactness + closedness.

Manifolds

  1. Concept: the generalization of curves and surfaces to higher-dimensional spaces; e.g., a surface in three-dimensional space is a two-dimensional manifold.

Support

  1. Concept: the subset of the domain on which a function is nonzero; the support of a probability distribution is the set where its probability density is nonzero.

Mathematical Analysis

Lipschitz Continuity

  1. If there exists a constant K such that any two points x_1, x_2 of the domain satisfy $|f(x_1)-f(x_2)| \le K|x_1-x_2|$, the function is called Lipschitz continuous. This property bounds the absolute value of the derivative of f by K (wherever the derivative exists) and caps the maximum local variation of the function. A quick numerical check follows.
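A small numerical sanity check (a sketch using NumPy; the function sin and the sampling range are illustrative choices): the empirical ratio |f(x)−f(y)| / |x−y| over sampled pairs lower-bounds the true Lipschitz constant and should stay below K.

    import numpy as np

    def lipschitz_ratio(f, xs, ys):
        """Largest |f(x)-f(y)| / |x-y| over sampled pairs: a lower
        bound on the true Lipschitz constant."""
        return max(abs(f(x) - f(y)) / abs(x - y)
                   for x, y in zip(xs, ys) if x != y)

    rng = np.random.default_rng(0)
    xs, ys = rng.uniform(-5, 5, 1000), rng.uniform(-5, 5, 1000)
    # sin has derivative cos, so K = 1 works: the empirical ratio stays <= 1.
    print(lipschitz_ratio(np.sin, xs, ys))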

Statistical learning theory

Introduction

Determine how well a model performs on unseen data

Preliminary

  • Markov's Inequality

    For a nonnegative random variable X and a > 0: P(X ≥ a) ≤ E[X]/a, a tail bound of order O(1/deviation).

  • Chebyshev's Inequality

    P(|X − E[X]| ≥ a) ≤ Var(X)/a², a tail bound of order O(1/deviation²).

  • Generic Chernoff's Bound

    For any t > 0, P(X ≥ a) ≤ e^{−ta}·E[e^{tX}]; optimizing over t gives P(X ≥ a) ≤ inf_{t>0} e^{−ta} E[e^{tX}].

  • Hoeffding's Inequality

    • Hoeffding's lemma

      If X ∈ [a, b] almost surely and E[X] = 0, then for all s ∈ R: E[e^{sX}] ≤ exp(s²(b−a)²/8).

    • Hoeffding's Inequality

      For independent X_i ∈ [a_i, b_i] with empirical mean X̄ = (1/n)∑_i X_i: P(|X̄ − E[X̄]| ≥ t) ≤ 2·exp(−2n²t² / ∑_i (b_i−a_i)²).

      Hoeffding's inequality is useful to bound the probability of a gap between the empirical average of bounded random variables and its true expectation (see the Monte Carlo sketch after this list).

  • McDiarmid's Inequality

    If f has the bounded-differences property with constants c_i (changing the i-th argument changes f by at most c_i), then P(f(X_1, …, X_n) − E[f] ≥ t) ≤ exp(−2t² / ∑_i c_i²).

    • A concentration inequality.
    • This bound is useful because, once we prove that an algorithm is β-stable, the defect (viewed as a function of the sample) has bounded differences and McDiarmid's inequality applies to it.
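A Monte Carlo sketch (illustrative numbers; NumPy assumed) comparing the empirical tail of a Bernoulli empirical mean against Hoeffding's bound 2·exp(−2nt²):

    import numpy as np

    rng = np.random.default_rng(0)
    n, t, trials = 100, 0.1, 20_000
    p = 0.5                                  # Bernoulli mean, X_i in [0, 1]
    means = rng.binomial(n, p, size=trials) / n
    empirical = np.mean(np.abs(means - p) >= t)
    hoeffding = 2 * np.exp(-2 * n * t**2)    # sum (b_i - a_i)^2 = n here
    print(f"empirical tail: {empirical:.4f}  <=  Hoeffding bound: {hoeffding:.4f}")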

PAC

Check here for PAC, VC, uniform bound and others.

  • PAC Learning (agnostic PAC learnable) -- finite hypothesis class

    Definition (agnostic PAC learnability): H is agnostic PAC learnable if there exist a function m_H(ε, δ) and a learning algorithm such that, for every ε, δ ∈ (0, 1) and every distribution D, running the algorithm on m ≥ m_H(ε, δ) i.i.d. samples returns a hypothesis h with L_D(h) ≤ min_{h'∈H} L_D(h') + ε, with probability at least 1−δ.

    For a finite class, with probability at least 1−δ, every h ∈ H satisfies |R[h] − R̂_S[h]| ≤ √(log(2|H|/δ) / (2n)).

    • Consider a simple linear classifier with 2 weights w⃗ = (w_1; w_2), each stored as a 32-bit float. This implies that the hypothesis class is finite, with |H| = 2^{64}.
    • This theorem works on finite hypothesis classes and gives, for a hypothesis class H, the sample complexity required to make the generalization gap smaller than ε with probability at least 1−δ.
    • Once a hypothesis class is PAC learnable, the training set is ε-representative with high probability.
    • Note: if H is PAC learnable, the function m_H satisfying the requirements of the definition is not unique.
    • Finite classes are PAC learnable, and also agnostic PAC learnable.
  • Uniform convergence

    • Formalizes that the empirical risk is close to the true risk simultaneously over all hypotheses in H; this is what makes ERM work.
    • Definition: H has the uniform convergence property if there is a function m(ε, δ) such that, for every distribution D, a sample S with |S| ≥ m(ε, δ) satisfies ∀h ∈ H: |R[h] − R̂_S[h]| ≤ ε with probability at least 1−δ.
  • VC dimension

    • Measures the complexity of a hypothesis class by something other than its cardinality.

    • Shattering

      H shatters a finite set C if the restriction of H to C realizes all 2^{|C|} possible labelings of C.

    • VC Dimension

      VCdim(H) is the maximal size of a set C that H can shatter; if H shatters arbitrarily large sets, VCdim(H) = ∞.

      Informally, the VC dimension is the maximum number of distinct points such that, for every possible labeling, some hypothesis in H classifies all of them with zero error (a brute-force check appears after this list).

      • A hypothesis class with infinite VC dimension is not PAC learnable.
      • There exist hypothesis classes with uncountable cardinality but finite VC dimension.
      • For every two hypothesis classes, if H₀ ⊆ H then VCdim(H₀) ≤ VCdim(H).
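A brute-force sketch of shattering (illustrative; 1-D threshold classifiers h_t(x) = 1[x ≥ t]): one point can be shattered but two cannot, so this class has VC dimension 1.

    import numpy as np

    def shatters(points, thresholds):
        """True if h_t(x) = 1[x >= t] realizes all 2^|points| labelings."""
        realized = {tuple((points >= t).astype(int)) for t in thresholds}
        return len(realized) == 2 ** len(points)

    pts1 = np.array([0.3])
    pts2 = np.array([0.3, 0.7])
    grid = np.linspace(-1, 2, 301)      # dense grid of candidate thresholds
    print(shatters(pts1, grid))         # True  -> one point is shattered
    print(shatters(pts2, grid))         # False -> labeling (1, 0) unrealizable
    # Hence VCdim of 1-D thresholds is 1.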

Occam's bound

  • The PAC bound can be treated as Occam's bound with a uniform prior.

  • Occam's bound will put a distribution over the countably infinite hypothesis class H that is independent of dataset S we will receive. In doing so we will get bounds on the generalization gap that no longer depend on the size of the hypothesis class, |H|. These bounds now become variable depending on how we weigh each individual hypothesis h, i.e. P(h).

  • Occam's bound: with probability at least 1−δ, for every h ∈ H, R[h] ≤ R̂_S[h] + √((log(1/P(h)) + log(1/δ)) / (2n)) (a worked computation follows at the end of this list).

    • Regularizers correspond to assigning higher prior probability to w⃗ near the origin and thus give a tighter bound; this does not change the algorithm's loss. Specifically, adding an ℓ2 regularization term to the learning algorithm adds (strong) convexity to the loss function.

  • PAC-Bayes bound: with probability at least 1−δ, for every posterior Q, E_{h∼Q} R[h] ≤ E_{h∼Q} R̂_S[h] + √((D(Q‖P) + log(1/δ)) / (2n)) (up to the exact constants of the particular statement), where D(Q‖P) is the KL divergence, which serves as a complexity measure.

    • With a good posterior that stays close to the prior, the KL divergence is small and the bound is tighter. Even then, the bound may be tight only for hypotheses we do not care about, e.g., those favored by the prior P.

    • Note that the posterior Q is what one obtains after choosing the prior P on hypotheses and then seeing the data.

    • Why a posterior?

      Different choices of prior and posterior can be made, each resulting in a new bound, without touching the algorithm.

    • The dropout PAC-Bayes bound is a lower bound on the PAC-Bayes bound that becomes tight when the dropout factor is 0.
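A small sketch of the Occam bound above (the prior, the training errors, and n are all illustrative): with P(h_k) ∝ 2^{−k}, the penalty grows with the "complexity" log(1/P(h)) rather than with |H|.

    import numpy as np

    def occam_bound(emp_risk, prior_h, n, delta=0.05):
        """Occam bound: the gap scales with log(1/P(h)), not with |H|."""
        return emp_risk + np.sqrt((np.log(1 / prior_h) + np.log(1 / delta)) / (2 * n))

    n = 1000
    # A prior preferring "simpler" hypotheses: P(h_k) proportional to 2^{-k}.
    priors = np.array([2.0 ** -k for k in range(1, 5)])
    priors /= priors.sum()
    emp_risks = np.array([0.20, 0.15, 0.12, 0.11])   # illustrative training errors
    for k, (r, p) in enumerate(zip(emp_risks, priors), start=1):
        print(f"h_{k}: bound = {occam_bound(r, p, n):.3f}")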

Stability, Generalization

  • PAC learning and Occam's bound work as algorithm-agnostic bounds.

  • Stability

    • Hint: a small change in the training data does not change the predictions much.

    • Definition: uniform stability

      An algorithm A is β-uniformly stable if, for every dataset S and every perturbed dataset S^{i,z} (obtained by replacing the i-th example with z), sup_{z'} |ℓ(h_S, z') − ℓ(h_{S^{i,z}}, z')| ≤ β.

      An algorithm with this property produces hypotheses whose loss ℓ is not drastically affected by perturbing the dataset in this manner.

    • ERM with regularization is β-stable.

      For a convex, L-Lipschitz loss with a (λ/2)‖w‖² regularizer, ERM is β-uniformly stable with β = O(L²/(λn)), so β → 0 as n grows (an empirical sketch follows at the end of this section).

      • If we perturb the data by a single element, the hypotheses learned by A become arbitrarily close for large n.
    • SGD is stable

      For a convex, L-Lipschitz, smooth loss, SGD run for T steps with step sizes α_t is uniformly stable with β ≤ (2L²/n)·∑_t α_t (Hardt et al.).

      • The guarantee holds for a finite number of steps T.
  • Defect:

    D[h_S] = R[h_S] − R̂_S[h_S]

    • Defect D[hS] for a hypothesis hS derived from an algorithm after seeing the dataset S is defined as the difference between the population risk and the empirical risk.

    • Its expectation is not zero in general.

    • The expectation value of defect can be bounded under certain conditions.

      If A is a β-uniformly stable algorithm, then −β ≤ E[D[h_S]] ≤ β.

    • Let A be a β-uniformly stable learning algorithm with respect to a loss function ℓ: Y×Y → [0, M]. The absolute difference of the defect calculated on a dataset S and on a perturbed version S^{i,z} is bounded by |D[h_S] − D[h_{S^{i,z}}]| ≤ 2β + M/n.

  • For a β-uniformly stable algorithm, the relationship between the empirical and the population risk is E[R[hS]]≤ES[R^S[hS]]+β.

    • It is a bound on the expectation of the population risk, but it does not hold for all possible h_S.
  • For a β-uniformly stable algorithm A with respect to a loss function ℓ: Y×Y → [0, M] and a hypothesis h_S with |S| = n, the relationship between the empirical and the population risk holds with probability 1−δ: R[h_S] ≤ R̂_S[h_S] + β + (nβ + M/2)·√(2·log(2/δ)/n).

    • The last term comes from a concentration inequality (McDiarmid's).
    • Unless β decays faster than 1/√n, this bound becomes less tight as n grows.
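An empirical sketch of uniform stability (illustrative; ridge regression stands in for regularized ERM, NumPy only): replace one training example and watch the squared loss of the learned predictor on a fresh point move by roughly O(1/(λn)).

    import numpy as np

    def ridge_fit(X, y, lam):
        """Regularized ERM: w = argmin (1/n)||Xw - y||^2 + lam * ||w||^2."""
        n, d = X.shape
        return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

    rng = np.random.default_rng(0)
    lam, d = 0.1, 5
    for n in (100, 1000, 10_000):
        X = rng.normal(size=(n, d))
        y = X @ rng.normal(size=d) + rng.normal(size=n)
        Xp, yp = X.copy(), y.copy()
        Xp[0], yp[0] = rng.normal(size=d), rng.normal()   # replace one example
        w, wp = ridge_fit(X, y, lam), ridge_fit(Xp, yp, lam)
        x_test, y_test = rng.normal(size=d), 0.0          # a fresh test point
        gap = abs((x_test @ w - y_test) ** 2 - (x_test @ wp - y_test) ** 2)
        print(f"n={n:6d}  loss gap on a fresh point: {gap:.2e}")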

Convex Optimization

Books: Convex Optimization (Boyd & Vandenberghe) and Convex Optimization: Algorithms and Complexity (Bubeck).

Terminology and definitions:

  • Convex function (original definition): f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y)
  • Convex function, first-order condition (once differentiable): f(y) ≥ f(x) + ∇f(x)ᵀ(y−x)
  • Convex function, second-order condition (twice differentiable): ∇²f(x) ⪰ 0, i.e. the eigenvalues are never negative
  • L-Lipschitz: |f(x) − f(y)| ≤ L‖x−y‖
  • β-smooth (β-Lipschitz gradient): ‖∇f(x) − ∇f(y)‖ ≤ β‖x−y‖ ⇒ ∇²f(x) ⪯ βI
  • α-strongly convex (together with Lipschitz continuity, this mostly constrains the domain): f(x) − f(y) ≤ ∇f(x)ᵀ(x−y) − (α/2)‖x−y‖² ⇒ ∇²f(x) ⪰ αI
Table: optimizers under different conditions

For each optimizer, going down the rows, the rate of convergence improves. The optimal step size is obtained by minimizing the bound. (A runnable sketch of the strongly convex + smooth row follows the table.)

  • GD after T steps, L-Lipschitz + convex: rate O(1/√T); optimal step γ = ‖x_1−x*‖/(L√T); gap f((1/T)∑_{k=1}^T x_k) − f(x*) ≤ ‖x_1−x*‖L/√T. The initial point matters.
  • GD, β-smooth + convex: rate O(1/T); optimal step γ = 1/β, constant and independent of T; gap f(x_k) − f(x*) (note: no averaging over the iterates is needed) ≤ 2β‖x_1−x*‖²/(k−1), where k is the step index. The bound depends on the initial point.
  • Projected subGD after T steps, α-strongly convex + L-Lipschitz: rate O(1/T); step γ_k = 2/(α(k+1)), diminishing at every step; gap f(∑_{k=1}^T (2k/(T(T+1))) x_k) − f(x*) ≤ 2L²/(α(T+1)).
  • GD, α-strongly convex + β-smooth: rate O(exp(−T/κ)); step γ = 2/(α+β); gap f(x_{t+1}) − f(x*) (no averaging needed) ≤ (β/2)·exp(−4t/(κ+1))·‖x_1−x*‖², where κ = β/α; here the index t (also written k) counts steps, while κ+1 sits in the denominator of the exponent.
  • Polyak momentum (heavy ball), quadratic loss: rate O(((√κ−1)/(√κ+1))^t) ≈ exp(−C·t/√κ), with κ = h_max/h_min; optimal step γ* = (1+√μ)²/h_max = (1−√μ)²/h_min, μ being the momentum parameter; the gap is measured on the stacked iterates, ‖[x_{t+1}−x*; x_t−x*]‖ ≤ O(ρ(A)^t) = O(√μ^t), where ρ(A) is the spectral radius of the iteration matrix.
  • Nesterov NAG, β-smooth: rate O(1/T²); gap f(y_t) − f(x*).
  • Nesterov NAG, α-strongly convex + β-smooth: rate O(exp(−T/√κ)); gap f(y_t) − f(x*) ≤ ((α+β)/2)·‖x_1−x*‖²·exp(−(t−1)/√κ).
  • SGD, L-Lipschitz via ‖g̃(x)‖ ≤ L with probability 1: rate O(1/√T) (for an ε-tolerance, T ≥ B²L²/ε²); step γ = B/(L√T); gap E[f(x̄)] − f(x*), where x* ∈ argmin_{‖x‖≤B} f(x), bounded by BL/√T.
  • SGD, α-strongly convex + E‖g̃(x)‖²_* ≤ B² (a bounded-second-moment condition, not smoothness): rate O(1/T); step γ_s = 2/(α(s+1)); gap f(∑_{s=1}^t (2s/(t(t+1))) x_s) − f(x*) ≤ 2B²/(α(t+1)).
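A runnable sketch of the strongly convex + smooth row (the quadratic is illustrative): gradient descent with γ = 2/(α+β) converges linearly, and the measured gap stays below (β/2)·exp(−4t/(κ+1))·‖x_1−x*‖².

    import numpy as np

    # Quadratic f(x) = 0.5 x^T A x with alpha*I <= A <= beta*I; x* = 0, f(x*) = 0.
    eigs = np.array([1.0, 4.0, 10.0])        # alpha = 1, beta = 10
    A = np.diag(eigs)
    alpha, beta = eigs.min(), eigs.max()
    kappa = beta / alpha
    f = lambda x: 0.5 * x @ A @ x

    x = np.array([1.0, 1.0, 1.0])
    x1 = x.copy()
    gamma = 2 / (alpha + beta)
    for t in range(1, 31):
        x = x - gamma * (A @ x)              # gradient step
        bound = beta / 2 * np.exp(-4 * t / (kappa + 1)) * np.dot(x1, x1)
        if t % 10 == 0:
            print(f"t={t:2d}  gap={f(x):.3e}  bound={bound:.3e}")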

Estimate sequence

  • Definition

    A pair of sequences {φ_k(x)} and {λ_k}, λ_k ≥ 0, is called an estimate sequence of f if λ_k → 0 and, for any x ∈ Rⁿ and all k ≥ 0: φ_k(x) ≤ (1−λ_k)·f(x) + λ_k·φ_0(x).

  • Properties

    If, for some sequence {x_k}, we have f(x_k) ≤ φ_k* ≡ min_x φ_k(x) (this is condition (2.2.2)), then f(x_k) − f* ≤ λ_k·[φ_0(x*) − f*] → 0.

    Thus, for any sequence {λ_k} satisfying (2.2.2), we can derive the rate of convergence of the minimization process directly from the rate of convergence of the sequence {λ_k}.

    Note: below in estimate sequence, all L means L-smooth function

  • How to form an estimate sequence?

    Take an arbitrary x_0, γ_0 > 0, a sequence of points {y_k}, and coefficients α_k ∈ (0, 1) with ∑ α_k = ∞. Set φ_0(x) = f(x_0) + (γ_0/2)‖x − x_0‖², λ_0 = 1, and recursively:
    λ_{k+1} = (1 − α_k)·λ_k,
    φ_{k+1}(x) = (1 − α_k)·φ_k(x) + α_k·[f(y_k) + ⟨∇f(y_k), x − y_k⟩ + (μ/2)‖x − y_k‖²].

    These φ_k keep the canonical form φ_k(x) = φ_k* + (γ_k/2)‖x − v_k‖², with explicit recursions for γ_k, v_k, and φ_k*.

  • How to ensure (2.2.2)?

    • Method 1: run scheme (2.2.6); it generates a sequence {x_k}_{k=0}^∞ such that f(x_k) − f* ≤ λ_k·[f(x_0) − f* + (γ_0/2)‖x_0 − x*‖²], which makes the sequence satisfy (2.2.2).

    • Method 2: use a gradient step, x_{k+1} = y_k − (1/L)∇f(y_k), which ensures f(x_{k+1}) ≤ f(y_k) − (1/(2L))‖∇f(y_k)‖², enough to maintain (2.2.2).

  • Some cases

    • If in scheme (2.2.6) we choose γ_0 ≥ μ, then λ_k ≤ min{(1 − √(μ/L))^k, 4L/(2√L + k√γ_0)²}.
    • If in scheme (2.2.6) we choose γ_0 = L, then the scheme generates a sequence {x_k}_{k=0}^∞ such that f(x_k) − f* ≤ L·min{(1 − √(μ/L))^k, 4/(k+2)²}·‖x_0 − x*‖². This means it is optimal for the class S^{1,1}_{μ,L}(Rⁿ) with μ ≥ 0.
    • If in scheme (2.2.8) we choose α_0 ≥ √(μ/L), then the scheme generates a sequence {x_k}_{k=0}^∞ such that f(x_k) − f* ≤ min{(1 − √(μ/L))^k, 4L/(2√L + k√γ_0)²}·[f(x_0) − f* + (γ_0/2)‖x_0 − x*‖²], where γ_0 = α_0(α_0L − μ)/(1 − α_0). Here α_0 ≥ √(μ/L) is equivalent to γ_0 ≥ μ.

Convex sets

Affine sets

  • Affine sets:
    • Definition: a set C ⊆ Rⁿ is affine if the line through any two distinct points in C lies in C, i.e., for any x_1, x_2 ∈ C and θ ∈ R, one has θx_1 + (1−θ)x_2 ∈ C. In other words, C contains every affine combination of any two of its points.
    • Induction: if C is an affine set, x_1, ⋯, x_k ∈ C, and θ_1 + ⋯ + θ_k = 1, then the point θ_1x_1 + ⋯ + θ_kx_k also belongs to C.
  • Affine hull
    • Definition: the set of all affine combinations of points in some set C ⊆ Rⁿ, denoted aff C.
    • It is the smallest affine set that contains C.
  • Affine dimension
    • Definition: the dimension of its affine hull.
    • E.g., for {x ∈ R² : x_1² + x_2² = 1} (the unit circle), the affine dimension is 2.

Convex sets

  • Definition: If every point in the set can be seen by every other point, along an unobstructed straight path between them, where unobstructed means lying in the set.
  • Every affine set is also convex.
  • Convex hull, denoted conv C, is the set of all convex combinations of points in C. It is always convex, and it is the smallest convex set that contains C.
  • More generally, suppose p:Rn→R satisfies p(x)≥0 for all x∈C and ∫Cp(x)dx=1, where C⊆Rn is convex, then ∫Cp(x)xdx∈C, if the integral exists.

Convex functions

Convex functions

  • Definition

    f: Rⁿ → R is convex if dom f is a convex set and, for all x, y ∈ dom f and θ ∈ [0, 1]: f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y).

  • All affine functions are both convex and concave.

  • f is convex if and only if, for all x ∈ dom f and all v, the function g(t) = f(x + tv) is convex on {t : x + tv ∈ dom f}.

Extended-value extensions

The extended-value extension of a convex f is f̃(x) = f(x) for x ∈ dom f and f̃(x) = ∞ otherwise; it preserves convexity while extending the domain to all of Rⁿ.

First-order conditions

Suppose f is differentiable. Then f is convex if and only if dom f is convex and, for all x, y ∈ dom f: f(y) ≥ f(x) + ∇f(x)ᵀ(y−x). (3.2)

The inequality (3.2) states that for a convex function, the first-order Taylor approximation is in fact a global underestimator of the function. Conversely, if the first-order Taylor approximation of a function is always a global underestimator of the function, then the function is convex.

  • The inequality (3.2) shows that if ∇f(x) = 0, then for all y ∈ dom f, f(y) ≥ f(x), i.e., x is a global minimizer of the function f.

Second-order conditions

Suppose f is twice differentiable. Then f is convex if and only if dom f is convex and its Hessian is positive semidefinite: ∇²f(x) ⪰ 0 for all x ∈ dom f. A numerical check of this condition follows.
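A numerical version of the second-order test (a sketch; the finite-difference step and the log-sum-exp example are illustrative choices): sample points, form the Hessian by central differences, and check that its smallest eigenvalue is nonnegative up to numerical noise.

    import numpy as np

    def hessian(f, x, eps=1e-4):
        """Central finite-difference Hessian of f at x."""
        d = x.size
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
                H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                           - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
        return H

    f = lambda x: np.logaddexp(x[0], x[1])   # log-sum-exp, a convex function
    rng = np.random.default_rng(0)
    min_eig = min(np.linalg.eigvalsh(hessian(f, rng.normal(size=2))).min()
                  for _ in range(100))
    print(f"smallest sampled Hessian eigenvalue: {min_eig:.2e}  (>= 0 up to noise)")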

More constraints on convex function

β-smooth
  • Definition: a continuously differentiable function f is β-smooth if its gradient ∇f is β-Lipschitz, that is ‖∇f(x) − ∇f(y)‖ ≤ β‖x−y‖.
  • If f is twice differentiable, this is equivalent to the eigenvalues of the Hessian being at most β at every point: ∇²f(x) ⪯ βIₙ, ∀x.
  • Smoothness removes the need to average the iterates in the convergence guarantees.
  • Extending β-smoothness to other powers of ‖x−y‖ gives the Hölder condition.
  • The more rapidly the gradient can change, the larger β must be, and hence the smaller the safe step size 1/β.
α-strong convexity

Strong convexity can significantly speed-up the convergence of first order methods.

  • Definition

    We say that f: X → R is α-strongly convex if it satisfies the following improved subgradient inequality:

    f(x) − f(y) ≤ ∇f(x)ᵀ(x−y) − (α/2)‖x−y‖². A larger value of α leads to a faster rate. For a twice differentiable f, α-strong convexity can also be interpreted as ∇²f(x) ⪰ αIₙ, ∀x.

  • Strong convexity plus β-smoothness lets gradient descent with a constant step size achieve a linear rate of convergence; precisely, the oracle complexity is O((β/α)·log(1/ε)), β ≥ α. In some sense strong convexity is a dual assumption to smoothness, and in fact this can be made precise within the framework of Fenchel duality.

  • κ = β/α can often be as large as the sample size. Reducing the number of steps from the sample size to its square root (basic gradient descent needs O(κ·log(1/ε)) iterations to reach ε-accuracy, while Nesterov's accelerated gradient descent attains the improved oracle complexity O(√κ·log(1/ε))) can be a huge deal, especially in large-scale applications.

Examples of Convex functions

  • Norms
  • Max function
  • Quadratic-over-linear function x²/y (for y > 0)
  • Log-sum-exp: log(e^{x_1} + ⋯ + e^{x_n}), a differentiable approximation of the max function (a numerically stable implementation follows below)
  • Geometric mean: (∏_{i=1}^n x_i)^{1/n}, concave
  • Log-determinant: log det X, concave

For proofs, check Section 3.1.5 of Convex Optimization.
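Because log-sum-exp overflows when computed naively for large x_i, the standard implementation shifts by the max first (which is also why it behaves like a smooth max). A sketch:

    import numpy as np

    def logsumexp(x):
        """Numerically stable log(sum(exp(x))): shift by the max first."""
        m = np.max(x)
        return m + np.log(np.sum(np.exp(x - m)))

    x = np.array([1000.0, 1001.0, 1002.0])
    print(logsumexp(x))   # ~1002.41; naive np.log(np.sum(np.exp(x))) overflows
    print(np.max(x))      # log-sum-exp is a smooth upper bound on the max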

Sublevel sets

The α-sublevel set of f is C_α = {x ∈ dom f : f(x) ≤ α}. Sublevel sets of a convex function are convex; the converse is false (e.g., quasiconvex functions).

Epigraph

A function is convex if and only if its epigraph is a convex set.

The epigraph of f is epi f = {(x, t) : x ∈ dom f, f(x) ≤ t}.

Jensen's inequality and extensions

Once a function f is convex, Jensen's inequality follows: f(E[X]) ≤ E[f(X)] whenever the expectations exist; the simplest version is f((x+y)/2) ≤ (f(x) + f(y))/2.

Holder's inequality

For p, q > 1 with 1/p + 1/q = 1: ∑_i |x_i y_i| ≤ (∑_i |x_i|^p)^{1/p} · (∑_i |y_i|^q)^{1/q}.

Operations that preserve convexity

  • Nonnegative weighted sums: f=w1f1+⋯+wmfm

  • Composition with an affine mapping: g(x)=f(Ax+b). If f is convex, so is g; if f is concave, so is g.

  • Pointwise maximum and supremum: f(x) = max{f_1(x), f_2(x)} and, more generally, f(x) = max{f_1(x), ⋯, f_m(x)}

  • Composition: f(x)=h(g(x))

    • Scalar composition

      With g: Rⁿ → R and h: R → R, f = h ∘ g is convex if: g is convex and h is convex and nondecreasing; or g is concave and h is convex and nonincreasing (the monotonicity conditions are stated on the extended-value extension h̃).

    • Vector composition f(x) = h(g(x)) = h(g_1(x), ⋯, g_k(x))

      f is convex if h is convex and, for each i, either h̃ is nondecreasing in its i-th argument and g_i is convex, or h̃ is nonincreasing in its i-th argument and g_i is concave, or g_i is affine.

    • Minimization: if f(x, y) is convex in (x, y) and C is a convex set, then g(x) = inf_{y∈C} f(x, y) is convex.

Typical Numerical optimization

Gradient descent

The basic principle behind gradient descent is to take a small step in the direction that minimizes the local first-order Taylor approximation of f (also known as the steepest-descent direction). Such methods attain an oracle complexity independent of the dimension.

x_{t+1} = x_t − η∇f(x_t)

Take f(w) = ½wᵀAw − bᵀw, w ∈ Rⁿ, with A symmetric and invertible, so A = QΛQᵀ, Λ = diag(λ_1, ⋯, λ_n), λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_{n−1} ≤ λ_n.

  • All errors are not made equal. Indeed, there are different kinds of errors, n to be exact, one for each of the eigenvectors of A:

    f(w_k) − f(w*) = ∑_i (1 − αλ_i)^{2k} λ_i [x_i^0]², where x_i^0 is the i-th component of the initial error in the eigenbasis.

  • Denote the condition number κ = λ_n/λ_1; the bigger κ is, the slower gradient descent will be, because the condition number is a direct description of pathological curvature.

  • The optimal step size, α = 2/(λ_1 + λ_n), causes the first and last eigenvectors to converge at the same rate, (κ−1)/(κ+1). A runnable sketch follows.
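A sketch of the per-eigenspace picture (illustrative diagonal quadratic): each error component decays as (1−αλ_i)^k, and the optimal step α = 2/(λ_1+λ_n) equalizes the rates of the first and last eigenvectors.

    import numpy as np

    eigs = np.array([1.0, 10.0, 100.0])   # lambda_1 <= ... <= lambda_n
    A, b = np.diag(eigs), np.zeros(3)     # f(w) = 0.5 w^T A w - b^T w, w* = 0
    alpha = 2 / (eigs[0] + eigs[-1])      # optimal step size
    w = np.ones(3)
    for k in range(50):
        w = w - alpha * (A @ w - b)
    # Error in eigenspace i decays as (1 - alpha * lambda_i)^k.
    print(np.abs(w))                      # first and last components match:
    print(np.abs(1 - alpha * eigs) ** 50) # both decay at rate (kappa-1)/(kappa+1)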

Projected gradient descent

  • Subgradient

    g is a subgradient of f at x if, for all y: f(y) ≥ f(x) + gᵀ(y − x); the set of subgradients of f at x is denoted ∂f(x).

  • Projected subgradient descent

    y_{t+1} = x_t − η g_t with g_t ∈ ∂f(x_t), then x_{t+1} = Π_X(y_{t+1}), where Π_X is the Euclidean projection onto X.

    Projection does not hurt the analysis because Π_X is nonexpansive: ‖Π_X(x) − y‖ ≤ ‖x − y‖ for any y ∈ X. A sketch with a ball constraint follows.
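A sketch of projected gradient descent (the objective and the unit-ball constraint are illustrative); projecting onto {‖x‖ ≤ r} simply rescales the point.

    import numpy as np

    def project_ball(x, r=1.0):
        """Euclidean projection onto {x : ||x|| <= r}."""
        nrm = np.linalg.norm(x)
        return x if nrm <= r else x * (r / nrm)

    # Minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball.
    c = np.array([2.0, 2.0])
    grad = lambda x: 2 * (x - c)
    x, eta = np.zeros(2), 0.1
    for _ in range(100):
        x = project_ball(x - eta * grad(x))
    print(x)   # -> c / ||c|| = [0.707..., 0.707...], the boundary point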

Gradient descent with momentum : Polyak's Momentum

  • Sometimes SGD fails because of pathological curvature of the objective (valleys, trenches).

  • Momentum modifies gradient descent by adding a short-term memory:

    y_{t+1} = βy_t + ∇f(x_t), x_{t+1} = x_t − αy_{t+1}.

    When β = 0, this is exactly gradient descent.

  • Momentum allows us to crank the step size up by a factor of about 2 before diverging.

  • Optimizing over β: the critical value β = (1 − √(αλ_i))² gives a convergence rate (in eigenspace i) of 1 − √(αλ_i): a square-root improvement over gradient descent's 1 − αλ_i! Alas, this only applies to the error in the i-th eigenspace, with α fixed.

  • Failure case: there exist strongly convex and smooth functions for which, by carefully choosing the hyperparameters α and β and the initial point x_0, the heavy-ball method fails to converge. (A sketch of the update follows.)
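A sketch of the two-sequence heavy-ball update above (illustrative ill-conditioned quadratic), comparing plain gradient descent (β = 0) with optimally tuned momentum; the optimal parameters below follow the standard heavy-ball tuning for quadratics.

    import numpy as np

    eigs = np.array([1.0, 100.0])
    A = np.diag(eigs)                     # f(x) = 0.5 x^T A x, minimum at 0
    f = lambda x: 0.5 * x @ A @ x

    def heavy_ball(alpha, beta, steps=100):
        x, y = np.ones(2), np.zeros(2)
        for _ in range(steps):
            y = beta * y + A @ x          # y_{t+1} = beta*y_t + grad f(x_t)
            x = x - alpha * y             # x_{t+1} = x_t - alpha*y_{t+1}
        return f(x)

    print(heavy_ball(alpha=2 / 101, beta=0.0))    # plain GD (beta = 0)
    # Optimal tuning for quadratics: alpha = (2/(sqrt(l_max)+sqrt(l_min)))^2,
    # beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2.
    a_opt = (2 / (np.sqrt(100.0) + np.sqrt(1.0))) ** 2
    b_opt = ((np.sqrt(100.0) - 1) / (np.sqrt(100.0) + 1)) ** 2
    print(heavy_ball(a_opt, b_opt))               # orders of magnitude smaller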

Nesterov’s Accelerated Gradient Descent

  • Some notes on it

  • Iterations: starting at an arbitrary initial point x_1 = y_1,

    y_{s+1} = x_s − (1/β)∇f(x_s), x_{s+1} = (1 + (√κ−1)/(√κ+1))·y_{s+1} − ((√κ−1)/(√κ+1))·y_s.

  • Let f be β-smooth and α-strongly convex; then Nesterov's accelerated gradient descent satisfies, for t ≥ 0, f(y_t) − f(x*) ≤ ((α+β)/2)·‖x_1−x*‖²·exp(−(t−1)/√κ).

  • Converges in O(1/T²) in the smooth case and in O(exp(−T/√κ)) in the strongly convex case; unlike the heavy-ball method, whose guarantee covers quadratic (but not piecewise-quadratic) functions, its convergence is guaranteed for general smooth convex functions. A sketch of the iteration follows.
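A sketch of the strongly convex NAG iteration above (same illustrative quadratic as before):

    import numpy as np

    eigs = np.array([1.0, 100.0])             # alpha = 1, beta = 100
    A = np.diag(eigs)
    f = lambda x: 0.5 * x @ A @ x             # minimum value 0 at x* = 0
    alpha, beta = eigs.min(), eigs.max()
    kappa = beta / alpha
    m = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient

    x = y = np.ones(2)                        # x_1 = y_1
    for _ in range(100):
        y_next = x - (1 / beta) * (A @ x)     # gradient step on x_s
        x = (1 + m) * y_next - m * y          # momentum extrapolation
        y = y_next
    print(f(y))                               # decays roughly like exp(-t/sqrt(kappa))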

Stochastic gradient descent

  • Cases: in one, the oracle returns an unbiased estimate with E_ξ[∇_x ℓ(x, ξ)] ∈ ∂f(x), where ξ is freshly sampled; this randomness cannot be reproduced. In the other, we directly minimize f(x) = (1/m)∑_{i=1}^m f_i(x) and the gradient is reported as ∇f_I(x) with I drawn from [m]; this randomness can be reproduced.

  • Non-smooth stochastic optimization

    • Definition: there exists B > 0 such that E‖g̃(x)‖²_* ≤ B² for all x ∈ X.
    • Projected SGD with step η = R/(B√t) satisfies E[f((1/t)∑_{s=1}^t x_s)] − f(x*) ≤ RB/√t, where R bounds the radius of X.
    • In the α-strongly convex case, with steps η_s = 2/(α(s+1)), E[f(∑_{s=1}^t (2s/(t(t+1))) x_s)] − f(x*) ≤ 2B²/(α(t+1)).
  • Smooth stochastic optimization and mini-batch SGD

    • Definition: there exists σ > 0 such that E‖g̃(x) − ∇f(x)‖²_* ≤ σ² for all x ∈ X.

    • Smoothness does not bring any acceleration for a general stochastic oracle, while in the exact oracle case it does.

    • Stochastic smooth optimization converges at rate 1/√t; deterministic smooth optimization converges at rate 1/t.

    • Mini-batch SGD converges between 1/√t and 1/t: averaging a mini-batch of size m reduces the gradient-noise term by a factor of 1/m. A sketch follows.
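A sketch of the interpolation (illustrative least-squares problem; NumPy only): larger mini-batches shrink the gradient noise and push SGD toward the deterministic behavior.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10_000, 10
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.5 * rng.normal(size=n)
    w_star = np.linalg.lstsq(X, y, rcond=None)[0]
    f = lambda w: 0.5 * np.mean((X @ w - y) ** 2)

    def sgd(batch, steps=2000, eta=0.05):
        w = np.zeros(d)
        for _ in range(steps):
            idx = rng.integers(0, n, size=batch)           # sample a mini-batch
            g = X[idx].T @ (X[idx] @ w - y[idx]) / batch   # unbiased gradient
            w -= eta * g
        return f(w) - f(w_star)

    for m in (1, 10, 100):
        print(f"batch={m:3d}  sub-optimality: {sgd(m):.4f}")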

Relations


The momentum SGD and Adaptive optimizers

YellowFin and the Art of Momentum Tuning

  • Summary: hand-tuning a single learning rate and momentum makes SGD competitive with Adam. The proposed YellowFin, an automatic tuner for momentum and learning rate in SGD, can converge in fewer iterations than Adam on ResNets and LSTMs for image recognition, language modeling, and constituency parsing.
  • Adaptive optimizers: e.g., Adam, AdaGrad, RMSProp.
  • For details of the paper, check here.

Dimension-free convex optimization

Projected subgradient descent for Lipschitz functions

  • Theorem

    Assume that X is contained in a Euclidean ball centered at x_1 ∈ X with radius R, and that for any x ∈ X and any g ∈ ∂f(x) (assume ∂f(x) ≠ ∅) one has ‖g‖ ≤ L. (This implies that f is L-Lipschitz on X, that is |f(x) − f(y)| ≤ L‖x − y‖.)

    The projected subgradient descent method with η = R/(L√t) satisfies f((1/t)∑_{s=1}^t x_s) − f(x*) ≤ RL/√t.

    • For the proof, see Section 3.1 of Convex Optimization: Algorithms and Complexity.

    • The rate is unimprovable from a black-box perspective.

Gradient descent for smooth functions

Theorems under unconstrained cases

In this section f is a convex and β-smooth function on Rⁿ.

  • Theorem

    Let f be a convex and β-smooth function on Rⁿ. Then gradient descent with η = 1/β satisfies f(x_t) − f(x*) ≤ 2β‖x_1−x*‖²/(t−1).

    For the proof, check Section 3.2 of Convex Optimization: Algorithms and Complexity.

  • Gradient descent attains a much faster rate for β-smooth functions than in the non-smooth case of the previous section.

  • The definition of smooth convex functions

    f is β-smooth iff ∇f is β-Lipschitz; equivalently, for all x, y: 0 ≤ f(y) − f(x) − ∇f(x)ᵀ(y−x) ≤ (β/2)‖y−x‖².

The constrained cases

This time consider the projected gradient descent algorithm x_{t+1} = Π_X(x_t − η∇f(x_t)).

  • Lemma

    Let x, y ∈ X, x⁺ = Π_X(x − (1/β)∇f(x)) and g_X(x) = β(x − x⁺). Then the following holds: f(x⁺) − f(y) ≤ g_X(x)ᵀ(x − y) − (1/(2β))‖g_X(x)‖².

  • Theorem

    Let f be a convex and β-smooth function on X. Then projected gradient descent with η = 1/β satisfies f(x_t) − f(x*) ≤ (3β‖x_1−x*‖² + f(x_1) − f(x*))/t.

Strong convexity

Strongly convex and Lipschitz functions

For a possibly non-differentiable f, strong convexity is stated with subgradients: for any x, y ∈ X and g ∈ ∂f(x), f(x) − f(y) ≤ gᵀ(x − y) − (α/2)‖x − y‖².

  • Theorem

    Let f be L-Lipschitz and α-strongly convex on X. Then projected subgradient descent with η_s = 2/(α(s+1)) satisfies f(∑_{s=1}^t (2s/(t(t+1))) x_s) − f(x*) ≤ 2L²/(α(t+1)).

The combination of α-strong convexity and L-Lipschitz continuity means the function has to be constrained to a bounded domain.

Strongly convex and smooth functions
  • Theorem

    Let f be β-smooth and α-strongly convex on X. Then projected gradient descent with η = 1/β satisfies, for t ≥ 0, ‖x_{t+1} − x*‖² ≤ exp(−t/κ)‖x_1 − x*‖².

    The intuition for α and β: strong convexity puts a quadratic lower bound under the function (scaling with α) and smoothness puts a quadratic upper bound over it (scaling with β); tightening either bound squeezes the function between two parabolas and makes it easier to optimize.

  • Lemma

    Let f be β-smooth and α-strongly convex on Rⁿ. Then for all x, y ∈ Rⁿ, one has (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (αβ/(α+β))‖x − y‖² + (1/(α+β))‖∇f(x) − ∇f(y)‖².

  • Theorem

    Let f be β-smooth and α-strongly convex on Rⁿ, with κ = β/α the condition number. Then gradient descent with η = 2/(β+α) satisfies f(x_{t+1}) − f(x*) ≤ (β/2)·exp(−4t/(κ+1))·‖x_1 − x*‖².

Lower bound -- black box

  • A black-box procedure is a mapping from "history" to the next query point; that is, it maps (x_1, g_1, ⋯, x_t, g_t) (with g_s ∈ ∂f(x_s)) to x_{t+1}. To simplify, make the following assumption on the black-box procedure: x_1 = 0 and, for any t ≥ 0, x_{t+1} lies in the linear span of g_1, ⋯, g_t, that is x_{t+1} ∈ Span(g_1, ⋯, g_t). (3.15)

  • Theorem

    • Let t ≤ n and L, R > 0. There exists a convex and L-Lipschitz function f such that, for any black-box procedure satisfying (3.15), min_{1≤s≤t} f(x_s) − min_{x∈B₂(R)} f(x) ≥ RL/(2(1+√t)), where B₂(R) = {x ∈ Rⁿ : ‖x‖ ≤ R}. There also exists an α-strongly convex and L-Lipschitz function f such that, for any black-box procedure satisfying (3.15), min_{1≤s≤t} f(x_s) − min_{x∈B₂(L/(2α))} f(x) ≥ L²/(8αt).
    • Let t ≤ (n−1)/2 and β > 0. There exists a β-smooth convex function f such that, for any black-box procedure satisfying (3.15), min_{1≤s≤t} f(x_s) − f(x*) ≥ (3β/32)·‖x_1 − x*‖²/(t+1)².
    • Let κ ≥ 1. There exists a β-smooth and α-strongly convex function f: ℓ₂ → R with κ = β/α such that, for any t ≥ 1 and any black-box procedure satisfying (3.15), one has f(x_t) − f(x*) ≥ (α/2)·((√κ−1)/(√κ+1))^{2(t−1)}·‖x_1 − x*‖². Note that for large values of the condition number κ, ((√κ−1)/(√κ+1))^{2(t−1)} ≈ exp(−4(t−1)/√κ).

References

  • 凸优化笔记5: Subgradient method (blog notes, in Chinese)
  • Convex Optimization, Stephen Boyd and Lieven Vandenberghe
  • Convex Optimization: Algorithms and Complexity, Sébastien Bubeck
  • Understanding Analysis, Stephen Abbott: a nice, light introduction to analysis