Question:
I am reading this nice document about the subgradient method, which defines the subgradient method iteration as follows.
$$x_{k+1} = x_k - \alpha_k g^k$$
for a $g$ such that
$$f(y) \geq f(x) + g^T (y - x)$$
If $f$ is differentiable, then $g$ is its gradient. This seems to suggest that for any valid $g$, we are ensured to increase (not strictly) the value of $f$. However, the same document states the following:
> The subgradient method is not a descent method; the function value can (and often does) increase.
and proposes to keep track of $f$ at each iteration, in order to remember the best value found so far.
This seems to contradict the first statement about the choice of $g$: if we are choosing a $g$ such that $f$ increases, how can the next $f$ not be the best so far?
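For reference, here is a minimal sketch of the iteration with the best-value bookkeeping the document proposes (my own Python, not from the document; `f`, `subgrad`, and the diminishing step size $1/k$ are placeholder choices):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps=100):
    """x_{k+1} = x_k - alpha_k * g_k, tracking the best value seen."""
    x, x_best, f_best = x0, x0, f(x0)
    for k in range(1, steps + 1):
        g = subgrad(x)            # any valid subgradient at x
        x = x - (1.0 / k) * g     # diminishing step size alpha_k = 1/k
        if f(x) < f_best:         # f(x) is NOT guaranteed to improve,
            x_best, f_best = x, f(x)  # hence the explicit bookkeeping
    return x_best, f_best

# Toy run: f(x) = |x|, subgradient sign(x) (any value in [-1, 1] works at 0)
print(subgradient_method(abs, np.sign, x0=1.0))
```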
Answer:
I believe your confusion comes from misreading the direction of the inequality. If $g^k$ is a subgradient at $x^k$, then taking $y = x^{k+1}$ and $x = x^k$, the subgradient inequality gives:
\begin{align*} f(x^{k+1}) &= f(x^k - \alpha g^k) \\ &\geq f(x^k) + (g^k)^T(-\alpha g^k) \\ &= f(x^k) - \alpha \|g^k\|_2^2 \end{align*}
This means that $f(x^{k+1})$ can be any value above $f(x^k) - \alpha \|g^k\|_2^2$, and in particular it can be larger than $f(x^k)$. So the subgradient inequality does not ensure that the method is a descent method; it only bounds the next function value from below.
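For a concrete instance (my own toy example, not from the document): take $f(x) = |x|$, $x^k = 0.4$, $g^k = 1$ and $\alpha = 1$. Then $x^{k+1} = x^k - \alpha g^k = -0.6$, so $f(x^{k+1}) = 0.6 > 0.4 = f(x^k)$: a perfectly valid subgradient step that increases the function value.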
In contrast to the subgradient method, when $f$ is differentiable and $\nabla f$ is Lipschitz continuous with constant $L$, you have the Descent Lemma:
$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|_2^2$$
The Descent Lemma is the property that ensures descent, and it does not necessarily hold when you replace the gradient with a subgradient.
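To see how the lemma yields descent (a standard computation I am adding for completeness, not part of the quoted document): take the gradient step $y = x - \alpha \nabla f(x)$ with $0 < \alpha \leq 1/L$. The lemma gives

\begin{align*} f(y) &\leq f(x) - \alpha \|\nabla f(x)\|_2^2 + \frac{L \alpha^2}{2} \|\nabla f(x)\|_2^2 \\ &= f(x) - \alpha \left(1 - \frac{L \alpha}{2}\right) \|\nabla f(x)\|_2^2 \leq f(x) - \frac{\alpha}{2} \|\nabla f(x)\|_2^2, \end{align*}

so a gradient step with a small enough step size can never increase $f$. No analogous bound is available when $g$ is merely a subgradient.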