Why is the subgradient not a descent method?

Question:

I am reading this nice document about the subgradient method, which defines the subgradient method iteration as follows.

$$x_{k+1} = x_k - \alpha_k g^k$$

for a $g$ such that

$$f(y) \geq f(x) + g^T (y - x)$$
If $f$ is differentiable, then $g$ is its gradient. This seems to suggest that, for any valid $g$, we are ensured to decrease (not strictly) the value of $f$. However, the same document states the following.

The subgradient method is not a descent method; the function value can (and often does) increase

and proposes keeping track of the value of $f$ at each iteration, so as to retain the best one found so far.
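For concreteness, that bookkeeping looks roughly like the sketch below (the step-size rule and the example objective are illustrative choices of mine, not taken from the document):

```python
# A minimal sketch of the subgradient iteration with best-value bookkeeping.
# The step-size rule and the example objective are illustrative assumptions,
# not taken from the linked notes.
def subgradient_method(f, subgrad, x0, num_iters=100):
    x = x0
    x_best, f_best = x0, f(x0)
    for k in range(num_iters):
        alpha = 1.0 / (k + 1)          # a typical diminishing step size alpha_k
        x = x - alpha * subgrad(x)     # x_{k+1} = x_k - alpha_k * g^k
        if f(x) < f_best:              # remember the best iterate seen so far,
            x_best, f_best = x, f(x)   # since f(x_k) itself need not decrease
    return x_best, f_best

# Example: minimize f(x) = |x| starting from x0 = 5.
x_best, f_best = subgradient_method(abs, lambda x: 1.0 if x >= 0 else -1.0, 5.0)
```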

This seems to contradict the first statement about the choice of $g$: if we are choosing a $g$ such that $f$ decreases, how can the latest value of $f$ not be the best so far?


Answer:

I believe your confusion comes from misreading the direction of the inequality. If $g^k$ is a subgradient at $x^k$, then taking $y = x^{k+1}$ and $x = x^k$, the subgradient inequality gives:
$$
\begin{align*}
f(x^{k+1}) &= f(x^k - \alpha g^k) \\
&\geq f(x^k) + (g^k)^T(-\alpha g^k) \\
&= f(x^k) - \alpha \|g^k\|_2^2
\end{align*}
$$

This means that $f(x^{k+1})$ can be any value above $f(x^k) - \alpha \|g^k\|_2^2$, and in particular any value above $f(x^k)$: the subgradient inequality only bounds $f(x^{k+1})$ from below, not from above. So it does not ensure that the method is a descent method.
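A tiny numerical example makes this concrete (the particular function and subgradient here are my own illustration, not from the original answer): take $f(x_1, x_2) = |x_1| + 2|x_2|$ at $x = (1, 0)$; then $g = (1, 2)$ satisfies the subgradient inequality, yet stepping along $-g$ increases $f$.

```python
import numpy as np

# Illustrative example (my own, not from the answer): f is convex but
# non-differentiable wherever a coordinate is zero.
def f(x):
    return abs(x[0]) + 2 * abs(x[1])

def subgrad(x):
    # One valid subgradient: sign(x1) in the first coordinate; in the second,
    # any value in [-2, 2] is valid when x2 == 0, and we pick 2.
    g1 = np.sign(x[0]) if x[0] != 0 else 1.0
    g2 = 2.0 * np.sign(x[1]) if x[1] != 0 else 2.0
    return np.array([g1, g2])

x = np.array([1.0, 0.0])            # f(x) = 1.0
alpha = 0.1
x_new = x - alpha * subgrad(x)      # step with the valid subgradient g = (1, 2)
print(f(x), f(x_new))               # 1.0 1.3 -- the function value went up
```

Here $-g$ is not even a descent direction for $f$ at this kink, which is exactly why the notes suggest tracking the best iterate found so far.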

In contrast to the subgradient method, when $f$ is differentiable and $\nabla f$ is Lipschitz continuous with constant $L$, you have the Descent Lemma:
$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|_2^2$$

The Descent Lemma is the property that ensures descent, because it bounds $f(y)$ from above, and it does not necessarily hold when you replace the gradient with a subgradient.
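For completeness, here is the standard one-line argument (not spelled out above) for why this upper bound yields descent: plugging the gradient step $y = x - \alpha \nabla f(x)$ into the Descent Lemma gives

$$f\big(x - \alpha \nabla f(x)\big) \leq f(x) - \alpha \left(1 - \frac{L\alpha}{2}\right) \|\nabla f(x)\|_2^2,$$

so for any step size $0 < \alpha < 2/L$ the function value cannot increase, and it strictly decreases whenever $\nabla f(x) \neq 0$. No analogous upper bound follows from the subgradient inequality alone.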


References:

  1. Subgradient Methods - MIT
  2. Why is the subgradient not a descent method?
  3. The subgradient method is not a true descent method
  4. Descent lemma of gradient descent algorithm