Last modified: December 22, 2024
Gradient Descent
Gradient Descent is a fundamental first-order optimization algorithm widely used in mathematics, statistics, machine learning, and artificial intelligence. Its principal aim is to find the minimum of a given differentiable function $f(x)$. Instead of searching blindly, it uses gradient information — the direction of steepest increase — to iteratively move toward a minimum by taking steps in the opposite direction, known as the steepest descent direction.
Because it only requires the gradient of the function rather than its Hessian or other higher-order derivatives, gradient descent is often the method of choice for large-scale optimization problems. For instance, in machine learning, gradient descent is the backbone of training many models, from linear regression to deep neural networks.
Conceptual Illustration:
Imagine the function $f(x)$ as a landscape of hills and valleys. Gradient descent resembles a blindfolded hiker who can feel the slope of the ground beneath their feet. The hiker takes small steps in the direction of the steepest downward slope to eventually reach a valley (a minimum of $f$).
In higher dimensions (e.g., $x \in \mathbb{R}^n$), the same idea applies, with the slope felt underfoot replaced by the gradient vector $\nabla f(x)$.
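For instance, for $f(x, y) = x^2 + y^2$ the gradient is

$$\nabla f(x, y) = (2x,\ 2y),$$

so at the point $(1, 2)$ the steepest descent direction is $-(2, 4)$, which points straight back toward the minimum at the origin.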
Mathematical Formulation
Given a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, we want to solve:
$$\min_{x \in \mathbb{R}^n} f(x)$$
Gradient Descent Update Rule:
At iteration $n$, we have the current approximation $x_n$. The gradient at $x_n$, denoted $\nabla f(x_n)$, points in the direction of greatest increase of $f$. To move towards a minimum, we take a step opposite to the gradient direction:
$$x_{n+1} = x_n - \alpha \nabla f(x_n)$$
where $\alpha > 0$ is the learning rate (also called the step size).
Key Components:
I. Gradient $\nabla f(x)$: A vector of partial derivatives indicating the direction and rate of fastest increase of $f$.
II. Learning Rate $\alpha$: Determines how far we move in the direction of the negative gradient. If $\alpha$ is too large, we might overshoot the minimum; if it’s too small, convergence may be very slow.
Over successive iterations, provided the learning rate is chosen suitably and the function is well-behaved (e.g., convex), $x_n$ converges to a point $x^*$ where $\nabla f(x^*) = 0$, indicating a critical point, often a minimum.
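The update rule is short enough to state directly in code. The following is a minimal sketch; the quadratic test function and the hyperparameter values are illustrative choices, not part of the method:

```python
import numpy as np

def gradient_descent_step(x, grad_f, alpha):
    """One update: x_{n+1} = x_n - alpha * grad_f(x_n)."""
    return x - alpha * grad_f(x)

# Illustrative test problem: f(x) = x_1^2 + x_2^2, so grad f(x) = 2x.
grad_f = lambda x: 2 * x

x = np.array([5.0, -3.0])  # starting point x_0
alpha = 0.1                # learning rate

for n in range(3):
    x = gradient_descent_step(x, grad_f, alpha)
    print(n + 1, x)        # each step scales x by (1 - 2*alpha) = 0.8
```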
Derivation
I. First-Order Taylor Expansion:
Consider the first-order Taylor expansion of $f$ at $x_n$:
$$f(x_{n+1}) \approx f(x_n) + \nabla f(x_n)^\top (x_{n+1}-x_n).$$
To reduce $f$, we want $x_{n+1}$ to move in a direction that decreases this linear approximation. Among all steps of a fixed length, the inner product $\nabla f(x_n)^\top (x_{n+1}-x_n)$ is most negative when the step points along $-\nabla f(x_n)$, so the opposite of the gradient is the direction of fastest decrease.
II. Steepest Descent:
The direction of steepest descent is given by $-\nabla f(x_n)$. To decide how far to move in that direction, we introduce the learning rate $\alpha$. Thus, the next iterate is:
$$x_{n+1} = x_n - \alpha \nabla f(x_n).$$
III. Convergence Considerations:
Under suitable conditions (e.g., if $f$ is convex and $\alpha$ is sufficiently small), the sequence $\{x_n\}$ generated by gradient descent converges to a global minimum:
$$\lim_{n \to \infty} x_n = x^*,$$
where $x^*$ is a minimizer of $f$.
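To make the decrease explicit, substitute the update $x_{n+1} - x_n = -\alpha \nabla f(x_n)$ into the Taylor expansion:

$$f(x_{n+1}) \approx f(x_n) - \alpha \|\nabla f(x_n)\|^2 \le f(x_n),$$

with equality only when $\nabla f(x_n) = 0$. As background (not derived here), a standard sufficient condition for convergence is that $f$ be convex with an $L$-Lipschitz gradient and that $\alpha \le 1/L$; the objective gap $f(x_n) - f(x^*)$ then shrinks at the rate $O(1/n)$.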
Algorithm Steps
Input:
- A differentiable function $f(x)$.
- A starting point $x_0$.
- A learning rate $\alpha > 0$.
- A convergence tolerance $\epsilon > 0$ or a maximum number of iterations $n_{\max}$.
Initialization:
Set $n = 0$.
Iteration:
I. Compute the gradient at the current point:
$$g_n = \nabla f(x_n).$$
II. Update the point:
$$x_{n+1} = x_n - \alpha g_n.$$
III. Check for convergence:
- If $\|x_{n+1} - x_n\| < \epsilon$ or $n > n_{\max}$, stop.
- Otherwise, set $n \leftarrow n + 1$ and repeat from step I.
Output:
- Approximate minimum $x_{n+1}$.
- Number of iterations performed.
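The steps above translate almost line-for-line into code. This is a sketch rather than a production implementation; the test function $f(x) = x^2$ and the default hyperparameter values are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-8, n_max=10_000):
    """Minimize f by gradient descent, following steps I-III above."""
    x = np.asarray(x0, dtype=float)
    for n in range(n_max):
        g = grad_f(x)                         # step I: gradient at x_n
        x_next = x - alpha * g                # step II: update
        if np.linalg.norm(x_next - x) < eps:  # step III: convergence check
            return x_next, n + 1
        x = x_next
    return x, n_max

# Illustrative run on f(x) = x^2, whose gradient is 2x, from x_0 = 5:
x_min, iters = gradient_descent(lambda x: 2 * x, x0=[5.0])
print(x_min, iters)  # x_min is very close to 0
```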
Example
Given Function:
$$f(x) = x^2.$$
We know $f$ is minimized at $x = 0$. Let’s apply gradient descent step-by-step to illustrate how the algorithm converges to $x=0$.
Setup:
- Initial guess: $x_0 = 5$.
- Learning rate: $\alpha = 0.1$.
- Gradient: $f'(x) = 2x$.
Iteration 1:
- Current point: $x_0 = 5$.
- Compute gradient: $f'(5) = 2 \cdot 5 = 10$.
- Update:
$$x_1 = x_0 - \alpha f'(x_0) = 5 - 0.1 \cdot 10 = 5 - 1 = 4.$$
Iteration 2:
- Current point: $x_1 = 4$.
- Compute gradient: $f'(4) = 2 \cdot 4 = 8$.
- Update:
$$x_2 = x_1 - \alpha f'(x_1) = 4 - 0.1 \cdot 8 = 4 - 0.8 = 3.2.$$
Iteration 3:
- Current point: $x_2 = 3.2$.
- Compute gradient: $f'(3.2) = 2 \cdot 3.2 = 6.4$.
- Update:
$$x_3 = x_2 - \alpha f'(x_2) = 3.2 - 0.1 \cdot 6.4 = 3.2 - 0.64 = 2.56.$$
Continuing this process, we observe that $x_n$ keeps getting closer to 0. Over many iterations, the point will approach the exact minimum at $x = 0$.
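For this particular function the pattern can be made exact. Substituting $f'(x) = 2x$ into the update rule gives

$$x_{n+1} = x_n - \alpha \cdot 2x_n = (1 - 2\alpha)\,x_n = 0.8\,x_n,$$

so the iterates form the geometric sequence $x_n = 5 \cdot 0.8^n \to 0$. This also previews why the learning rate matters: the iteration contracts only when $|1 - 2\alpha| < 1$, i.e., for $0 < \alpha < 1$ on this function.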
Advantages
I. Simplicity and Ease of Implementation:
Gradient descent requires only the first-order gradient and is straightforward to code, making it popular across many applications.
II. Scalability:
It is well-suited for problems with very large parameter spaces, such as deep learning models with millions of parameters. More advanced variants, notably stochastic gradient descent, replace the exact gradient with a cheap estimate computed from a small batch of data and are widely used in machine learning; a minimal sketch follows this list.
III. No Need for Second-Order Information:
Unlike methods requiring Hessians, gradient descent only needs first-order derivatives, which are often cheaper to compute.
IV. Versatility:
It can be applied to a broad class of differentiable functions and is often effective for complex, high-dimensional optimization problems.
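As a concrete illustration of the stochastic variant mentioned above, here is a minimal mini-batch SGD sketch for least-squares regression; the synthetic data, batch size, and step count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: f(w) = ||X w - y||^2 / (2 m).
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
alpha, batch = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch)  # sample a random mini-batch
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ w - yb) / batch           # stochastic gradient estimate
    w -= alpha * g                             # same update rule, noisy gradient
print(w)  # approaches w_true
```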
Limitations
I. Learning Rate Sensitivity:
Choosing a good learning rate $\alpha$ is challenging. If $\alpha$ is too large, the algorithm may diverge; if it is too small, convergence can be painfully slow. Both failure modes are illustrated in the sketch after this list.
II. Local Minima and Saddle Points:
In non-convex optimization, gradient descent may get stuck in local minima or plateau at saddle points, failing to find the global minimum.
III. Ill-Conditioned Problems:
When the Hessian of $f$ is ill-conditioned (e.g., the function's surface is elongated or has ravines), gradient descent may zigzag or converge slowly, requiring techniques like preconditioning or adaptive learning rates.
IV. Dependence on Good Initialization:
Although gradient descent can work from a wide range of starting points, very poor initializations can lead to slow convergence or suboptimal solutions, depending on the function's structure.
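The learning-rate sensitivity from item I is easy to reproduce on $f(x) = x^2$, where each update multiplies $x$ by $1 - 2\alpha$. A quick sketch, with the specific $\alpha$ values chosen only for illustration:

```python
# Learning-rate sensitivity on f(x) = x^2, whose gradient is 2x.
def run(alpha, x=5.0, steps=20):
    for _ in range(steps):
        x = x - alpha * 2 * x  # gradient descent update
    return x

for alpha in (0.001, 0.1, 0.9, 1.1):
    print(f"alpha={alpha}: x_20 = {run(alpha):.6g}")
# alpha=0.001 barely moves; alpha=0.1 converges steadily;
# alpha=0.9 oscillates while shrinking; alpha=1.1 diverges (|1 - 2*alpha| > 1).
```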