Gradient Descent is a fundamental first-order optimization algorithm widely used in mathematics, statistics, machine learning, and artificial intelligence. Its principal aim is to find the minimum of a given differentiable function $f(x)$. Instead of searching blindly, it uses gradient information — the direction of steepest increase — to iteratively move toward a minimum by taking steps in the opposite direction, known as the steepest descent direction.
Because it only requires the gradient of the function rather than its Hessian or other higher-order derivatives, gradient descent is often the method of choice for large-scale optimization problems. For instance, in machine learning, gradient descent is the backbone of training many models, from linear regression to deep neural networks.
Conceptual Illustration:
Imagine the function $f(x)$ as a landscape of hills and valleys. Gradient descent resembles a blindfolded hiker who can feel the slope of the ground beneath their feet. The hiker takes small steps in the direction of the steepest downward slope to eventually reach a valley (a minimum of $f$).
In higher dimensions (e.g., $x \in \mathbb{R}^n$), this idea generalizes similarly, just with gradients represented as vectors.
Given a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, we want to solve:
$$\min_{x \in \mathbb{R}^n} f(x)$$
Gradient Descent Update Rule:
At iteration $n$, we have the current approximation $x_n$. The gradient at $x_n$, denoted $\nabla f(x_n)$, points in the direction of greatest increase of $f$. To move towards a minimum, we take a step opposite to the gradient direction:
$$x_{n+1} = x_n - \alpha \nabla f(x_n)$$
where $\alpha > 0$ is the learning rate (also called the step size).
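To make the rule concrete, here is a minimal Python sketch of a single update step. The quadratic $f(x_1, x_2) = x_1^2 + 2x_2^2$, the current iterate, and the learning rate are illustrative assumptions, not part of the algorithm itself:

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative quadratic f(x1, x2) = x1^2 + 2*x2^2
    return np.array([2.0 * x[0], 4.0 * x[1]])

alpha = 0.1                  # learning rate (assumed)
x_n = np.array([3.0, -2.0])  # current iterate (assumed)

# Gradient descent update: x_{n+1} = x_n - alpha * grad f(x_n)
x_next = x_n - alpha * grad_f(x_n)
print(x_next)  # [ 2.4 -1.2]: one step closer to the minimizer at the origin
```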
Over successive iterations, provided the learning rate is chosen suitably and the function is well-behaved (e.g., convex), $x_n$ converges to a point $x^*$ where $\nabla f(x^*) = 0$, indicating a critical point, often a minimum.
Main Components:
I. First-Order Taylor Expansion:
Consider the first-order Taylor expansion of $f$ at $x_n$:
$$f(x_{n+1}) \approx f(x_n) + \nabla f(x_n)^\top (x_{n+1}-x_n)$$
To reduce $f$, we want $x_{n+1}$ to move in a direction that decreases this linear approximation. Among all steps of a fixed length, the term $\nabla f(x_n)^\top (x_{n+1}-x_n)$ is most negative when the step points opposite to $\nabla f(x_n)$.
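This approximation can be checked numerically. The sketch below compares the true value $f(x_{n+1})$ after one small step with the first-order prediction $f(x_n) - \alpha \|\nabla f(x_n)\|^2$; the quadratic test function and all constants are assumptions chosen for illustration:

```python
import numpy as np

def f(x):
    # Illustrative quadratic f(x1, x2) = x1^2 + 2*x2^2
    return x[0]**2 + 2.0 * x[1]**2

def grad_f(x):
    return np.array([2.0 * x[0], 4.0 * x[1]])

x = np.array([3.0, -2.0])
alpha = 1e-3  # small step, so the linear model should be accurate

g = grad_f(x)
actual = f(x - alpha * g)            # true value after one step
predicted = f(x) - alpha * g.dot(g)  # first-order Taylor prediction
print(actual, predicted)             # 16.900164 vs. 16.9
```

For small $\alpha$ the two values agree closely, which is why a small step along $-\nabla f(x_n)$ decreases $f$, at least to first order.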
II. Steepest Descent:
The direction of steepest descent is given by $-\nabla f(x_n)$. To decide how far to move in that direction, we introduce the learning rate $\alpha$. Thus, the next iterate is:
$$x_{n+1} = x_n - \alpha \nabla f(x_n)$$
III. Convergence Considerations:
Under suitable conditions (e.g., if $f$ is convex, its gradient is $L$-Lipschitz continuous, and $0 < \alpha \le 1/L$), the sequence $\{x_n\}$ generated by gradient descent converges to a global minimum:
$$\lim_{n \to \infty} x_n = x^*$$
where $x^*$ is a minimizer of $f$.
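The following sketch illustrates this behavior on a convex quadratic: after enough iterations, both the iterate and the gradient norm are close to zero. The function and constants are again assumptions made for the demonstration:

```python
import numpy as np

def grad_f(x):
    # Gradient of the convex quadratic f(x1, x2) = x1^2 + 2*x2^2
    return np.array([2.0 * x[0], 4.0 * x[1]])

x = np.array([3.0, -2.0])  # initial point (assumed)
alpha = 0.1                # small enough for this function

for _ in range(100):
    x = x - alpha * grad_f(x)

# Both the iterate and the gradient norm are now nearly zero
print(x, np.linalg.norm(grad_f(x)))
```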
Input:
A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, a starting point $x_0 \in \mathbb{R}^n$, a learning rate $\alpha > 0$, and a stopping tolerance $\epsilon > 0$ (or a maximum number of iterations).
Initialization:
Choose the initial point $x_0$ and set $n = 0$.
Iteration:
I. Compute the gradient at the current point:
$$g_n = \nabla f(x_n).$$
II. Update the point:
$$x_{n+1} = x_n - \alpha g_n.$$
III. Check for convergence:
If $\|g_n\| < \epsilon$ (or the maximum number of iterations has been reached), stop; otherwise set $n \leftarrow n + 1$ and return to step I.
Output:
The final iterate $x_{n+1}$, which approximates a minimizer $x^*$ of $f$.
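Putting the steps together, here is one possible implementation of the procedure. The function name `gradient_descent`, its default parameters, and the test problem in the usage line are assumptions made for illustration:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-8, max_iter=10_000):
    """Minimize f given its gradient, stopping when ||g_n|| < eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                # step I: gradient at the current point
        if np.linalg.norm(g) < eps:  # step III: convergence check
            break
        x = x - alpha * g            # step II: update
    return x

# Usage on the quadratic f(x1, x2) = x1^2 + 2*x2^2:
x_star = gradient_descent(lambda x: np.array([2.0 * x[0], 4.0 * x[1]]),
                          x0=[3.0, -2.0])
print(x_star)  # approximately [0, 0]
```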
Given Function:
$$f(x) = x^2.$$
We know $f$ is minimized at $x = 0$. Let’s apply gradient descent step-by-step to illustrate how the algorithm converges to $x=0$.
Setup:
Take the starting point $x_0 = 5$ and the learning rate $\alpha = 0.1$. The derivative of $f$ is $f'(x) = 2x$.
Iteration 1:
Update:
$$x_1 = x_0 - \alpha f'(x_0) = 5 - 0.1 \cdot 10 = 5 - 1 = 4.$$
Iteration 2:
Update:
$$x_2 = x_1 - \alpha f'(x_1) = 4 - 0.1 \cdot 8 = 4 - 0.8 = 3.2.$$
Iteration 3:
Update:
$$x_3 = x_2 - \alpha f'(x_2) = 3.2 - 0.1 \cdot 6.4 = 3.2 - 0.64 = 2.56.$$
Continuing this process, $x_n$ keeps getting closer to 0. In fact, each step multiplies the current point by a constant factor, since $x_{n+1} = x_n - 0.1 \cdot 2x_n = 0.8\,x_n$; hence $x_n = 5 \cdot 0.8^n$, which approaches the exact minimum at $x = 0$ as $n \to \infty$.
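The hand computation can be reproduced in a few lines of Python; the values come from the setup above, and the choice of five iterations is arbitrary:

```python
def f_prime(x):
    # Derivative of f(x) = x^2
    return 2.0 * x

x = 5.0      # starting point x_0
alpha = 0.1  # learning rate

for n in range(1, 6):
    x = x - alpha * f_prime(x)
    print(f"x_{n} = {x}")
# Prints 4.0, 3.2, 2.56, 2.048, 1.6384 -- matching the hand computation
```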