Optimization methods play a critical role in machine learning algorithms. In this article we try to go through different gradient methods that are used for searching for the optimization value of a target function.

0. Target Function

To illustrate what a target function is, we use linear regression as an example. For 2 dimensional problems, our goal is to get a line that best fits all the data points. The line is represented by the following function: so that such function is minimized: Here we have m points (instances), and J is the target function we want to minimize so that we can get the function h, and h is the line that best fits the data.

The above problem has only one feature x. In fact we can have multiple features. Suppose we have m instances and n features, then the function h we want to get becomes And the target function becomes: Now, how do we minimize the target functions?

The easiest way to optimize the target function is by using a method called Gradient Descent. The idea is to keep moving along or in the opposite of gradient direction so that the direction is of maximal increase or decrease at that point, and hopefully we can get to the optimization. To use gradient descent method, for the above cost function J, for each of the n features we repeat the following step until convergence: Here alpha is called the learning rate , which decides how much we go along the gradient direction during each optimization step.

Now if we still use linear regression as example, plug in the n-dimension version of cost function J, we get the result: Here we give some tricks for this method.

a. Make sure features are on a similar scale

Approximately we want -1 <= x(i) <= 1. We can accept 0 <= x <= 3 or -2 <= x <= 0.5, however cases like -100 <= x <= 100 or -0.0001 <= x <= 0.0001 are called 'poorly scaled'

b. Use mean normalization

We prefer to replace each feature in the following way to make features have approximately 0 mean: Where u is the average value of x in the training set, and s is the range (max - min) or standard deviation.

c. Tune the learning rate

We want to make sure gradient descent is working OK. The value of the cost function J should decrease after each iteration. If this is not the case, for example J increases or vibrates during iterations then we should consider use smaller learning rate. However with very small learning rate Gradient Descent can be very slow to converge.

Gradient Descent is a first-order method because we only use the first derivatives of the function. This is a limitation because we neglect the curvature of the surface of the cost function, and under some circumstances the 'zig-zagging' effect will cause the convergence to be very slow for gradient descent. Next let's talk about Newton's Method which makes use of second-order derivatives. 2. Newton's Method

Before we talk about Newton's method that uses second derivatives, we here review some mathematical concepts that will help us. Note that we do not use any specific cost function now, and denote all the functions we are optimizing as f(x). The function has n variables x1, x2, …, xn, and we denote it as a vector x. b. Hessian Matrix (Second Derivative) For quadratic function, if we write the function in the following format: Then we can prove the gradient of f(x) is Ax + b, and its Hessian Matrix is exactly A.

Now let's talk about Newton's method. Suppose at some step k we arrive at point x(k). We approximate the target function with a second order Taylor expansion about this point: To choose x for the step k+1, we take derivative of such expansion with respect to x and set it to 0: We can get: We introduce the learning rate alpha, and get the Newton's method: For Newton's method, learning rate describes how much we go on the chosen direction, and can be set each time so that the destination gives minimum for the target function along this direction. This is called 'Line Search': Compared to Gradient Descent which just advances towards the minimum, Newton's method steps towards the global minimum under the assumption that the function is actually quadratic and its second order expansion is a good approximation. This allows it to make big steps in low-curvature scenarios (where f′′( x(k) ) is small) and small steps in high-curvature scenarios, so that it may perform better than simple gradient descent that just moves against the gradient.

There are problems for Newton's method:

a. Hessian Matrix needs to be reversible, apparently.

b. If point x(k) is too far away from local minimum, the second order approximation in the Taylor expansion may not be good.

c. Practically, if we have n dimensions, Hessian Matrix is NxN and we need O(N*N) storage space, and also need to compute the reverse of the Hessian Matrix. This takes a lot of storage and computation resources and sometimes is not practical.

In order to solve problem c above, people came up with a set of solutions called Quasi Newton's Methods.

3. Quasi Newton's Methods

We do Taylor Expansion around x(k+1): And we take gradient on both sides: Take x=x(k) in the above equation, we thus have: So that: People use the above equation to come up with many quasi-Newton algorithms to calculate the reverse of Hessian.

a. DFP Formula

William C. Davidon, Roger Fletcher, and Michael J. D. Powell came up with the formula to estimate the reverse of Hessian (we neglect the formula derivation here): b. BFGS Algorithm

Broyden, Fletcher, Goldfarb and Shanno came up with a better formula to estimate Hessian first, and then use Sherman-Morrison formula to estimate the reverse of Hessian: However, even we now do not need to compute the matrix reverse operation, we still have to store D(k) during our computation, which is a NxN matrix. Such matrix can take up quite big space.

c. L-BFGS Algorithm

People improve the BFGS algorithm to a limited memory version called L-BFGS algorithm. In this algorithm, we only store the m newest {s(i)} and {y(i)}, so storage reduces from O(N*N) to O(m*N).

Suppose we already calculated D(m) as of function of s(0) , …, s(m-1), y(0), …, y(m-1). When we now calculate D(m+1), we will discard s(0) and y(0), and add in s(m) and y(m) into the storage, and approximately calculate D(m+1) as a function of s(1), …, s(m), y(1), …, y(m).

More generally, we approximately calculate D(n) so that: In this way we can use O(m*N) storage to calculate D for the each optimization iteration step as a replacement of the reverse of Hessian in our Newton's method.

m is a parameter that we can set depending on the computation memory. We don't cover the tedious derivation of the function g in this article.

Conjugate Gradient can be considered as an optimization method in between Gradient Descent and Newton's Method. In CG we only need to calculate the gradient instead of Hessian and its reverse, and it performs faster than Gradient Descent method. Our intuition is that, when we decide the direction d(1), we move to the minimum position along this direction x(1) using line search, and decide next direction d(2), such that no matter how far we go on d(2), the gradient on d(1) will not change. This will ensure that the effort we make during line search step on d(1) is not ruined by the movement on d(2).

So, what is the relationship between d(1) and d(2)?

We use quadratic function as an example. Suppose x´ is the local minimum, we do Taylor expansion to the second order and calculate the gradient: Since x´ is the local minimum, the gradient on that point is 0. We calculate the normal vector at point x(1): At x(1) the normal vector is perpendicular to any vector on the isohypsic surface, and d(1) in on the surface, so we have: Plug in the previous equation we have We can see that d(1) and d(2) are conjugate around matrix A.

So for Conjugate Gradient, generally we have current location x(k) and direction d(k), we first get x(k+1) along the direction of d(k) using line search: Here g(k) is the gradient of f(x) at x(k).

Next we try to find d(k+1). Suppose: We remember the conjugate relationship between d(k) and d(k+1), so we have: Thus we get: In this way we calculate d(k+1), and repeat the process until g(k)=0.

For N dimensional quadratic function, theoretically we can get the minimum after N iterations. For non-linear functions, we also use the above algorithm and repeat the steps, during which step we actually approximate the local surface with a quadratic surface.

Note that for the Conjugate Gradient to work, using line search to find x(k+1) from x(k) is a must. A constant 'learning rate' will not work for this algorithm.

For non-linear functions, matrix A changes has to be calculated at each step. There are many solutions to approximate A. One is called Fletcher Reeves (FR) formula which directly calculate Lamda and Beta according to the following equation:  5. Hessian-Free Optimization

Now we can talk about the more superior Hessian-Free optimization.

From the current point x(k), we use Taylor expansion about x(k) to approximate the original function: For each Phi(k), we use Conjugate Gradient to calculate x(k+1), and repeat until x(n) converge.

But why is such method called 'Hessian-Free', since it seems we still have to calculate Hessian in the last item of the above equation?

In fact we do not need to calculate the exact Hessian when it is multiplied by another vector. Let's consider the i-th row of Hv, where H is the NxN Hessian matrix and v is a Nx1 vector: So Hv can be easily calculated as: In this way we do not need to calculate Hessian directly, but still get to make use of Hessian for our optimization.

REFERENCE