Fundamentals: Hessians, Jacobians, and Optimization
I’ve recently been talking to some people about some Deep Reinforcement Learning (deep RL) and while looking up papers and innovations (admittedly for blogging material), I realized that in keeping with the theme of the blog and why I made it, I’d need to start from some fundamentals so bear with me as I discuss these ideas and hopefully give you some intuitive understanding of these concepts.
From the papers I’ve read, the concept of integration isn’t mentioned often so I’ll stick to dicussing Hessians, Jacobians, and some basic Optimization in the context of derivatives, and Machine Learning. Ideally, at the end of this, if you read a paper that mentions one of the aforementioned topics, you’ll have a rough idea of:
1) what the concept is
2) WHY the concept was used in the first place
3) what using it can mean in the context of where they used it.
One of my heroes, Richard Feynman, has this anecdote about knowing the name of something vs knowing the something. Ideally by the end of this blog post, I’d like to impart the second to you.
Ideally, you’re someone who has taken:
1) Calculus (I only took Calc 2 so I don’t know what formal level you’d need, sorry)
2) Linear Algebra
and if so feel free to skip ahead to section 2) Useful Terms. If you’ve not taken those classes definitely read through the following sections where I try to build a solid foundation for continuing the discussion.
0) Derivatives 101
The equation below describes both the equation of a straight line, as well as what happens if you take the derivative of that straight line with respect to some input value:
Typically in a calculus class we’d talk about the rate of change w.r.t . In other words, how much does change as changes. In this case we see that changes by a factor of m as changes.
1) Linear Algebra 101
The numbers as you’re familiar with are called scalar values, e.g 5 and will serve as a building block for the rest of our discussion here.
It can be useful to discuss the scalars in another form: e.g you and I want to meet for coffee but we’re also lazy so we want to meet somewhere in the middle of us both
if I were on 4th street and 5th avenue (4,5) but you were on 10th street and 7th avenue (10, 7) we could meet at
(4 + 10) / 2 = 7
(5 + 7) / 2 = 6
which corresponds to 7th street and 6th avenue (7, 6).
Here I want to focus on two concepts:
1) the concept of abstraction on scalars
We saw in the computation above that it can be tedious to have to write out both equations: what if we had a compact way of representing my location, your location, as well as the operation of averaging to determine where we should meet?
where the top element (4 and 10) represent both our locations in terms of streets, and the bottoms represent our locations in terms of avenues. Congratualtions! We’ve just introduced the concept of vectors!
If we then go further and say that my location is some variable x, and your location is some variable x, we can talk about them in the following way:
where those variables can mean our locations, or really anything we want! Note, that if we add more information e.g: (4, 5, 1) where 1 describes the specific shop I’m in front of, or the corner I’m on we can just extend this idea without needing to write everything out as we did earlier without the abstraction. can be represented either horizontally or vertically, however, we note that it exists only in that shape. Contrast this with something like:
where we need to add another column to encompass the data of my group having two people, and your group having two people. This is called a matrix
Phew, that was a mouthful.
2) The concept of a coordinate system, and what it means for something to be in the coordinate system.
Earlier, we introduced the idea of representing my location in terms of a vector which we say is an element of (discussed in a bit)
Let’s break it down: basically means a whole number (e.g 4, 5 contrasted with which can mean 4.1, 4.2 and so forth). The superscript 2 means that we draw a pair of values from it (4,5) in my case. One important thing to consider is that (4,5) can describe a point on a map, if we added a third dimension (4,5,6) it just describes a point in 3-dimensional space where the values might describe the street, avenue, and height from sea level respectively.
See if you can guess what the following matrix is an element of (hint: it’s a pair of values)
And that’s it for the linear algebra you’ll need for the rest of this post!
2) Useful terms:
1) Vector-valued function: returns a vector
2) Matrix-valued function: returns a matrix
3) Tensor: a scalar value is a 0-order tensor, a vector is a 1-order tensor and a matrix is a 2-order tensor. The abstraction just goes up from there and we’ll come back to this idea later when we try to take the (f)) where our first Jacobian returns a matrix. Note: the inner Jacobian is a matrix-valued function
4) : basically means a point in n-dimensional space. For example, if you drew a cartesian map on a piece of paper, any point you pick on it has an x,y coordinate that describes the point. Thus, we can say that the point exists in . If you restrict the points to taking on “whole numbers” (typically called a Natural number), you can say that it exists in .
2.) Partial Derivatives
This section is a little awkward as it’s not covered in Calculus 101, however, discussing it is extremely important before broaching the topic
: the Jacobian is, in essence, the first derivative of some matrix.
I’ll discuss the concept below, as well as give you a concrete example to guide our discussion.
Example problem: given a single point of data about you (age, height, favorite food), we want to find out how likely it is that you’re in certain clubs: (reading, sleeping) we’ll reference this problem while discussing the Hessian as well.
The Jacobian describes how changing each of the example problems’ input dimensions (you weigh a little more or less, or your favorite food changes) affects each of those output dimensions (e.g: the linear-relation between what your favorite food is affecting how likely you’re in the sleeping club).
Given our 3 inputs, and our 2 outputs we’d have 6 pairs to look at (3 possible things to manipulate for each of those 2 outputs) which will come in handy if you read on.
1) Your input data, x from earlier is in,
2) You have some weight matrix, , that is in
3) Your output, , is in
4) If we were to put our classifier together, defining it by some function, f, it would look like this:
5) If we calculated the Jacobian of this, it would look along the lines of
where we’re iterating through the possible values of x (3 of them), and y(2 of them). This can be expressed more compactly as:
where is a vector describing the column which we partially differentiate.
If we think of our inputs as a point lying on some n-dimensional plane, we can think of our weights as some linear transformation that takes us from . What the Jacobian then gives us is a of how the points are warped in that small area.
Remember, our n-dimensional vector of features is just some point in n-dimensional space; if we take the ‘rate of change’ of something that transforms it into a m-dimensional space ( in our concrete case), we get the amount of linear transform that is happening in the small region around the point (I think it’s called a neighborhood?)
3.4) Where you might see it?
1) Loss function:
By imposing some restrictions on the neighborhood around a point, we can do some interesting work on making the values invariant (or close to) small changes in an area. if the explanation sounds a little hand-wavy, and you’d like a concrete example check out Hugo Larochelle’s Contractive Autoencoder video
2) Discussions about local linearity in non-linear settings:
For example, Neural Networks are famously non-convex (a topic we’ll talk about at a high-level further down, and something I’ll make a separate blog post about because I think convexity is fascinating) but trying to analyze it from a linear standpoint can be useful as well! If you need an example, watch the video above :)
: Basically the derivative of the Jacobian.
In the same way that the derivative of the derivative (2nd derivative) of a scalar describes whether the derivative is increasing or decreasing (convex and concave), the ‘matrix’ equivalent is a description of whether we’re at a local minima, local maxima, or if we haven’t learned anything new from studying that point.
Thus, it makes sense that instead of just using the Jacobian in our gradient descent, we’d also want to use the Hessian to determine where we want to “descend” on.
Remember, a local minima here doesn’t give you a global minima BECAUSE as discussed earlier, we’re just getting the curvature of the local neighborhood
Returning to our example from earlier about the club you’re likely to end up in we’re going to look at this in two different forms, the more succinct one here, and the less succinct one in the low-level section because I think it’s more confusing and requires that we think in terms of an abstract matrix (or tensor).
So, our Jacobian of our first function looks like this:
If we then take the Jacobian of THAT, we end up with the following
that wasn’t too bad, right? One thing to note is how we went up in dimensions: our input was a vector, and our output was a matrix this time (which was not unlike what happened with what we did initially before we made the notation more succinct).
Bear with me for a bit: if we were to expand out our into its components (), we’d need another axis to put them on. So, our Hessian from above would need another axis over which to ‘store’ them. In that case, we can think of adding a ‘depth’ component to our Hessian matrix. Still with me? I hope so because if you are, you’ll understand why:
1) I’m not going to actually list out the ‘matrix’
2) I’ll call the ‘expanded’ version a 3-order tensor
4.4) Other takeaways
We’ve discussed the Hessian, as well as how you list it out, but I think that there are some other concepts here not strictly in the realm of Machine Learning that should be highlighted:
1) If we can reformulate things in terms of other things we’ve already seen, we can use our generalizations that we’ve built off of. For example, when we changed the representation of the separate outputs () into which allowed us to use the same basic formula in the Jacobian, but for the Hessian. More specifically, in that we could represent our Hessian as a matrix (just as we represented our Jacobian) instead of as a 3-order tensor.
2) Or, to reword 1) into a more generalized idea, we can use our dimensions as ‘types’ and our functions as things that operate on other things (as opposed to just operating on values). This allows us to compose things together, as well as let our previous rules that we’ve established guide us easily. In computer science these two concepts are called higher order functions and typing.
4.5) Where might you see it?
1) Convexity of the loss function:
: if A is your matrix, then for any non-zero
if you calculate the Hessian of your loss function and see that it is positive semi-definite, then you know that it is convex which brings about a bunch of very attractive properties (like a guaranteed global minima)
One example of a convex loss function is Logistic Regression, where because we know it is a bowl shape (convex), we know that we are guaranteed global minima if we find the minima
2) Evaluation metric
Observed Information matrix where we’re looking at the negative Hessian of the log-likelihood function.
I’m not going deep into the details about this but if we have some estimated parameters (in statistics they often denote a vector of parameters to be ), one way of evaluating how well fits out data is:
(our loss function)
If we then take the negative Hessian of our loss function based on the vector it tells us how our loss varies as we manipulate different parameters
if we know the curvature of the surface, this can guide us in our gradient descent. There are also some papers that talk about using the diagonals of the Hessian to estimate the optimal learning rate as mentioned in A Robust Adaptive Stochastic Gradient Method for Deep Learning
Admittedly, I’ve not read the paper above in-depth and I’ve not read the papers referenced at all, but I’d wager that in the papers utilizing the diagonals of the Hessian, it allows them to weigh the importance of the different features as they do their gradient descent. I say this because not all features are equally informative so it doesn’t make sense to treat them all the same (especially since your error is typically just a scalar value that you propagate backwards). I may be completely wrong but this example stresses 2 things:
i) read the bloody paper
ii) intuition is only useful so long as it’s right, so it falls to you to make sure you’re correct. Having said that, knowing what to think of or having a general idea of what is happening is useful too
5) Optimization with and without the Hessian
If the words 2nd order Taylor approximation are new to you, do check out 3Blue1Brown’s video which will explain it far more elegantly and beautifully than I could ever hope to :)
5.0 Pre-cursor to Problems with the Hessian
First, it might be useful to discuss HOW we actually use it: the 2nd order Taylor approximation
Typically, when we ‘use’ the Hessian to guide our gradient descent, we’re not ACTUALLY using the Hessian, instead we’re using some approximation of it because calculating the Hessian is a giant pain. However, for the below discussion, let’s assume that we ARE calculating it (before we later go into discussions about optimizations that aim to approximate it, or avoid it completely).
Now that we have this, we progressively update our x with this value
1) Computing it works in order of , which is a problem when you think about how many parameters an algorithm can have. For context, the original AlphaGo had about 4.6 million parameters in its policy network (square that) and often these things are too big to fit into memory.
2) Note also the fact that up above in Newton’s method we need to calculate the inverse of the Hessian, which even in efficient algorithms is on the order of O(), and obviously for large matrices this is going to get messy very quickly.
3) If you’ve watched the video, you know that when we use the 2nd order Taylor approximation, we’re making the assumption that the surface is a quadratic which it might not be (3rd order would be far too computationally expensive, and I can’t think of an intuitive reason as to why we’d want to use it). Admittedly, our solutions below don’t side-step this problem but I believe it’s important to know the strengths and weaknesses of our methods.
5.2 Solution: Estimate … stuff?
Now that we have an idea about the problems surrounding calculating the Hessian, what methods can we use to get around actually calculating it? Numerical approximation is a big thing, isn’t it?
Note that I’m just going to give a VERY high level overview of these methods because I think they deserve articles of their own and to tell you that I can summarize them into a small section of a blog post would be a blatant lie. There are many ways to calculate the estimate of it.
5.2.2 Estimate the Hessian
In the paper I linked above, they mention how in some papers they use various methods to estimate the diagonals of the Hessian.
Depending on your use, you may not need the entire Hessian, so it makes sense that you’d want an approximation to calculate only as much as you need.
Fair warning, the next two bits get hairy VERY quickly and may require more than 1 reading.
5.2.3 Speed up the slow steps
1) BFGS: Broyden Fletcher Goldfarb Shanno
Reduces the complexity by solving Problem 2) in that it estimates the inverse of the Hessian. We still use the same 2nd order approximation, however, now we estimate the inverse Hessian.
Our objective function while estimating the inverse Hessian at time t+1 aims to do the following:
E = Inverse Hessian estimate
W = weights
f = objective function. Because we’re in the training step, we’re inputting W here, instead of some vector x
The equations above are from Reference 4 but reordered to be more in line with symbols we’ve previously described, or to facilitate an explanation. Let’s break this equation down from left to right:
where denotes that we take the weighted Frobenius Norm which makes the result scale-invariant Slide 11.
: normalizes each row (e.g we normalize all the heights of our example problem of calculating the club you would belong in). This way, all our data is on the same ‘scale’, thus preventing one single measurement from dominating everything else.
One silly example is if we chose to record our heights as micro-inches which makes the recorded values stupidly large to the point where they dominate all the other factors. In using this norm, we basically say that any relative amount of change in one factor, is as important as the changes in another factor
Unfortunately, I have no intuition for this although I can point you in the right direction: it’s called the secant condition (Googling didn’t lead me anywhere that enchanced my understanding)
2) L - BFGS: Limited-memory BFGS
Works by storing a low rank version of our estimate, E
: basically a large matrix that isn’t “all” useful. I.e we could remove some of the rows or columns and still be able to make up any matrix that the larger matrix could. Think of it like storing only the important information that tells us something new (that we could not get by combining other bits of information)
5.3 Solution: Forget the Hessian
Instead of calculating the Hessian, or its inverse, we can just use the Conjugate Gradient.
Admittedly, I’m not 100% clear on this but let’s see if my intuition is correct: if we have some hill that is convex (a bowl shape - although it doesn’t need to be a perfect bowl).
We now have a few variables,
M: a matrix that transforms your imperfect bowl into a perfect bowl
xList: the previous gradient(s) / list of gradients that we append old gradients to.
x’: the new gradient
The idea is that if we were to use M to transform our matrix from an imperfect bowl into a perfect bowl, x will take us as ‘far’ down as it can in one dimension. We append x to xList, our seen gradients. The next x, x, should be M-orthogonal to everything in xList (meaning that it might not be orthogonal in our imperfect bowl, but it should be orthogonal in our perfect bowl). Since all the elements in xList are mutually orthogonal, it makes sense that if we have to traverse n-dimensions, it would only require that we find n of such gradients, right?
In an ideal world this would be the case, however, not all problems are convex which means that we now have the problem of either finding such an M, or approximating it.
6) Standing on the Shoulders of Giants
I think that all knowledge comes from ideas, people who formalize those ideas, and people who are willing to check those ideas for themselves and offer constructive criticism. For that reason alone I’d like to thank:
7) Further Readings / References
1) Matrix Cookbook - Page 8-16 - I’m personally not a fan of recommending this off the bat as I think a collection of facts in itself isn’t useful except for as a reference.
2) Zico Kolter’s Linear Algebra Review and Reference - great professor at CMU and I found this guide to be very useful