This section builds off the last post, Fundamentals Part 1: An intuitive introduction to Calculus and Linear Algebra; if you’re not familiar with calculus or linear algebra, I highly recommend starting there. If this is your first time seeing all of this, know that this section is more involved than the first fundamentals post. Be prepared to feel a little lost, but if you keep at it, I know you’ll get there (it took me a while to wrap my head around)

For each of the topics covered, Jacobian and Hessian, I try to provide 3 levels of information: a high level, a mid-level, and a low level for you to review, depending on your level of interest.

0) x describes a scalar value, $\vec{x}$ describes a vector, and **X** describes a matrix.

1) Vector-valued function is a function that returns a vector.

2) Matrix-valued function is a function that returns a matrix.

3) Tensor: a scalar value is a 0-order tensor, a vector is a 1-order tensor, and a matrix is a 2-order tensor. For the purpose of most Machine Learning applications, a tensor is just an n-th order tensor (more abstractions). We’ll come back to this idea later when considering the not-yet-defined Jacobian and Hessian.

4) $\mathbb{R}^n$: basically means a point in n-dimensional space. For example, if you drew a Cartesian map, any point you pick has an (x,y) coordinate that describes it. Thus, we can say that the point exists in $\mathbb{R}^2$. If you restrict the points to taking on “whole numbers” (aka Natural numbers, or counting numbers), you can say that it exists in $\mathbb{N}^2$.

This section is a little awkward as it’s not covered in Calculus 101; however, discussing it is extremely important before broaching the rest of the blog post. A partial derivative is basically a derivative of a “part” of a multivariable function, i.e., we take the derivative along a single dimension while keeping all others constant.

A Jacobian and a Hessian are just derivatives of the first and second-order multivariate functions, i.e., applied once and twice, respectively.

The Jacobian is, in essence, the first derivative of some tensor. We begin with the following example: given a single point of data about you (age, height, favorite food), we want to find out how likely it is that you’re in certain clubs: (reading, sleeping) we’ll reference this problem while discussing the Hessian as well.

The Jacobian describes how changing each of the input dimensions affects each of the output dimensions. Looking at our example, changing how much we weigh or how our favorite food changes, we can observe the linear effect on our clubs.

Given our 3 input dimensions and our 2 output dimensions, we’d have 6 pairs to look at (3 possible things to manipulate for each of those 2 outputs). This intuition will come in handy if you read on.

Consider our example from earlier:

1) Your input data, **X** $\in \mathbb{R}^{1 \times 3}$.

2) You have some weight matrix, **W** $\in \mathbb{R}^{3 \times 2}$

3) Your output, $\vec{y}$ $\in \mathbb{R}^2$

4) If we were to put some classifier algorithm, defining it by some function, $f$, it would look like this:

\[\vec{y} = f(\vec{x}) = W\vec{x}\]5) If we calculated the Jacobian of this, it would look along the lines of

\[\textbf{J}(f) = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3}\\ \end{pmatrix}\]where we’re iterating through each dimension of **X** (3 of them), and $\vec{y}$ (2 of them). This can be expressed more compactly as:

where $\vec{y}$ is a vector describing the vector which we partially differentiate.

If we think of our inputs as a point lying on some n-dimensional plane, we can think of our weights as some linear transformation, $f$, that takes us from our current point $\mathbb{x} \rightarrow \mathbb{y}$. What the Jacobian then gives us is the best $\textbf{linear local estimation}$ of how the points are warped in that small area.

Remember, our n-dimensional vector $\in \mathbb{R}^{1 \times 3}$ of features is just some point in n-dimensional space. If we take the ‘rate of change’ of something that transforms it, **W**, in our concrete case, into an m-dimensional space, we get the amount of linear transform in the small region around the point.

1) Loss function:

By imposing some restrictions on the neighborhood around a point, we can do some interesting work on making the values invariant (or close to) small changes in an area. if the explanation sounds a little hand-wavy, and you’d like a concrete example, check out Hugo Larochelle’s Contractive Autoencoder video.

2) Discussions about local linearity in non-linear settings:

Neural Networks are known to be non-convex, but analyzing them from a linear standpoint can still be useful. I’d suggest watching the video above for an example of how it can be beneficial to analyze in this way.

**Note**: I’m hoping to talk about convexity down the line as it is a fascinating topic.

The Hessian is essentially the derivative of the Jacobian.

In calculus 1, you might have learned that the derivative describes the rate of change, and the derivative of the derivative describes the maximum/ minimum. The Hessian is the equivalent of the concept mentioned above but applied to N-dimensional tensors in an abstract sense.

Recall our Jacobian function from earlier:

\[\textbf{J} (f) = \nabla f \begin{pmatrix} \frac{\partial \vec{y}}{\partial x_1} & \frac{\partial \vec{y}}{\partial x_2} & \frac{\partial \vec{y}}{\partial x_3}\\ \end{pmatrix}\]If we then take the Jacobian of THAT, we end up with the following:

\[\textbf{J}(\textbf{J} (f)) = \nabla (\nabla f) \begin{pmatrix} \frac{\partial^2 \vec{y}}{\partial x_1^2} & \frac{\partial^2 \vec{y}}{\partial x_2 \partial x_1 } & \frac{\partial^2 \vec{y}}{\partial x_3 \partial x_1}\\ \frac{\partial^2 \vec{y}}{\partial x_1 \partial x_2 } & \frac{\partial^2 \vec{y}}{\partial x_2^2} & \frac{\partial^2 \vec{y}}{\partial x_3 \partial x_2 }\\ \frac{\partial^2 \vec{y}}{\partial x_1 \partial x_3} & \frac{\partial^2 \vec{y}}{\partial x_2 \partial x_3 } & \frac{\partial^2 \vec{y}}{\partial x_3^2}\\ \end{pmatrix}\]An interesting tidbit that the eagle-eyed among you may have realized is that we went up in dimensions from a compact vector representation to a compact matrix representation. Intuitively this makes sense as we are now varying our variables on our variables (hence the denominators like $\partial x_1 \partial x_2$)

Bear with me for a bit. If we were to expand out our $\vec{y}$ into its components ($y_1, y_2$), we’d need another axis to put them on. So, our Hessian from above would need to “expand” into another dimension to store them. Still with me? I hope so because if you are, you’ll understand why:

1) I’m not going to actually list out the ‘tensor.’

2) I’ll call the ‘expanded’ version a 3-order tensor

3) When we differentiate a vector with regards to a vector, we increase dimensionality. See Old and New Matrix Algebra Useful for Statistics for a summary of the different forms of differentiation. Also, Wikipedia: Matrix Calculus Layout Conventions has some interesting notes.

**Positive Semi-definite**: if A is your matrix, then for any non-zero $\vec{x}$ \(\vec{x}^T A \vec{x} \geq 0\)

If we calculate the Hessian of our loss function (I’d suggest going online and working through one of the proofs), we see that it is positive semidefinite, which means that it is convex. This brings about the property of a guaranteed global minimum; reaching it in a method like gradient descent is another matter.

One example of a convex loss function is the logistic, where because we know it is a bowl shape (convex), we know that we are guaranteed global minima if we find the minima

Observed Information matrix where we’re looking at the negative Hessian of the log-likelihood function.

I’m not going deep into the details, but if we have some estimated parameters $\theta$ (also called the weights in ML) , one way of evaluating how well $\theta$ fits our data is by first taking the **log-likelihood**:

Taking the negative Hessian of our log-likelihood tells us how our loss varies as we manipulate different parameters.

If we know the curvature of the surface, this can guide us in our gradient descent. Some papers talk about using the diagonals of the Hessian to estimate the optimal learning rate as mentioned in A Robust Adaptive Stochastic Gradient Method for Deep Learning

**Intuition**

Admittedly, I’ve not read the paper above in-depth, and I’ve not read the papers referenced at all. Still, I’d wager that utilizing the diagonals of the Hessian allows them to weigh the importance of the different features as they make their gradient descent. I say this because not all features are equally informative, so it doesn’t make sense to treat them equally (especially since your error is typically just a scalar value that you propagate backward). I may be completely wrong, but this example stresses 2 things:

i) read the paper

ii) intuition is only helpful so long as it’s right, so it falls to you to make sure you’re correct.

1) Matrix Cookbook - Page 8-16 - I’m personally not a fan of recommending this off the bat as I think a collection of facts in itself isn’t useful except for as a reference.

2) Zico Kolter’s Linear Algebra Review and Reference - great professor at CMU, and I found this guide to be very useful.

4) Old and New Matrix Algebra Useful for Statistics