Cameron Churchwell

The Matrix/Tensor Calculus of Neural Networks

1. Background

In sophomore year of high school I took a course called Advanced App Development, a course which was essentially a series of small projects with varying levels of guidance. It was a great class, and I loved that it could be as easy or as challenging as I wanted.

For one project, our teacher gave us a version of Ms. PacMan which had been coded in Java. We did a bit of exploring in the project, but then Mr. Ruth turned us loose with the only initial instruction being to do something cool with it.

Not long before this I had watched a great video by SethBling, a former Microsoft employee. In the video he describes an AI he wrote which learns to play Mario games. I'm sure you've already figured out where my mind went when I was trying to think of a PacMan related project, but it's a little more nuanced than that.

I started watching all of the videos I could find related to AI and machine learning (terms which were indistinguishable to me at the time). It was not long before I found this video series by 3Blue1Brown, but after trying to watch it three times, I was still no closer to understanding neural networks.

I soldiered on, however, and eventually decided that it would be more feasible to use an API, so I started researching TensorFlow. I still didn't understand that either, and it probably didn't help that I had no idea what a tensor was.

So for the second time, I lowered the difficulty level and ended up using Keras, an even higher level API. The challenge then was in feature selection and connecting the neural network (which was in Python) to the game (which was in Java). Eventually I ended up using secure sockets to communicate (with the network acting as a server and the game as a client).

Next I recorded a bunch of my own gameplay by saving the features I had chosen to a text file, and I trained the network. It was ok. The AI played better than a random player by a significant amount, but it was nowhere near as good as I was, and even that wouldn't be saying much.

But to me that doesn't matter, and when I think back about that project now it brings a smile to my face because it reminds me of how far I've come.

2. Take Two

The next time I decided to work on something related to neural networks was the summer before my senior year of high school. Using a neural network had been fun, but for me to truly understand something I usually have to at least partially recreate it so I can try to understand all of the design decisions which make it possible.

I returned to the video series by 3Blue1Brown and started taking detailed notes, this time armed with knowledge of calculus. For a week I struggled, making no progress towards understanding backpropagation until I had a "Eureka!" moment.

For some reason it just hadn't clicked that you could use the partial derivative with respect to the activations of one layer to get the partial derivative with respect to the activations of the previous layer. Once I had that down, everything else followed easily.

At least until I tried to build one, that is. You see, I hate doing things halfway, and while that extra ambition to dive straight into something can be valuable, in this case it held me back from success. I wanted to make not just a network, but a framework for creating networks like Keras.

At this point I hadn't learned linear algebra (that would wait until the coming school year) so I did everything imperatively with loops. Somewhere along the line I made a mistake, and when I took a step back and looked at the code I had written, I knew it would be almost impossible to find it.

This was when I learned one of the most valuable lessons I believe you can learn: when working with complex ideas you should try to keep your code simple. I hadn't blocked my code into different functions, nor was my code particularly well commented, and the biggest problem (in retrospect) was that I had rushed into writing code before I had tried to learn about matrices, using the crutch of for-loops to save me from having to learn anything else.

3. More Math, Less Code

Fast forward about half a year, and I was sitting in my room alone, unsure of what to do. The school musical, Matilda, in which I had played trombone, had been canceled, and in fact all of school had been canceled. Like many people during the fledgling Covid-19 pandemic, I found myself with quite a bit more free time, a jarring contrast following the most intense week of my year.

Since I was still in high-gear but had little to no school work and no musical to practice endlessly for, I threw myself back into my neural network explorations. This time I had not only calculus but also half of linear algebra under my belt, and I would not be defeated a second time.

Whew, that was a lot of background, but I hope you found it entertaining. Now onto the real stuff.

3.1 A Gentle Introduction to Vector Calculus

While we were learning about matrix-vector multiplication, I couldn't help but realize that it was exactly how weight multiplication works in neural networks. Multiplying an MxN weight matrix by an Nx1 activation vector and adding an Mx1 bias vector is the basic engine of the neural network. Or maybe it would be more accurate to say that it is a single piston in the metaphorical engine.

It didn't take long for me to rewrite (and mentally redefine) neural networks and their forward operations in terms of matrices. Slap on a non-linear activation function and you're done.

A quick note to those who might be curious about why you need a non-linear activation function. There are actually several reasons, but the simplest can be discovered by doing the following:

  1. Take two linear functions like a = bc + d and y = wx + z where b, d, w, and z are constants as you would expect.
  2. Substitute y for c, and simplify. You get a = b(wx + z) + d = wbx + bz + d.
  3. Recognize that (wb) and (bz + d) are both constants. Rewrite them as j = wb and k = bz + d.
  4. Our new function is a = jx + k, which is just a linear function like the one we started with! So then what would be the point of compound linear functions? You've done all of these computations but instead you could have just written one linear function!

A neural network using linear activation functions, no matter how many layers it has, might as well only have one layer because (as you have seen) compounding linear functions results in a single linear function. You would spend tons of resources training the weights and biases for all the different layers only for your model to be worse or equal to a single layer network.
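To make this concrete, here's a quick NumPy sketch (the sizes and values are arbitrary, just for illustration) showing that two stacked linear layers collapse into one equivalent linear layer:

import numpy as np

rng = np.random.default_rng(0)

# two "layers" with no non-linear activation in between
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

# feeding the input through both layers...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...gives the same answer as a single layer with W = W2 W1 and b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True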

Getting back on track, forward operations in terms of matrices are easy, but what about backpropagation? Well, normally the way we do that is through taking derivatives, or rather partial derivatives. Can we still do that with matrices?

The criteria for a function to be differentiable on an interval are that it exists, is continuous, and has no corners or cusps. Basically, the limit defining its derivative must exist at every point. Linear functions like the ones discussed above are differentiable because they meet these criteria. They are continuous everywhere and they do not have any corners or cusps. They are, in fact, straight lines. Revolutionary, I know.

So if we could somehow rewrite these neural network matrix operations in terms of "normal" linear functions, it would be rather straightforward to prove that the corresponding matrix operations are differentiable, right?

I've actually found this to be a running theme in this field. Dealing with things like matrices and vectors can be tough, so it's nicer to rewrite things in terms of summations. Here's how:

$$Z = WA + B \qquad\Longleftrightarrow\qquad \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_m \end{bmatrix} = \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1n} \\ W_{21} & W_{22} & \cdots & W_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ W_{m1} & W_{m2} & \cdots & W_{mn} \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{bmatrix} + \begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_m \end{bmatrix}$$

This matrix and vector operation can be rewritten in terms of only scalar values through the use of a summation:

$$Z_j = \sum_{i=1}^{n} W_{ji}\, A_i + B_j$$

Let's take a second to unpack what this means. The key to understanding this notation is realizing that the left hand side of the equation represents the j-th element of the vector Z, so the right hand side of the equation represents the calculation necessary to find the j-th element of Z.

Try to find the first element of Z using normal matrix operations such as those shown in the first equation above. Due to the variably defined size of the matrix W, you'll probably want to use a summation when "dotting" the first row of W with A. Then you just add the first element of B, and you're done!

What you've just created should look very similar to the second equation above, except every j has been replaced by a 1. If you were to repeat this process trying to find the m-th element of Z, you would find that the underlying operations are exactly the same. If you then try to find any element of Z, which we say is the j-th element, you would derive exactly that second equation.

So while our equation only shows the calculations necessary to find one element of Z, we could use it to find any element of Z. If we wanted to we could write Z like this:

$$Z = \begin{bmatrix} \sum_{i=1}^{n} W_{1i}\, A_i + B_1 \\ \sum_{i=1}^{n} W_{2i}\, A_i + B_2 \\ \vdots \\ \sum_{i=1}^{n} W_{mi}\, A_i + B_m \end{bmatrix}$$

This isn't as space efficient and would look even more confusing, so we won't be using this notation.
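If you'd like to check the equivalence numerically, here is the same computation done both ways in NumPy (a throwaway sketch with made-up sizes): the summation form builds Z one element at a time and lands on exactly the same vector as the matrix form.

import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
W = rng.standard_normal((m, n))
A = rng.standard_normal(n)
B = rng.standard_normal(m)

# matrix form: Z = WA + B
Z_matrix = W @ A + B

# summation form: Z_j = sum over i of W_ji * A_i, plus B_j
Z_summation = np.empty(m)
for j in range(m):
    Z_summation[j] = sum(W[j, i] * A[i] for i in range(n)) + B[j]

print(np.allclose(Z_matrix, Z_summation))  # True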

If you remember we are trying to find a way to take partial derivatives of these kinds of operations (matrix and vector operations), and now that we have an equation which is in terms of scalar values (numbers, not vectors or matrices), we can do this easily! All we need to remember is that derivative operators distribute over summation. For the sake of example, let's try the easiest partial derivative, the partial derivative with respect to the k-th element of B.

$$\begin{aligned} \frac{\partial Z_j}{\partial B_k} &= \frac{\partial}{\partial B_k}\left(\sum_{i=1}^{n} W_{ji}\, A_i + B_j\right) \\ &= \sum_{i=1}^{n}\left(\frac{\partial W_{ji}}{\partial B_k}\, A_i + W_{ji}\,\frac{\partial A_i}{\partial B_k}\right) + \frac{\partial B_j}{\partial B_k} \end{aligned}$$

Are your eyes bleeding yet? Don't worry, we'll introduce a much better notation soon, but in the meantime let's break down what we just solved for. The last line represents the calculations you would have to do in order to calculate the partial derivative of the j-th element of Z with respect to the k-th element of B. Taking a step back, this is part of the bias gradient that we need to find in order to train the biases of our neural network.

Notice that all of the partial derivatives act on known scalar quantities/variables (like the ji-th element of W) and are with respect to known scalar quantities/variables (like the k-th element of B). These aren't matrices and vectors anymore, these are numbers.

Even so we are left with two problems. First, how do we take the derivative of something like the j-th element of B with respect to the k-th element of B? Well, it's actually not as hard as it sounds! Each element of B is essentially a different variable. The partial derivative of one variable with respect to a different variable is zero (remember that's how partial derivatives work), and the partial derivative of a variable with respect to itself is one. See? Easy!

$$\frac{\partial B_j}{\partial B_k} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases}$$

If both j and k are the same, then we are referencing the same entry in B, and thus the same variable, meaning we are taking the partial derivative of a variable with respect to itself, which is always one.

I've used the vector B here as an example, but this would be true for any vector. There's actually a name for this, but we'll get into that a bit later. For now let's just try to figure out how we would calculate the other partial derivatives:

$$\frac{\partial A_i}{\partial B_k} = 0 \qquad\qquad \frac{\partial W_{ji}}{\partial B_k} = 0$$

The partial derivative of the i-th element of A with respect to the k-th element of B is actually even easier because you're never going to be taking the derivative of a "variable" with respect to itself (regardless of what i and k are, the elements of A and B are unrelated), so the result will always be zero. The same is also true for the partial derivative of the ji-th element of W with respect to the k-th element of B; this value will also always be zero!

So then we can actually simplify the above equation even further:

$$\frac{\partial Z_j}{\partial B_k} = \frac{\partial B_j}{\partial B_k} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases}$$

This is great, but it isn't quite what I promised. You can now find the partial derivative of any element of Z with respect to any element of B. If we just needed to implement this one partial derivative it would be easy: we'd just need a double for-loop to iterate over the indices j and k. But what actually is the partial derivative of Z with respect to B? Not the partial derivative with respect to specific elements, but the vectors themselves? Can you take the derivative of something like a vector with respect to a vector?
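As an aside, that double for-loop really is all it would take for this one derivative. A tiny sketch (the size m is arbitrary):

import numpy as np

m = 4  # number of elements in Z (and in B)

# dZ_dB[j, k] holds the partial derivative of Z_j with respect to B_k
dZ_dB = np.zeros((m, m))
for j in range(m):
    for k in range(m):
        dZ_dB[j, k] = 1.0 if j == k else 0.0

print(dZ_dB)  # ones on the diagonal, zeros everywhere else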

For me this was the big question going into this project, and the one which took the longest to answer. Ultimately the answer is yes; you can take the derivative of any tensor with respect to any other tensor.

3.2 Wait, What's a Tensor?

First I should explain what a tensor is. You likely already know what scalars are: they're numbers like 3, 1005, x, 5y, and the result of operations like 6 + 3 or 712(86). Something that can be described as a single value is a scalar, as opposed to a vector, which we usually describe in terms of two or more values.

The more stringent definition of a scalar is that it is a member of a field, while vectors are members of vector spaces. Every field can be viewed as a vector space (over itself), but not every vector space is a field. In this way vector spaces are a kind of generalization of fields which have to meet fewer requirements. This is similar to the whole, "All squares are rectangles, but not all rectangles are squares," thing.

If you think about it, you could have a vector with only one element, and it would behave indistinguishably from a scalar. In much the same way, you could have a matrix with only one column and it would behave identically to a column vector. See where this is going?

So if squares are scalars, rectangles are vectors, and parallelograms are matrices, what are quadrilaterals in our analogy? Tensors! A tensor is a generalization of matrices in the same way that a matrix is a generalization of vectors in the same way that a vector is a generalization of scalars.

Here's another way of looking at it. Remember earlier when I wrote out all those vectors and matrices with subscript indices? It should be rather obvious why the vectors have only one index (the elements are laid out along a line) while the matrices have two indices (the elements are laid out in a grid). So then do tensors have three indices? Is that the connection?

Close. A tensor can actually have any whole number of indices, from zero on up. This concept can be confusing at first because it's seemingly impossible for us to picture a tensor with more than three indices. Here's another analogy you can try, but you won't get far with it. Think of a scalar as a point, a vector as a line, a matrix as a square, and a 3rd order tensor (one with three indices) as a cube. What comes after a cube?

There's a much better way to think about higher order tensors than trying to picture them in terms of geometrical objects like in the above analogy. Think about a vector as a collection of scalars:

$$A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{bmatrix}$$

Now think about a matrix (a 2nd order tensor) as a collection of vectors, or perhaps a vector of vectors:

$$M = \begin{bmatrix} \mathbf{m}_1 \\ \mathbf{m}_2 \\ \vdots \\ \mathbf{m}_k \end{bmatrix} \qquad \text{where} \qquad \mathbf{m}_i = \begin{bmatrix} M_{i1} & M_{i2} & \cdots & M_{in} \end{bmatrix}$$

Under this model a 3rd order tensor is no longer like a cube of numbers, but simply a vector of matrices, or a vector of vectors of vectors. Maybe that doesn't sound simpler at first, but it gives us a way to talk about higher order tensors as vectors of smaller tensors which are in turn vectors of smaller tensors, and so on. If you're still confused about this, look at the way NumPy deals with tensors with its array function. In the examples at the bottom of their page you can see how a 2x2 matrix is represented as an array of arrays.
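If you have NumPy handy, you can see this nesting directly. A quick illustrative snippet:

import numpy as np

scalar = np.array(5)                    # 0 indices: a 0th order tensor
vector = np.array([1, 2, 3])            # 1 index: a vector of scalars
matrix = np.array([[1, 2], [3, 4]])     # 2 indices: a vector of vectors
tensor3 = np.array([[[1, 2], [3, 4]],
                    [[5, 6], [7, 8]]])  # 3 indices: a vector of matrices

for t in (scalar, vector, matrix, tensor3):
    print(t.ndim, t.shape)  # ndim is the order, i.e. the number of indices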

3.3 Tensor Derivatives

Now that we know more-or-less what a tensor is, we can start talking about tensor derivatives. In section 3.1 we went over one of the simpler tensor derivatives: the derivative of a vector (1st order tensor) with respect to itself. We also touched briefly on the derivative of a vector with respect to an unrelated vector, as well as the derivative of a matrix (2nd order tensor) with respect to an unrelated vector.

Before we can get into the well and truly complicated stuff, we have to take one more step back. If you've already taken a multivariable calculus course then you probably learned about gradients. When you have a function with several independent variables (inputs) and one dependent variable (output), you can find what's called the gradient, which is a vector.

Each element of the gradient vector is the partial derivative of the dependent variable with respect to a different independent variable. Here's what this might look like with three independent variables:

$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} & \dfrac{\partial f}{\partial z} \end{bmatrix}$$

Why is this vector horizontal when all of the other vectors have been vertical until now? There's actually a really interesting answer behind that question having to do with the gradient vector being a covariant vector (or covector) as opposed to all of the contravariant vectors we have discussed so far. You can read more about this here because I don't fully understand it, and this distinction can be avoided in our case.

This form is usually how the gradient looks in multivariable calculus courses which are still oriented around physics students who likely won't have to work with more than three dimensions, but it is less useful to us. Rarely will our layers only have three or fewer neurons, so we need a more general way of looking at gradients which does not rely on variables like x, y, and z.

We have said that the function f has three independent, scalar variables, but we could just as easily redefine f to be a function of a single independent, vector variable. Let's try this:

$$\nabla f(v) = \frac{\partial f}{\partial v_i} = \begin{bmatrix} \dfrac{\partial f}{\partial v_1} & \dfrac{\partial f}{\partial v_2} & \dfrac{\partial f}{\partial v_3} \end{bmatrix}$$

The subscript i should remind you of how we wrote vectors earlier in terms of accessing their values with an index. The idea is that instead of writing a partial derivative for each component of v (like we did with x, y, and z) we instead write one generalized partial derivative. This generalized partial derivative then represents the vector of partial derivatives with respect to each component of v.

How can we tell that this generalized partial derivative is a vector? It has a single index ( i ), meaning it is a first order tensor. To reiterate, the number of indices tells you the order of the tensor. It doesn't matter how many values are in the set i represents. In this case i ranges from 1 to 3, but we could just as easily have it range from 1 to 1000.

So that's the gradient in a nutshell. You could say that it represents the derivative of a scalar with respect to a vector. It will have as many elements as there are elements in the vector you are taking the derivative with respect to.
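One way to convince yourself of this is to approximate a gradient numerically. Here is a rough finite-difference sketch (the function f below is just a made-up example):

import numpy as np

def f(v):
    # an arbitrary scalar-valued function of a vector input
    return v[0] ** 2 + 3.0 * v[1] * v[2]

def numerical_gradient(f, v, h=1e-6):
    grad = np.zeros_like(v)
    for i in range(v.size):
        bumped = v.copy()
        bumped[i] += h
        grad[i] = (f(bumped) - f(v)) / h  # partial derivative with respect to v_i
    return grad

v = np.array([1.0, 2.0, 3.0])
print(numerical_gradient(f, v))  # approximately [2, 9, 6]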

Next up is the Jacobian matrix. When I first read about the Jacobian it blew my mind. Let's say that instead of having a function with one scalar dependent variable (the output), you had a function with a vector dependent variable. With three independent and three dependent scalar variables, that might look something like this:

$$r = f(v) \qquad\Longleftrightarrow\qquad \begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = f\!\left(\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}\right)$$

What might this function look like? Maybe it just adds one to every input. Maybe it multiplies each input by a different number. Maybe it multiplies the input vector by a change-of-basis matrix. What it does is not important. What is important is that we are able to take derivatives of functions like this one, and that's where the Jacobian comes in. Here's how we would write the Jacobian matrix for this simple example:

$$\frac{\partial r_j}{\partial v_i} = \begin{bmatrix} \dfrac{\partial r_1}{\partial v_1} & \dfrac{\partial r_1}{\partial v_2} & \dfrac{\partial r_1}{\partial v_3} \\[6pt] \dfrac{\partial r_2}{\partial v_1} & \dfrac{\partial r_2}{\partial v_2} & \dfrac{\partial r_2}{\partial v_3} \\[6pt] \dfrac{\partial r_3}{\partial v_1} & \dfrac{\partial r_3}{\partial v_2} & \dfrac{\partial r_3}{\partial v_3} \end{bmatrix}$$

It's a little hard to tell at first glance, but the vertical axis in the matrix (what we typically think of as the first axis) corresponds to the elements of r, while the horizontal axis (the second axis) corresponds to the elements of v. Each row is the gradient of a different element of r, taken with respect to v.

This is the critical understanding which becomes essential going forward: The Jacobian matrix is like a column vector of gradient (row) vectors. Again, whether something is a contravariant column vector or a covariant row vector does not matter, so we can actually simplify this and say that the Jacobian matrix is a vector of gradient vectors.

We can use the Jacobian matrix to find the "derivatives" of vectors with respect to vectors in the same way that we can use gradients to find the "derivatives" of scalars with respect to vectors. In a way, this is the same square-rectangle analogy as when we were exploring tensors earlier; gradients are like single-row Jacobians.
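The same finite-difference trick extends to the Jacobian. Another rough sketch, with a made-up function from R^3 to R^3:

import numpy as np

def f(v):
    # an arbitrary vector-valued function of a vector input
    return np.array([v[0] + 1.0, 2.0 * v[1], v[0] * v[2]])

def numerical_jacobian(f, v, h=1e-6):
    out = f(v)
    J = np.zeros((out.size, v.size))  # rows follow r, columns follow v
    for i in range(v.size):
        bumped = v.copy()
        bumped[i] += h
        J[:, i] = (f(bumped) - out) / h
    return J

v = np.array([1.0, 2.0, 3.0])
print(numerical_jacobian(f, v))
# approximately:
# [[1. 0. 0.]
#  [0. 2. 0.]
#  [3. 0. 1.]]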

Now that we know generally what a Jacobian matrix is, let's generalize this to find a Jacobian tensor which we could apply to any function with a tensor input and a tensor output. Surprisingly, this turns out to be quite easy. When we took the derivative of a vector with respect to a vector, we said that the answer would have two axes/indices (j and i) and we said that the first axis corresponded to the vector in the numerator (r) while the second axis corresponded to the vector in the denominator (v).

When we take the derivative of a tensor A with respect to a tensor B, the resulting Jacobian tensor has first the indices of A and then the indices of B. What do I mean by this? Let's say that the tensor T is equal to the derivative of the matrix M with respect to the third order tensor S. We might write that like this:

$$T_{ijklm} = \frac{\partial M_{ij}}{\partial S_{klm}}$$

See how the indices of the "bottom" tensor are added on after the indices of the "top" tensor? This is also the same order in which you would access the indices in a language like Python:

T[i][j][k][l][m] #the ijklm-th element of T

And just to make sure there is no confusion, the ijklm-th element of T is the partial derivative of the ij-th element of M with respect to the klm-th element of S.
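In NumPy terms, the shape of the derivative tensor is just the shape of M followed by the shape of S. A quick sanity check with arbitrary sizes:

import numpy as np

M_shape = (2, 3)     # a matrix M with indices i, j
S_shape = (4, 5, 6)  # a 3rd order tensor S with indices k, l, m

# the derivative dM/dS has the indices of M followed by the indices of S
T = np.zeros(M_shape + S_shape)
print(T.shape)           # (2, 3, 4, 5, 6)
print(T[0, 1, 2, 3, 4])  # the ijklm-th element, here with i=0, j=1, k=2, l=3, m=4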

Great! There's one more thing we need before we can get back to the neural networks, but I promise I'll keep this one short!

3.4 Einstein Notation

Now we know generally what tensor derivatives are, but we are not yet equipped to apply them to functions containing tensor operations like addition between two vectors or matrix-vector multiplication. For that we need the Einstein summation convention. Einstein notation allows us to express multiplication between tensors in a more rigid and concrete way.

I believe that much of the confusion students experience when learning matrix multiplication stems from the fact that the standard matrix notation we all learn is not very concrete. We are taught to, "Dot the rows of the first with the columns of the second." Not only is this confusing, but it is not at all helpful for multiplying higher order tensors, and nowhere in the notation is this convention made obvious. It's one of those things that you just have to remember, because you can't tell from looking at two matrices written next to each other how you would multiply them.

Not only that, but what if we wanted to dot the columns of the first with the rows of the second for whatever reason? What if we wanted to multiply two vectors together elementwise instead of dotting them? What if we wanted to dot the rows of the first with the rows of the second? If we want to perform these kinds of operations in normal matrix notation we need to start transposing and defining new operations such as the Hadamard product.

Einstein notation gives us a much better way of expressing these operations. Here's what "normal" matrix multiplication looks like in standard Einstein notation:

$$C^i_{\ k} = A^i_{\ j}\, B^j_{\ k}$$

Why are some of the indices written as superscripts instead of as subscripts? Upper indices indicate a contravariant axis while lower indices represent a covariant axis. When an index appears as both an upper and a lower index, we sum over that index. I just wanted to include an example of standard Einstein notation so don't worry if this doesn't make sense yet. We are actually going to be using the modified version of Einstein notation found in this paper. In this modified Einstein notation we would write matrix multiplication like this:

$$C = A \cdot_{(ij,\; jk,\; ik)} B$$

Is that subscript on an operation? Yes. The first group of indices are those of the tensor on the left. The second group of indices are those of the tensor on the right. The third and final group of indices are those of the resulting tensor. You'll notice that j is present in the first two groups but not the third. Why is this? Let's rewrite matrix multiplication one last time:

$$C_{ik} = \sum_j A_{ij}\, B_{jk}$$

Starting to make sense now? When we multiply the matrices A and B we are actually summing "over" the dummy index j. In Einstein notation the summation is implied by the fact that the same dummy index (in this case j) is written twice. The modified Einstein notation just re-orders the way we write these things and explicitly states that we are summing over j; its absence from the third group means that it was summed over. This is formally referred to as contraction.
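NumPy's einsum function works on exactly this principle: you spell out the indices of the inputs and the indices of the output, and any index missing from the output gets summed over. A small check with arbitrary shapes:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

# "ij,jk->ik": j appears in both inputs but not in the output, so it is summed over
C = np.einsum('ij,jk->ik', A, B)

print(np.allclose(C, A @ B))  # True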

Here are some more examples of things written in both normal matrix notation as well as our modified Einstein notation:

$$Z = WA \quad\Longleftrightarrow\quad Z = W \cdot_{(ij,\; j,\; i)} A$$

$$Z = A \odot B \quad\Longleftrightarrow\quad Z = A \cdot_{(i,\; i,\; i)} B \qquad \text{(elementwise product)}$$

$$Z = A + B \quad\Longleftrightarrow\quad Z_{ij} = A_{ij} + B_{ij}$$

Note that in order to add two tensors their indices have to be the same. Take a moment and think about why that might be. If an index runs over a specific range, like 1 to n, then two tensors which have the same indices are also of the same shape and can therefore be added.

That's pretty much it for this section. If you're still confused then the paper I linked is a great resource, and you can find many more resources of varying quality on places like Wikipedia and in posts on sites like StackExchange.

4. The Tensor Calculus of a Neural Network

We finally are at the point where we can start applying what we've learned so far to neural networks and backpropagation! Remember the weighted sum calculated by multiplying the weight matrix by the activations of the previous layer and then adding the bias vector? If you've watched the 3blue1brown video (or you are otherwise familiar with neural networks) you'll know that we need the derivative of this weighted sum with respect to the weight matrix, the bias vector, and the activation vector from the previous layer. Let's start with the derivative with respect to the bias vector since we sort of already did that one:

$$\begin{aligned} \frac{\partial Z_i}{\partial B_k} &= \frac{\partial}{\partial B_k}\left(\sum_j W_{ij}\, A_j + B_i\right) \\ &= \sum_j\left(\frac{\partial W_{ij}}{\partial B_k}\, A_j + W_{ij}\,\frac{\partial A_j}{\partial B_k}\right) + \frac{\partial B_i}{\partial B_k} \\ &= \sum_j\left(0 \cdot A_j + W_{ij} \cdot 0\right) + \delta_{ik} = \delta_{ik} \end{aligned}$$

You probably have some questions which I shall now try to answer. Firstly, the zeros that you see represent tensors for which every element is zero. Secondly, yes, tensor calculus does use the product rule that we all know and love, and it works exactly like you might expect it to. Thirdly, we used k because the B that we are taking the derivative with respect to is a different "instance" of the B that is in the equation; if we did not do this, then our answer would have the repeated indices ii, which is not allowed. Finally, that character in the last line is a Greek delta, and in this context it represents the Kronecker Delta, which is defined as follows:

$$\delta_{ik} = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{if } i \neq k \end{cases}$$

If you have taken linear algebra already, you might realize that the Kronecker Delta is equivalent (in this context) to the identity matrix, the matrix with ones on the diagonal and zeros everywhere else.

Now let's try another derivative:

$$\begin{aligned} \frac{\partial Z_i}{\partial A_k} &= \sum_j\left(\frac{\partial W_{ij}}{\partial A_k}\, A_j + W_{ij}\,\frac{\partial A_j}{\partial A_k}\right) + \frac{\partial B_i}{\partial A_k} \\ &= \sum_j\left(0 \cdot A_j + W_{ij}\,\delta_{jk}\right) + 0 = W_{ik} \end{aligned}$$

The only thing of note here is that multiplying the weight matrix by the Kronecker Delta yields the weight matrix since the Kronecker Delta is functionally identical to the identity matrix. Now for the hardest derivative:

$$\begin{aligned} \frac{\partial Z_i}{\partial W_{kl}} &= \sum_j\left(\frac{\partial W_{ij}}{\partial W_{kl}}\, A_j + W_{ij}\,\frac{\partial A_j}{\partial W_{kl}}\right) + \frac{\partial B_i}{\partial W_{kl}} \\ &= \sum_j \delta_{ik}\,\delta_{jl}\, A_j + 0 + 0 = \delta_{ik}\, A_l \end{aligned}$$

I suppose I should address the elephant in the math. The craziness in the first half of the last line is what happens when you take the derivative of a matrix with respect to itself. This is similar to when we took the derivative of a vector with respect to itself and got the Kronecker Delta (which has the value one only where the indices have the same value). Except instead of having two indices, we now have four, and the only time the output should be one is when i equals k and j equals l. That leads us to the tensor in the equation:

$$\frac{\partial W_{ij}}{\partial W_{kl}} = \delta_{ik}\,\delta_{jl} = \begin{cases} 1 & \text{if } i = k \text{ and } j = l \\ 0 & \text{otherwise} \end{cases}$$
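If you want to double-check that last result, here is a rough numerical sketch (not part of the original derivation): perturb one entry of W at a time and compare the measured derivative to the formula above.

import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.standard_normal((m, n))
A = rng.standard_normal(n)
B = rng.standard_normal(m)
h = 1e-6

# numerical Jacobian: dZ_dW[i, k, l] is the derivative of Z_i with respect to W_kl
dZ_dW = np.zeros((m, m, n))
for k in range(m):
    for l in range(n):
        W_bumped = W.copy()
        W_bumped[k, l] += h
        dZ_dW[:, k, l] = ((W_bumped @ A + B) - (W @ A + B)) / h

# analytic answer from the derivation: delta_ik * A_l
analytic = np.einsum('ik,l->ikl', np.eye(m), A)

print(np.allclose(dZ_dW, analytic, atol=1e-4))  # True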

Well, that's the hard part done! The only thing left to do is to look at the partial derivative chain that allows us to do backpropagation. In this example I will use X to represent an arbitrary tensor which we may want to calculate the derivative with respect to. This could be a weight matrix, a bias vector, or an activation vector. C is the cost function. And just like in 3blue1brown's video, the upper indices represent the layer of the network the tensor is from.

The number of indices of the tensor X is not known (it could be one if it is a vector or two if it is a matrix) so we use s to represent the set of its indices. Also note that for the sake of brevity I am not including the subscripts for the multiplication operators because it should be fairly obvious which indices are being summed over.

$$\frac{\partial C}{\partial X^{(n)}_s} = \frac{\partial C}{\partial A^{(L)}_i} \cdot \frac{\partial A^{(L)}_i}{\partial Z^{(L)}_j} \cdot \frac{\partial Z^{(L)}_j}{\partial A^{(L-1)}_k} \cdot \frac{\partial A^{(L-1)}_k}{\partial Z^{(L-1)}_l} \cdots \frac{\partial Z^{(n)}_p}{\partial X^{(n)}_s}$$

This is an application of the chain rule. One way to think of this is that the denominator of one partial derivative cancels with the numerator of the next. If you follow this through, it should become apparent that the right side of the equation collapses and in doing so becomes the left side. This chain is just a way for us to split up the calculations in such a way that we can reuse some of them, but more on that in a bit.
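To make the chain a bit more tangible, here is a rough procedural sketch (not my actual framework code; the layer sizes and the starting value of the chain are stand-ins) of how the chain gets multiplied backwards through a small network using einsum and the derivatives above, plus the elementwise activation derivative discussed a bit further down:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(4)
sizes = [3, 4, 2]  # neurons per layer: input, hidden, output
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
Bs = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]

# forward pass, keeping every Z and A so the backward pass can reuse them
A = rng.standard_normal(sizes[0])
As, Zs = [A], []
for W, B in zip(Ws, Bs):
    Z = W @ A + B
    A = sigmoid(Z)
    Zs.append(Z)
    As.append(A)

# backward pass: the chain starts as dC/dA for the last layer,
# whatever your cost function says that is (stand-in value here)
chain = rng.standard_normal(sizes[-1])

grads_W, grads_B = [], []
for layer in reversed(range(len(Ws))):
    chain = chain * sigmoid_prime(Zs[layer])                # through dA/dZ (diagonal)
    grads_B.append(chain)                                   # dZ/dB is the Kronecker Delta
    grads_W.append(np.einsum('i,j->ij', chain, As[layer]))  # contract with dZ/dW = delta_ik A_l
    chain = np.einsum('i,ij->j', chain, Ws[layer])          # contract with dZ/dA_prev = W

print([g.shape for g in grads_W], [g.shape for g in grads_B])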

The partial derivative of the cost with respect to the last layer activations will vary depending on which cost function is used. In my case I started out using Sum of Squared Error and then dividing by the number of outputs at the end to calculate the Mean Squared Error (MSE). The indices get a little hard to keep track of since we have to use different indices when taking derivatives, but here's what that looks like:

$$C = \frac{1}{N}\sum_m \left(A^{(L)}_m - Y_m\right)^2$$

$$\frac{\partial C}{\partial A^{(L)}_n} = \frac{1}{N}\sum_m 2\left(A^{(L)}_m - Y_m\right)\frac{\partial A^{(L)}_m}{\partial A^{(L)}_n} = \frac{1}{N}\sum_m 2\left(A^{(L)}_m - Y_m\right)\delta_{mn} = \frac{2}{N}\left(A^{(L)}_n - Y_n\right)$$

(Here Y is the vector of target outputs and N is the number of output neurons.)

Index substitution can make nested derivatives like this more confusing, so it's better to think about the indices merely as stand-ins for the ranges they represent. Here we used m and n, but really all of the indices in the above equations represent the same range. We just use different letters to help us keep track of where different things come from. You may also wonder why we are bothering to include the Kronecker Deltas since they don't seem to do anything; they are just bookkeeping that records which index survives, and they disappear only once the sum over the dummy index is actually carried out.

If this still seems unintuitive to you, take a look at the single-variable equivalent problem:

$$c = (a - y)^2 \qquad\Longrightarrow\qquad \frac{dc}{da} = 2\,(a - y)$$

You can't always do an apples-to-apples comparison like this, but here it can help you to get a better general idea of what happened and why the answer should seem reasonable.

Now let's talk about the partial derivative of activation functions. Much of the time you will be using an elementwise activation function, meaning it will be applied to each of the elements of the vector independently. We'll use sigma to represent this elementwise operation.

$$A_i = \sigma(Z_i)$$

For this one it's much easier to just show what it looks like and then discuss what it means:

$$\frac{\partial A_i}{\partial Z_k} = \sigma'(Z_k)\,\delta_{ik} = \begin{bmatrix} \sigma'(Z_1) & 0 & \cdots & 0 \\ 0 & \sigma'(Z_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma'(Z_n) \end{bmatrix}$$

The elements of the derivative vector are placed along the diagonal of a matrix whose entries are otherwise zero. The i-th element of A is only affected by the i-th element of Z, and no element of Z affects any other element of A, which is why only the entries along the diagonal are non-zero.
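In code, that diagonal matrix is just np.diag applied to the vector of elementwise derivatives. A quick sketch using sigmoid as the example activation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

Z = np.array([0.5, -1.0, 2.0])

# dA_i/dZ_k = sigma'(Z_k) * delta_ik: the elementwise derivatives sit on the diagonal
dA_dZ = np.diag(sigmoid_prime(Z))
print(dA_dZ)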

5. Implementation

Now that we've done the math, implementing a neural network like this will be much easier, though still not easy. One of the most confusing things for me when I was working on this was that what the math treats as a layer is actually more like the transformation that moves values from one layer to the next. If you have ten bias vectors, you'll actually have eleven layers including the input and output layers.

Think back to the 3blue1brown video. What we have been considering a layer so far is actually one of the collections of lines feeding from one layer to the next. So if you have n "layers" in the traditional sense, you'll really have n-1 layers as we have discussed them.

I implemented my network framework in Python using NumPy's einsum() function, but you could do this in any language. My implementation involved a layer class with both input and output layer subclasses. The base layer class describes how to calculate all of the different derivatives for middle layers, and then the output and input layer classes describe how those calculations change for the first and last layers.

Another trick I used was storing the previous layer as a variable in the current layer. This way the network is like a tree with no branching (really more of a linked list).

But perhaps the most important thing I did was in figuring out how to reuse as many calculations as possible. For each layer you need to find the partial derivatives of the activations with respect to the bias vector, the weight matrix, and the previous activation vector. If we were to recalculate the entire derivative chain for each derivative and for each layer, it would be horribly inefficient.

That's why I structured the network the way I did. Each layer takes in the chain as it stands, calculates its relevant partial derivatives (one of which is the next step in the chain), and then calls the backpropagation method on the next layer back, passing along the updated chain.
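To give a feel for that structure, here is a heavily simplified sketch. This is not my actual code; the class and method names are made up, and it hard-codes a sigmoid activation, but it shows the idea of each layer holding a reference to the previous layer and passing the growing chain backwards:

import numpy as np

class Layer:
    """A stripped-down middle layer: Z = W A_prev + B, A = sigmoid(Z)."""

    def __init__(self, in_size, out_size, previous=None):
        self.W = np.random.standard_normal((out_size, in_size)) * 0.1
        self.B = np.zeros(out_size)
        self.previous = previous  # the layer feeding into this one

    def forward(self, A_prev):
        self.A_prev = A_prev
        self.Z = self.W @ A_prev + self.B
        return 1.0 / (1.0 + np.exp(-self.Z))

    def backward(self, chain, learning_rate=0.01):
        # chain arrives as dC/dA for this layer's activations
        s = 1.0 / (1.0 + np.exp(-self.Z))
        chain = chain * s * (1.0 - s)                      # through the activation derivative
        grad_B = chain                                     # dZ/dB is the Kronecker Delta
        grad_W = np.einsum('i,j->ij', chain, self.A_prev)  # dZ/dW gives an outer product
        chain = np.einsum('i,ij->j', chain, self.W)        # dZ/dA_prev is W

        self.W -= learning_rate * grad_W
        self.B -= learning_rate * grad_B

        if self.previous is not None:
            self.previous.backward(chain, learning_rate)   # keep the chain going backwards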

If you want, you can check out my code, though I cannot guarantee that it is free of errors. I am reasonably sure that all of the math in this post is correct, but I cannot be 100% sure of that either since I have no one to check my work. So far the network has passed the tests I have put it through, but I've also been quite busy getting ready for college and building this website, so there is more testing to come.

Even using MSE (a regression-oriented cost function) the model is able to attain 90% accuracy on the Sklearn digits classification dataset after about ten epochs with an 80-20 train-test split. This project is still very much a work in progress, but I'm excited enough about it that I wanted to share what I have so far.

6. Conclusion

Thank you so much for reading! This is definitely my favorite project I've worked on to date, and I'm glad that I finally have a way to share some of it with others. Hopefully you learned something or thought about something in a different way. I often find that reframing the way you think about something is the key to understanding topics which previously seemed unreachable.

Pretty much all of this I figured out by researching on my own, but my thanks go out to Sören Laue who very nicely answered some of my questions when I emailed him, and whose paper was the clearest resource I could find after many, many hours of searching.

If you have any questions or comments, feel free to email me (check the links tab)! Thanks again for reading!