240628 Multivariate derivative chain rule with vector or matrix, in Machine Learning

A note before reading:
This post may use a lot of fairly casual English;
the mathematical terms and formulas may not be rigorous;
in short, it's fine as long as the meaning gets across.

While doing the Backpropagation lab from week 3 of the Math for ML course, some questions popped up in my mind.

Some background


$\boldsymbol{a}^{(L)}$ is the vector of node values at layer $L$: $\boldsymbol{a}^{(L)} = [a_0^{(L)}, a_1^{(L)}, \dots, a_{m-1}^{(L)}]$, where $m$ is the number of nodes at layer $L$.

$\boldsymbol{z}^{(L)}$ holds the values before they are passed to the activation function; it is just a trick to make differentiating the activation function easier.

$\boldsymbol{W}^{(L)}$ is the weight matrix from layer $L-1$ to layer $L$, and $\boldsymbol{b}^{(L)}$ is the bias vector from layer $L-1$ to layer $L$.

$\sigma$ is some activation function.

$C$ is the cost function, and $y_i$ is the truth value in the training dataset corresponding to an input $a_i^{(0)}$.

What we want to do is find good $\boldsymbol{W}^{(L)}$ and $\boldsymbol{b}^{(L)}$ that minimize the cost, and for that we need to take the partial derivative of $C$ with respect to each $\boldsymbol{W}$ and $\boldsymbol{b}$.
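For reference in what follows, the forward pass and the cost described above can be written out in formula form (this just restates the definitions above):

$$\boldsymbol{z}^{(L)} = \boldsymbol{W}^{(L)} \boldsymbol{a}^{(L-1)} + \boldsymbol{b}^{(L)}, \qquad \boldsymbol{a}^{(L)} = \sigma(\boldsymbol{z}^{(L)}), \qquad C = \sum_i (a_i^{(L)} - y_i)^2$$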

Questions

In the Backpropagation lab, I had doubts about this partial derivative formula:

$$\frac{\partial C}{\partial \boldsymbol{W}^{(3)}} = \frac{\partial C}{\partial \boldsymbol{a}^{(3)}} \frac{\partial \boldsymbol{a}^{(3)}}{\partial \boldsymbol{z}^{(3)}} \frac{\partial \boldsymbol{z}^{(3)}}{\partial \boldsymbol{W}^{(3)}}$$

  1. Why can we take the partial derivative of $C$ with respect to $\boldsymbol{W}$?
  2. The cost function sums the square of $a_i^{(L)} - y_i$; why does the derivative formula then contain the partial derivative of $C$ with respect to $\boldsymbol{a}$, and how does that work?

Analysis

In the previous lectures, I learned about the chain rule for multivariate derivatives, with the function being of this form:

$$f(\boldsymbol{x}(t)), \quad \boldsymbol{x} = [x_1(t), x_2(t), \dots, x_n(t)]$$

$f$ is a function of $x_1(t), x_2(t), \dots, x_n(t)$, and each element of $\boldsymbol{x}$ is a function of $t$, so in the end $f$ can be seen as a function of $t$. Thus, we can find the derivative of $f$ with respect to $t$ by using the multivariate chain rule:

Equation 1

$$\frac{df}{dt} = \frac{\partial f}{\partial \boldsymbol{x}} \frac{d\boldsymbol{x}}{dt}$$

Equation 2

$$\frac{df}{dt} = \frac{\partial f}{\partial x_1} \frac{dx_1}{dt} + \frac{\partial f}{\partial x_2} \frac{dx_2}{dt} + \dots + \frac{\partial f}{\partial x_n} \frac{dx_n}{dt}$$

The $\frac{\partial f}{\partial \boldsymbol{x}}$ and $\frac{d\boldsymbol{x}}{dt}$ in Equation 1 are vectors, and $\frac{df}{dt}$ is the dot product of these two vectors. The result of the dot product is exactly the sum in Equation 2. Therefore, we can use the simpler form, Equation 1.
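As a quick numerical sanity check (my own toy example, not from the course), we can verify that the dot-product form in Equation 1 matches a finite-difference estimate of $\frac{df}{dt}$:

```python
import numpy as np

# Toy example: f(x) = x1^2 + 3*x2, with x1(t) = sin(t), x2(t) = t^2.
def f(x):
    return x[0]**2 + 3*x[1]

def x_of_t(t):
    return np.array([np.sin(t), t**2])

t = 0.7
x = x_of_t(t)

# Equation 1: df/dt is the dot product of df/dx and dx/dt
df_dx = np.array([2*x[0], 3.0])        # [df/dx1, df/dx2]
dx_dt = np.array([np.cos(t), 2*t])     # [dx1/dt, dx2/dt]
chain_rule = df_dx @ dx_dt

# Finite-difference estimate of df/dt for comparison
h = 1e-6
numeric = (f(x_of_t(t + h)) - f(x_of_t(t - h))) / (2*h)

print(chain_rule, numeric)  # the two numbers should agree closely
```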

The independent variable $t$ here is a single variable. However, what would the formula be if $t$ were also a pack of variables? e.g.:

$$\boldsymbol{t} = [t_1, t_2]$$

To make it easier to understand, we can use an example. Say $\boldsymbol{x} = [x_1, x_2]$ and $\boldsymbol{t} = [t_1, t_2]$,
$f$ is a function of $\boldsymbol{x}$, i.e. $f(\boldsymbol{x})$ (or explicitly $f(x_1, x_2)$), and each element in $\boldsymbol{x}$ is a function of $\boldsymbol{t}$:

$$\boldsymbol{x} = [x_1(\boldsymbol{t}), x_2(\boldsymbol{t})]$$

or explicitly

$$\boldsymbol{x} = [x_1(t_1, t_2), x_2(t_1, t_2)]$$

So $f$ can be seen as a function of $t_1$ and $t_2$. Now, if we want to find the derivative of $f$ with respect to $\boldsymbol{t}$, we have to do partial differentiation with respect to $t_1$ and $t_2$ respectively:

$$\frac{\partial f}{\partial t_1} = \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t_1} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t_1}$$

It can be written as the dot product of the vector

$$\frac{\partial f}{\partial \boldsymbol{x}} = [\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}]$$

and the vector

$$\frac{\partial \boldsymbol{x}}{\partial t_1} = [\frac{\partial x_1}{\partial t_1}, \frac{\partial x_2}{\partial t_1}]$$

Thus,

$$\frac{\partial f}{\partial t_1} = \frac{\partial f}{\partial \boldsymbol{x}}\frac{\partial \boldsymbol{x}}{\partial t_1}$$

And similarly for $t_2$:

$$\frac{\partial f}{\partial t_2} = \frac{\partial f}{\partial \boldsymbol{x}}\frac{\partial \boldsymbol{x}}{\partial t_2}$$
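As a quick check with a toy example of my own (not from the course): take $f(x_1, x_2) = x_1 x_2$ with $x_1 = t_1 + t_2$ and $x_2 = t_1 t_2$. Then

$$\frac{\partial f}{\partial t_1} = \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t_1} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t_1} = x_2 \cdot 1 + x_1 \cdot t_2 = 2 t_1 t_2 + t_2^2$$

which matches differentiating $f = (t_1 + t_2)\, t_1 t_2 = t_1^2 t_2 + t_1 t_2^2$ directly with respect to $t_1$.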

Up to here, we can roughly answer part of question 1:

$C$ is a function of $\boldsymbol{a}^{(L)}$, and $\boldsymbol{a}^{(L)}$ is a function of $\boldsymbol{W}^{(L)}$ and $\boldsymbol{b}^{(L)}$. Therefore, $C$ can also be a function of $\boldsymbol{W}^{(L)}$ and $\boldsymbol{b}^{(L)}$.

Just like the example above ($f$, $t_1$, and $t_2$), we can take the partial derivative of $C$ with respect to $\boldsymbol{W}^{(L)}$ and $\boldsymbol{b}^{(L)}$.

However, there is a new question:

$\boldsymbol{W}^{(L)}$ is a matrix; how do we take a derivative with respect to a matrix?

Key point: derivative with respect to each element in matrix/vector

My idea is that this is just a simpler form of writing. We can't take a derivative with respect to a matrix or vector itself; what we actually do is take the derivative with respect to each element in the matrix or vector.

For the above example ($f$, $t_1$, and $t_2$), we can also put $\frac{\partial f}{\partial t_1}$ and $\frac{\partial f}{\partial t_2}$ together in a vector, and write:

$$\frac{\partial f}{\partial \boldsymbol{t}} = [\frac{\partial f}{\partial t_1}, \frac{\partial f}{\partial t_2}]$$
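If we additionally stack $\frac{\partial \boldsymbol{x}}{\partial t_1}$ and $\frac{\partial \boldsymbol{x}}{\partial t_2}$ as the columns of a matrix, the two dot products above can be packed into one row-vector-times-matrix product (this is only a more compact way of writing the same element-wise derivatives):

$$\frac{\partial f}{\partial \boldsymbol{t}} = \frac{\partial f}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{x}}{\partial \boldsymbol{t}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix} \begin{bmatrix} \frac{\partial x_1}{\partial t_1} & \frac{\partial x_1}{\partial t_2} \\ \frac{\partial x_2}{\partial t_1} & \frac{\partial x_2}{\partial t_2} \end{bmatrix}$$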

Answer to the questions

$C$ is a function of $\boldsymbol{a}^{(L)} = [a_0^{(L)}, a_1^{(L)}, \dots, a_{m-1}^{(L)}]$.

Each $a_i^{(L)}$ is a function of $\boldsymbol{w}_i^{(L)} = [w_{i,0}^{(L)}, w_{i,1}^{(L)}, \dots, w_{i,n-1}^{(L)}]$ and $b_i^{(L)}$, where $n$ is the number of nodes in layer $L-1$.

We are, in fact, not taking the derivative of $C$ with respect to the matrix $\boldsymbol{W}^{(L)}$, but with respect to each variable $w_{i,j}^{(L)}$; in the end we just write the result in the simpler form $\frac{\partial C}{\partial \boldsymbol{W}^{(L)}}$.

The same goes for $\frac{\partial C}{\partial \boldsymbol{a}^{(L)}}$: we take the derivative of $C$ with respect to each variable $a_i^{(L)}$, and in the end write it in a simpler form.

So let’s start by taking the derivative of $z_i^{(L)}$ with respect to $\boldsymbol{w}_i^{(L)}$.

Expand the formula $z_i^{(L)} = \boldsymbol{w}_i^{(L)} \cdot \boldsymbol{a}^{(L-1)} + b_i^{(L)}$:

$$z_i^{(L)} = [w_{i,0}^{(L)}, w_{i,1}^{(L)}, \dots, w_{i,n-1}^{(L)}] \cdot [a_0^{(L-1)}, a_1^{(L-1)}, \dots, a_{n-1}^{(L-1)}] + b_i^{(L)}$$
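Writing the dot product out as a sum over $j$ makes the next step easy to see:

$$z_i^{(L)} = \sum_{j} w_{i,j}^{(L)} a_j^{(L-1)} + b_i^{(L)}$$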

We can easily get the result of taking the derivative of $z_i^{(L)}$ with respect to $w_{i,j}^{(L)}$: it’s just $a_j^{(L-1)}$. But let’s keep the derivative symbols (since the concrete result is not important for our topic) and put all the derivatives in a vector:

Equation 3

$$\frac{\partial z_i^{(L)}}{\partial \boldsymbol{w}_i^{(L)}} = [\frac{\partial z_i^{(L)}}{\partial w_{i, 0}^{(L)}}, \frac{\partial z_i^{(L)}}{\partial w_{i, 1}^{(L)}}, \cdots, \frac{\partial z_i^{(L)}}{\partial w_{i, n-1}^{(L)}}]$$

And then we take the derivative of $a_i^{(L)}$ with respect to $z_i^{(L)}$:

$$\frac{\partial a_i^{(L)}}{\partial z_i^{(L)}} = \sigma^{\prime}(z_i^{(L)})$$

Then we do $\frac{\partial C}{\partial a_i^{(L)}}$. Remember that:

$$C = \sum_i (a_i^{(L)} - y_i)^2 = (a_0^{(L)} - y_0)^2 + (a_1^{(L)} - y_1)^2 + \cdots + (a_{m-1}^{(L)} - y_{m-1})^2$$

When we take the derivative with respect to $a_i^{(L)}$, only the $i$th term matters:

$$\frac{\partial C}{\partial a_i^{(L)}} = 2(a_i^{(L)} - y_i)$$

At last, we can get the partial derivative of $C$ with respect to each $w_{i,j}^{(L)}$:

$$\frac{\partial C}{\partial w_{i,j}^{(L)}} = \frac{\partial C}{\partial a_i^{(L)}}\frac{\partial a_i^{(L)}}{\partial z_i^{(L)}}\frac{\partial z_i^{(L)}}{\partial w_{i,j}^{(L)}}$$

All the derivative terms $\frac{\partial C}{\partial a_i^{(L)}}$, $\frac{\partial a_i^{(L)}}{\partial z_i^{(L)}}$, and $\frac{\partial z_i^{(L)}}{\partial w_{i,j}^{(L)}}$ are just scalars, and so is the result $\frac{\partial C}{\partial w_{i,j}^{(L)}}$.
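Just for concreteness (we won't need it below), substituting the three results computed above gives the explicit scalar:

$$\frac{\partial C}{\partial w_{i,j}^{(L)}} = 2(a_i^{(L)} - y_i)\, \sigma^{\prime}(z_i^{(L)})\, a_j^{(L-1)}$$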

Now, what if we replace the single variable $w_{i,j}^{(L)}$ with a vector $\boldsymbol{w}_i^{(L)}$?

Then the last term becomes Equation 3. It is a vector! The first and second terms don’t change, because we still take the derivative with respect to $a_i^{(L)}$ and $z_i^{(L)}$.

So what happens to $\frac{\partial C}{\partial w_{i,j}^{(L)}}$? It also turns into a vector!

$$\frac{\partial C}{\partial \boldsymbol{w}_i^{(L)}} = [ \frac{\partial C}{\partial w_{i,0}^{(L)}}, \frac{\partial C}{\partial w_{i,1}^{(L)}}, \cdots, \frac{\partial C}{\partial w_{i, n-1}^{(L)}} ]$$

What is $\boldsymbol{w}_i^{(L)}$? It is the $i$th row of the matrix $\boldsymbol{W}^{(L)}$. If we then do similar things to each row of $\boldsymbol{W}^{(L)}$ and stack the results together, we get a new matrix, and that is $\frac{\partial C}{\partial \boldsymbol{W}^{(L)}}$!

$$\frac{\partial C}{\partial \boldsymbol{W}^{(L)}} = \begin{bmatrix} \frac{\partial C}{\partial \boldsymbol{w}_0^{(L)}} \\ \frac{\partial C}{\partial \boldsymbol{w}_1^{(L)}} \\ \vdots \\ \frac{\partial C}{\partial \boldsymbol{w}_{m-1}^{(L)}} \end{bmatrix}$$
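To double-check the element-by-element picture, here is a rough numerical sketch (my own code, not the lab's): it builds $\frac{\partial C}{\partial \boldsymbol{W}^{(L)}}$ entry by entry with the scalar chain rule above, using a sigmoid as the example activation (so $\sigma^{\prime}(z) = \sigma(z)(1-\sigma(z))$), and compares it against finite differences.

```python
import numpy as np

# One layer: z = W @ a_prev + b, a = sigmoid(z), C = sum((a - y)**2)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, m = 4, 3                      # n nodes in layer L-1, m nodes in layer L
a_prev = rng.normal(size=n)      # a^(L-1)
W = rng.normal(size=(m, n))      # W^(L)
b = rng.normal(size=m)           # b^(L)
y = rng.normal(size=m)           # truth values

def cost(W):
    a = sigmoid(W @ a_prev + b)
    return np.sum((a - y) ** 2)

# Analytic gradient built element by element:
# dC/dw_ij = dC/da_i * da_i/dz_i * dz_i/dw_ij = 2(a_i - y_i) * sigma'(z_i) * a_j
z = W @ a_prev + b
a = sigmoid(z)
dC_dW = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        dC_dW[i, j] = 2 * (a[i] - y[i]) * a[i] * (1 - a[i]) * a_prev[j]

# Finite-difference check of every element of W
h = 1e-6
dC_dW_num = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        dC_dW_num[i, j] = (cost(Wp) - cost(Wm)) / (2 * h)

print(np.max(np.abs(dC_dW - dC_dW_num)))  # should be very small
```

If the element-by-element gradient matches the finite-difference estimate, that confirms stacking the row vectors $\frac{\partial C}{\partial \boldsymbol{w}_i^{(L)}}$ really does give a matrix with the same shape as $\boldsymbol{W}^{(L)}$.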

Take away

When there is a vector or matrix in a derivative formula, we are actually taking the derivative with respect to each element in the matrix or vector.