My Deep Learning journey began during the summer of 2024, and since then I've been trying to find a proper YouTube video or blog that describes the entire Math behind back-propagation. There are many sources that beautifully describe the intuition behind back-propagation, but they never go into the Math completely. There are quite a few sources that do touch the Math of back-propagation, but none of them were able to satisfy me.
So this summer I decided to create my own blog discussing the Math behind Back-propagation. Hopefully this blog fills the small gaps in the already huge amount of information about Back-propagation that is present on the Internet. Before starting this blog I'd highly recommend you watch the first four videos of this YouTube playlist. This playlist of videos from 3Blue1Brown beautifully explains the intuition behind back-propagation.
Get a pen and a notebook so you can work along with me. The best way to understand any Math is to do it… Now let's set up the neural network that we'll be using to understand the Math. We will be using a simple 3-layer neural network.
Here is our 3-layer binary classification neural network. We are using tanh activation functions for layers 1 and 2. For the final layer we will be using the sigmoid activation function.
For the sake of simplicity we will assume that training is done using a batch size of 1, meaning every step of training uses just one training example rather than a more typical batch size of 16 or 32.
Notations:
Here are all the parameters and outputs in every layer (the notation assumed in the sketches below is summarized right after this list):
Input Layer:
Layer 1:
Layer 2:
Layer 3:
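The notation figures from the original post are not reproduced here, so here is a short summary of the (assumed, but standard) notation used in the sketches that follow:

- $x$ : the input vector for a single training example (batch size 1), also written $a^{[0]}$
- $W^{[l]}, b^{[l]}$ : the weight matrix and bias vector of layer $l$
- $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ : the linear (pre-activation) output of layer $l$
- $a^{[1]} = \tanh(z^{[1]})$, $a^{[2]} = \tanh(z^{[2]})$, $a^{[3]} = \sigma(z^{[3]}) = \hat{y}$ : the layer activations
- $y$ : the true label (0 or 1), $\hat{y}$ : the predicted probability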
Before moving on to forward and backward propagation, there are a couple of Mathematical concepts that you need to familiarize yourself with. Feel free to come back to this section of the blog if you need to.
Transpose
The transpose of a matrix is simply another matrix where the rows of the new matrix are the columns of the old matrix and vice versa.
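As a quick illustration (a minimal NumPy sketch of my own, not from the original post):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3)

# The transpose swaps rows and columns, giving shape (3, 2)
print(A.T)
# [[1 4]
#  [2 5]
#  [3 6]]
```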
Matrix Multiplication
The golden rule for matrix multiplication between two matrices A and B (A×B) is that the number of columns in matrix A must be equal to the number of rows in matrix B. The resulting matrix will have the same number of rows as matrix A and the same number of columns as matrix B. Here is how you actually perform matrix multiplication:
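For example, here is a small NumPy sketch (again my own, not from the original figures) multiplying a (2, 3) matrix by a (3, 2) matrix:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [2, 3]])      # shape (3, 2): columns of A == rows of B

C = A @ B                   # shape (2, 2): rows of A, columns of B
print(C)
# [[ 7 11]
#  [16 23]]
```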
Differentiation
Differentiation enables us to find the rate of change of a function with respect to the variable that the function depends on. Here are a few important differentiation equations that one must know:
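The equations themselves were shown as a figure in the original post; the standard results most relevant to this derivation are:

$$\frac{d}{dx}x^n = n\,x^{n-1}, \qquad \frac{d}{dx}\ln x = \frac{1}{x}, \qquad \frac{d}{dx}\tanh x = 1 - \tanh^2 x, \qquad \frac{d}{dx}\sigma(x) = \sigma(x)\big(1 - \sigma(x)\big)$$

together with the chain rule, which is what back-propagation is built on:

$$\frac{d}{dx} f\big(g(x)\big) = f'\big(g(x)\big)\, g'(x)$$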
Partial differentiation
Partial differentiation enables us to find the rate of change of a function with respect to one of the many variables it depends on. Whenever we do partial differentiation with respect to a particular variable, we treat all the other variables as constants and apply the rules of normal differentiation. Watch this video for examples. For our particular scenario we must first understand how to perform partial differentiation of a scalar with respect to a vector and of a vector with respect to a vector:
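The worked examples were given as figures in the original; the underlying definitions, in one common layout convention (an assumption on my part), are: for a scalar $L$ and a vector $w \in \mathbb{R}^n$,

$$\frac{\partial L}{\partial w} = \begin{bmatrix} \dfrac{\partial L}{\partial w_1} & \cdots & \dfrac{\partial L}{\partial w_n} \end{bmatrix}^{T},$$

a vector with the same shape as $w$; and for a vector $z \in \mathbb{R}^m$ that depends on $w \in \mathbb{R}^n$, $\partial z / \partial w$ is the $m \times n$ Jacobian matrix whose $(i, j)$ entry is $\partial z_i / \partial w_j$.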
Forward propagation refers to the process of sending our inputs through the neural network to get a prediction. Before performing back-propagation, we must first perform forward propagation, since we need the outputs of forward propagation to perform back-propagation.
You might notice the use of the transpose when multiplying two vectors. The transpose ensures that the shapes of the matrices being multiplied comply with the rules of matrix multiplication. If you need a refresher on matrix multiplication rules, check out this YouTube video.
Layer 1 outputs from forward propagation:
Layer 2 outputs from forward propagation:
Now the output from layer 1 becomes the input for layer 2.
Layer 3 outputs from forward propagation:
Now the output from layer 2 becomes the input for layer 3.
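The layer-by-layer equations above were shown as figures; here is a minimal NumPy sketch of the whole forward pass for a single example. The layer sizes are my own assumptions, and I use the convention $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ from the notation summary above (the original figures may instead use $W^{[l]T} a^{[l-1]}$, which only changes which dimension of each weight matrix comes first):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layer sizes (the original post does not fix them)
n_x, n_1, n_2 = 4, 3, 2
rng = np.random.default_rng(0)

W1, b1 = rng.standard_normal((n_1, n_x)), np.zeros((n_1, 1))
W2, b2 = rng.standard_normal((n_2, n_1)), np.zeros((n_2, 1))
W3, b3 = rng.standard_normal((1, n_2)), np.zeros((1, 1))

x = rng.standard_normal((n_x, 1))  # one training example (batch size 1)

# Layer 1: linear step followed by tanh
z1 = W1 @ x + b1
a1 = np.tanh(z1)

# Layer 2: the output of layer 1 becomes the input of layer 2
z2 = W2 @ a1 + b2
a2 = np.tanh(z2)

# Layer 3: sigmoid squashes the output into (0, 1) for binary classification
z3 = W3 @ a2 + b3
y_hat = sigmoid(z3)
print(y_hat)  # the predicted probability
```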
Now that we have all the results from our forward propagation, let's move on to back-propagation.
Back-propagation refers to propagating the error from the output layer back to the earlier layers of the network and making corrections using gradient descent in order to reduce the loss and improve the overall accuracy and precision of the model. The loss function that we will be using in our case is Binary Cross Entropy. In real-world applications of binary classification, the loss function used during training is Binary Cross Entropy Loss with logits. But for the sake of simplicity we will be using the plain Binary Cross Entropy Loss.
Here is the equation that describes the Binary Cross Entropy Loss:
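The equation itself appeared as a figure in the original; for a single training example with true label $y \in \{0, 1\}$ and predicted probability $\hat{y} = a^{[3]}$, it is:

$$L(y, \hat{y}) = -\big(y \,\ln \hat{y} + (1 - y)\,\ln(1 - \hat{y})\big)$$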
Something I struggled with during my first couple of weeks was mixing up what the loss and cost functions were. The loss function (represented using 'L') measures the error of our model for a single training example, while the cost function (represented using 'J') measures the average error of our model over a single batch of training examples.
As we have assumed the batch size to be 1 for the sake of simplicity, the loss and the cost functions will be the same.
Finding partial derivatives for back-propagation updates at layer 3
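The full derivation lives in the original figures (the 4.3.x series); a compressed sketch of the key steps, using the notation assumed earlier, looks like this:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}, \qquad \frac{\partial \hat{y}}{\partial z^{[3]}} = \hat{y}\,(1 - \hat{y})$$

so by the chain rule

$$\frac{\partial L}{\partial z^{[3]}} = \hat{y} - y, \qquad \frac{\partial L}{\partial W^{[3]}} = \frac{\partial L}{\partial z^{[3]}}\, a^{[2]T}, \qquad \frac{\partial L}{\partial b^{[3]}} = \frac{\partial L}{\partial z^{[3]}}$$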
Finding partial derivatives for back-propagation updates at layer 2
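Again, only a compressed sketch of what the 4.4.x figures derive in full, where $\odot$ denotes element-wise multiplication and we use the fact that the derivative of $\tanh(z)$ is $1 - \tanh^2(z)$:

$$\frac{\partial L}{\partial a^{[2]}} = W^{[3]T}\,\frac{\partial L}{\partial z^{[3]}}, \qquad \frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \odot \big(1 - a^{[2]} \odot a^{[2]}\big)$$

$$\frac{\partial L}{\partial W^{[2]}} = \frac{\partial L}{\partial z^{[2]}}\, a^{[1]T}, \qquad \frac{\partial L}{\partial b^{[2]}} = \frac{\partial L}{\partial z^{[2]}}$$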
Finding partial derivatives for back-propagation updates at layer 1
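And the same pattern one layer further down (derived fully in the 4.5.x figures):

$$\frac{\partial L}{\partial a^{[1]}} = W^{[2]T}\,\frac{\partial L}{\partial z^{[2]}}, \qquad \frac{\partial L}{\partial z^{[1]}} = \frac{\partial L}{\partial a^{[1]}} \odot \big(1 - a^{[1]} \odot a^{[1]}\big)$$

$$\frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial z^{[1]}}\, x^{T}, \qquad \frac{\partial L}{\partial b^{[1]}} = \frac{\partial L}{\partial z^{[1]}}$$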
Updates to the parameters
Now we have found all the partial derivatives required for updating the parameters across all layers. Using equation 4 in figure 4.3.6, equation 6 in figure 4.3.9, equation 4 in figure 4.4.12, equation 6 in figure 4.4.17, equation 4 in figure 4.5.11 and equation 6 in figure 4.5.15, we can perform the parameter updates using the following equations.
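The update equations themselves were given as a figure; in their usual gradient-descent form, with learning rate $\alpha$ and for each layer $l \in \{1, 2, 3\}$, they are:

$$W^{[l]} := W^{[l]} - \alpha\,\frac{\partial J}{\partial W^{[l]}}, \qquad b^{[l]} := b^{[l]} - \alpha\,\frac{\partial J}{\partial b^{[l]}}$$

Since our batch size is 1, each $\partial J / \partial \cdot$ here is the same as the corresponding $\partial L / \partial \cdot$ computed above.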
Phewww! 🤯 That was one hell of a journey. Hopefully you didn't immediately jump to the conclusion… And if you did, I really don't blame you 😜. This stuff is really hard, and you will never be asked to do such derivations in a job interview. So this is just to exercise and flex your Math muscle. But if you actually went through every step one by one, hats off buddy! 🥳