Everyone doing the Andrew Ng Coursera course hits the same wall. The lectures are good, the intuition is there, and then the backpropagation derivations show up and suddenly you're staring at a page of partial derivatives and delta notation and wondering if you missed a prerequisite.
I have a computer engineering background, not a math one. I can read the notation slowly but I don't move through it naturally. It took me about three weeks of reading the same explanations from different angles before it actually clicked.
This is the version I wish I had found on week one.
What the network is doing
A neural network runs an input through a series of layers. Each layer multiplies the input by a set of weights, adds a bias, and passes the result through an activation function. The output of the last layer is the prediction.
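In code, that loop of multiply, add bias, activate is only a few lines. Here's a sketch using sigmoid activations and random weights (my choices for illustration, not necessarily what a given course exercise uses):

```python
import numpy as np

def sigmoid(z):
    # squash any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # each layer: multiply by weights, add bias, apply the activation
    for W, b in layers:
        x = sigmoid(W @ x + b)
    return x

# a 2-input, 3-hidden, 1-output network with random weights
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 2)), np.zeros(3)),
          (rng.standard_normal((1, 3)), np.zeros(1))]
prediction = forward(np.array([0.5, -1.0]), layers)
```

The output of the last layer is the prediction; here it's a single number between 0 and 1.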
Training the network means finding weights that make the predictions good. "Good" is measured by a loss function, which gives you a single number representing how wrong the model is right now.
The goal is to make that number go down.
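One concrete choice is mean squared error; other losses (like the logistic cost used in classification exercises) play the same role, a single number where lower is better. A sketch:

```python
import numpy as np

def mse_loss(predictions, targets):
    # mean squared error: how far off are we, on average, squared
    return np.mean((predictions - targets) ** 2)

# predictions close to the targets give a small loss
loss = mse_loss(np.array([0.9, 0.1]), np.array([1.0, 0.0]))
```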
Gradient descent
To make the loss go down you need to know which direction to move each weight. If you increase weight W by a tiny amount, does the loss go up or down? By how much?
That's the gradient. It tells you the slope of the loss with respect to each weight. Move in the downhill direction, take a small step, repeat.
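The update loop itself is short enough to show on a toy one-weight problem, with the gradient written out by hand:

```python
def loss(w):
    # a toy one-weight "loss": a parabola with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * gradient(w)  # step in the downhill direction
# w has crept very close to 3, where the loss is smallest
```

Real networks do exactly this, just with thousands of weights updated at once.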
The problem is calculating the gradient for every weight efficiently. Even a small network has thousands of them.
What backpropagation actually is
Backpropagation is an algorithm that calculates all those gradients in one pass, going backwards through the network.
It uses the chain rule. If A influences B and B influences the loss, then the gradient of the loss with respect to A is the gradient with respect to B multiplied by how much A affects B. You start at the output, calculate the gradient there, then pass it backwards layer by layer. Each layer receives a gradient, uses it to calculate its own weight updates, and passes a modified gradient to the layer before it.
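For a stack of sigmoid layers, that backward walk can be sketched like this (hypothetical helper names; the forward pass caches each layer's input and output so the backward pass can reuse them):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # remember each layer's input and output for the backward pass
    cache = []
    for W, b in layers:
        a = sigmoid(W @ x + b)
        cache.append((x, a))
        x = a
    return x, cache

def backward(grad_out, layers, cache):
    # walk the layers in reverse, applying the chain rule at each step
    grads = []
    for (W, b), (inp, out) in zip(reversed(layers), reversed(cache)):
        grad_z = grad_out * out * (1 - out)  # through the sigmoid
        grad_W = np.outer(grad_z, inp)       # this layer's weight gradients
        grad_b = grad_z                      # this layer's bias gradients
        grad_out = W.T @ grad_z              # gradient handed to the layer before
        grads.append((grad_W, grad_b))
    return list(reversed(grads))
```

Each iteration of that loop is one layer receiving a gradient, computing its own updates, and passing a modified gradient backwards.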
That's the whole thing. It's the chain rule applied systematically, backwards through the network.
What actually made it click
I stopped reading about it and implemented it from scratch on a tiny network. Two inputs, three hidden neurons, one output. No Theano, no neural network library, just Python and numpy.
When I had to write the backward pass myself, the abstractions became concrete numbers I could print and inspect. The partial derivative of the loss with respect to weight three wasn't a symbol, it was a float I could check against a numerical gradient estimate.
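That numerical estimate is just a nudge test: bump one weight up by a tiny epsilon, then down, and see how the loss moves. A central-difference sketch (hypothetical helper, not the course's code):

```python
import numpy as np

def numerical_gradient(loss_fn, w, i, eps=1e-5):
    # nudge weight i up and down, measure how the loss changes:
    # (loss(w + eps) - loss(w - eps)) / (2 * eps)
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    return (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)

# example: loss = sum of squares, whose true gradient is 2 * w
w = np.array([1.0, -2.0, 0.5])
est = numerical_gradient(lambda v: np.sum(v ** 2), w, 1)
```

If your backward pass disagrees with this estimate, the bug is in the backward pass. It's slow, since it needs two forward passes per weight, which is exactly why backpropagation exists.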
If you're stuck on this, build a toy network by hand. The confusion goes away fast once you have to turn the math into actual code.
With gusto, Fatih.