Principle of Maximum Likelihood, or How to Find Neural Network Weights

Andriy Drozdyuk
1 min read · Oct 5, 2021

What we’re really after is the distribution of the weights W. Now, even a single weight vector w will do, so trivially the distribution can be a point mass: P(w)=1 for that specific w and 0 for everything else.
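Written out, that trivial point-mass distribution (sometimes called a Dirac delta at w) is just:

P(W = w′) = 1 if w′ = w, and 0 otherwise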

Since we don’t have the true distribution P(W), even if we were to somehow guess W (let’s denote our guess by Ŵ), we would have no way of knowing how wrong we are, and thus no way of adjusting our guess to make it less wrong.

But since our P(W|X,Y) can be written like this (using Bayes’ rule with three variables, conditioning everything on X):

P(W|X,Y) = P(Y|W,X) · P(W|X) / P(Y|X)

we don’t really care about the “blah blah” factors (the prior P(W|X) and the evidence P(Y|X)), and just have to get our P(Y|Ŵ,X) right by adjusting our Ŵ.

So if we get P(Y|Ŵ,X) as close as possible to the real P(Y|W,X), then our P(Ŵ|X,Y) will be closer to the real P(W|X,Y).
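Concretely, “getting it as close as possible” amounts to picking the weights under which the observed data is most probable. A rough sketch of that objective, assuming (my assumption, not stated above) that the data points (x_i, y_i) are independent given W:

Ŵ = argmax_W P(Y|W,X) = argmax_W ∏_i P(y_i|W,x_i) = argmax_W ∑_i log P(y_i|W,x_i)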

And this is the principle of maximum likelihood: the term P(Y|W,X) is called the “likelihood”, and “maximum” because we pick the Ŵ that makes the observed data points as probable as possible.
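In practice this is exactly how the weights get found: maximizing P(Y|Ŵ,X) is the same as minimizing the negative log-likelihood, and we do that by gradient descent on Ŵ. Here is a minimal runnable sketch; the one-layer sigmoid “network”, the toy data, and the learning rate are illustrative assumptions of mine, not anything from the article:

```python
import numpy as np

# A minimal sketch: a one-layer "network" with a sigmoid output and a
# Bernoulli likelihood P(y_i | w, x_i). The toy data, model, and learning
# rate are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # inputs x_i
true_w = np.array([2.0, -1.0])                       # the "real" W we never see
p_true = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=200) < p_true).astype(float)   # observed labels y_i

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    # -log P(Y | w, X) = -sum_i log P(y_i | w, x_i); maximizing the
    # likelihood is the same as minimizing this quantity.
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Gradient descent on the negative log-likelihood gives the
# maximum-likelihood guess w_hat (the Ŵ from the text).
w_hat = np.zeros(2)
lr = 0.5
for step in range(1000):
    p = sigmoid(X @ w_hat)
    grad = X.T @ (p - y) / len(y)   # gradient of the mean NLL for this model
    w_hat -= lr * grad

print("w_hat:", w_hat)                               # close to true_w
print("NLL at w_hat:", neg_log_likelihood(w_hat, X, y))
```

Running it gives a w_hat close to true_w, and the negative log-likelihood is the number that tells us “how wrong we are” — exactly what was missing when all we had was a blind guess Ŵ.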
