|
The delta rule is a gradient descent learning rule for updating the weights of the neurons in a single-layer perceptron. For a neuron <math>j<math> with activation function <math>g(x)<math>, the delta rule for <math>j<math>'s <math>i<math>th weight <math>w_{ji}<math> is
- <math>\Delta w_{ji}=\alpha(t_j-x_j) g'(h_j) x_i<math>
where <math>\alpha<math> is a small constant called the learning rate, <math>g(x)<math> is the neuron's activation function, <math>t_j<math> is the target output, <math>h_j<math> is the weighted sum of the neuron's inputs, <math>x_j<math> is the actual output, and <math>x_i<math> is the <math>i<math>th input. For a perceptron with a linear activation function <math>g(h)=h<math>, the derivative <math>g'(h_j)<math> equals 1, so the delta rule is commonly stated in simplified form as
- <math>\Delta w_{ji}=\alpha(t_j-x_j) x_i<math>
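As an illustration, a single application of the rule to one neuron can be sketched in Python. This is a minimal sketch, not a reference implementation: the function name `delta_rule_update` and the helpers `identity` and `one` are hypothetical names chosen for this example, and the weights, inputs, and learning rate are arbitrary.

```python
def delta_rule_update(weights, inputs, target, alpha, g, g_prime):
    """Return the weights of one neuron after a single delta-rule step."""
    # h_j: the weighted sum of the neuron's inputs
    h = sum(w * x for w, x in zip(weights, inputs))
    # x_j: the neuron's actual output
    output = g(h)
    # Factor shared by every weight: alpha * (t_j - x_j) * g'(h_j)
    scale = alpha * (target - output) * g_prime(h)
    # Each weight changes by that factor times its own input x_i
    return [w + scale * x for w, x in zip(weights, inputs)]


def identity(h):
    """Linear activation g(h) = h (assumed for this example)."""
    return h


def one(h):
    """Derivative of the linear activation, g'(h) = 1."""
    return 1.0


# With a linear activation the general rule reduces to the simplified form.
new_weights = delta_rule_update([0.0, 0.0], [1.0, 2.0], target=1.0,
                                alpha=0.1, g=identity, g_prime=one)
print(new_weights)
```

Here the output starts at 0 while the target is 1, so both weights increase, the second by twice as much because its input is twice as large.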
Derivation of the delta rule
The delta rule is derived by attempting to minimize the error in the output of the perceptron through gradient descent. The error for a perceptron whose outputs are indexed by <math>j<math> can be measured as
- <math>E=\sum_{j} \frac{1}{2}(t_j-x_j)^2<math>.
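The error measure translates directly into code. The following is a small sketch in which the function name `squared_error` and the sample values are assumptions made for illustration.

```python
def squared_error(targets, outputs):
    """Sum over all outputs j of one half the squared difference (t_j - x_j)."""
    return sum(0.5 * (t - x) ** 2 for t, x in zip(targets, outputs))


# Two outputs, each off by 0.5 from its target.
example_error = squared_error([1.0, 0.0], [0.5, 0.5])
print(example_error)
```

The factor of one half has no effect on where the minimum lies; it cancels the factor of 2 produced when the square is differentiated.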
In this case, we wish to move through "weight space" of the neuron (the space of all possible values of all of the neuron's weights) in proportion to the gradient of the error function with respect to each weight. In order to do that, we calculate the partial derivative of the error with respect to each weight. For the <math>i<math>th weight, this derivative can be written as
- <math>\frac{\partial E}{ \partial w_{ji} }<math>.
Because the weight <math>w_{ji}<math> affects only the output of the <math>j<math>th neuron, every other term of the summation has zero derivative with respect to it. We can therefore substitute the error formula above while omitting the summation:
- <math>\frac{\partial E}{ \partial w_{ji} } = \frac{ \partial \left ( \frac{1}{2} \left( t_j-x_j \right ) ^2 \right ) }{ \partial w_{ji} }<math>
Next we use the chain rule to split this into two derivatives:
- <math>= \frac{ \partial \left ( \frac{1}{2} \left( t_j-x_j \right ) ^2 \right ) }{ \partial x_j } \frac{ \partial x_j }{ \partial w_{ji} }<math>
To find the left derivative, we apply the power rule together with the chain rule; the minus sign comes from differentiating <math>-x_j<math> inside the square:
- <math>= - \left ( t_j-x_j \right ) \frac{ \partial x_j }{ \partial w_{ji} }<math>
To find the right derivative, we again apply the chain rule, this time differentiating with respect to the total input to <math>j<math>, <math>h_j<math>:
- <math>= - \left ( t_j-x_j \right ) \frac{ \partial x_j }{ \partial h_j } \frac{ \partial h_j }{ \partial w_{ji} }<math>
Note that the output of the neuron <math>x_j<math> is just the neuron's activation function <math>g()<math> applied to the neuron's input <math>h_j<math>. We can therefore write the derivative of <math>x_j<math> with respect to <math>h_j<math> simply as <math>g()<math>'s first derivative:
- <math>= - \left ( t_j-x_j \right ) g'(h_j) \frac{ \partial h_j }{ \partial w_{ji} }<math>
Next we rewrite <math>h_j<math> in the last term as the sum over all <math>k<math> weights of each weight <math>w_{jk}<math> times its corresponding input <math>x_k<math>:
- <math>= - \left ( t_j-x_j \right ) g'(h_j) \frac{ \partial \left ( \sum_{k} x_k w_{jk} \right ) }{ \partial w_{ji} }<math>
Because we are differentiating with respect to the <math>i<math>th weight only, the only term of the summation that depends on it is <math>x_i w_{ji}<math>; all other terms have zero derivative. Clearly,
- <math>\frac{ \partial \left ( x_i w_{ji} \right ) }{ \partial w_{ji} }=x_i<math>,
giving us our final equation for the gradient:
- <math>\frac{\partial E}{ \partial w_{ji} } = - \left ( t_j-x_j \right ) g'(h_j) x_i<math>
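The gradient formula can be checked numerically by comparing it with a central finite-difference estimate of the error's slope. The sketch below assumes a logistic sigmoid activation, which the text does not prescribe; all function names and sample values are likewise chosen only for illustration.

```python
import math


def sigmoid(h):
    """Logistic activation g(h), assumed here as an example of a differentiable g."""
    return 1.0 / (1.0 + math.exp(-h))


def sigmoid_prime(h):
    """First derivative g'(h) of the logistic activation."""
    s = sigmoid(h)
    return s * (1.0 - s)


def error(weights, inputs, target):
    """Error of a single neuron: one half of (t_j - x_j) squared."""
    h = sum(w * x for w, x in zip(weights, inputs))
    return 0.5 * (target - sigmoid(h)) ** 2


def analytic_gradient(weights, inputs, target):
    """Gradient from the derived formula: -(t_j - x_j) * g'(h_j) * x_i."""
    h = sum(w * x for w, x in zip(weights, inputs))
    return [-(target - sigmoid(h)) * sigmoid_prime(h) * x for x in inputs]


def numeric_gradient(weights, inputs, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        slope = (error(up, inputs, target) - error(down, inputs, target)) / (2 * eps)
        grads.append(slope)
    return grads


weights = [0.3, -0.2]
inputs = [1.0, 2.0]
target = 1.0

analytic = analytic_gradient(weights, inputs, target)
numeric = numeric_gradient(weights, inputs, target)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, numeric, max_diff)
```

If the derivation is correct, the two estimates should agree up to the small error inherent in the finite-difference approximation.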
As noted above, gradient descent tells us that our change for each weight should be proportional to the gradient and directed against it, so as to reduce the error: <math>\Delta w_{ji} = -\alpha \frac{\partial E}{ \partial w_{ji} }<math>, where <math>\alpha<math> is the proportionality constant. The two minus signs cancel, and we arrive at our target equation:
- <math>\Delta w_{ji}=\alpha(t_j-x_j) g'(h_j) x_i<math>.
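Applied repeatedly over a set of examples, the update drives the weights toward values that minimize the error. The following sketch trains a single neuron with a linear activation, so that <math>g'(h_j)=1<math>. The function name `train_linear_neuron`, the training data, the learning rate, and the number of passes are all arbitrary choices for this illustration; the targets were generated from the weights 2 and -1 so that an exact solution exists.

```python
def train_linear_neuron(samples, alpha, epochs):
    """Repeatedly apply the simplified delta rule to one linear neuron."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for inputs, target in samples:
            # Linear activation: the output x_j equals the weighted sum h_j
            output = sum(w * x for w, x in zip(weights, inputs))
            # Simplified delta rule: delta w_ji = alpha * (t_j - x_j) * x_i
            weights = [w + alpha * (target - output) * x
                       for w, x in zip(weights, inputs)]
    return weights


# Hypothetical training set generated from t = 2*x_1 - 1*x_2.
samples = [
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], -1.0),
    ([1.0, 1.0], 1.0),
]

learned = train_linear_neuron(samples, alpha=0.1, epochs=200)
print(learned)
```

With these settings the learned weights should end up close to 2 and -1. A learning rate that is too large can make the updates overshoot and diverge, which is why <math>\alpha<math> is described as a small constant.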
|