multilayer perceptron

Table of Contents

_20240118_000723screenshot.png A multilayer perceptron is a layered architecture of neurons, where each layer's outputs feed the next layer's inputs.

1. algorithm

1.1. instant state

the good old linear combination: a weighted sum of the neuron's inputs (plus a bias)

forward propagation
from the input, calculate the instant states and outputs of each layer, and use those outputs as the input of the next layer, until the last one (a sketch follows below)
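
A minimal sketch of forward propagation in Python with NumPy (the layer-list representation, the sigmoid activation, and all names here are illustrative assumptions, not fixed by these notes):

  import numpy as np

  def sigmoid(S):
      return 1.0 / (1.0 + np.exp(-S))

  def forward(x, weights, biases):
      # weights: list of weight matrices, one per layer
      # biases:  list of bias vectors, one per layer
      # returns the output of every layer; outputs[0] is the input itself
      outputs = [x]
      a = x
      for W, b in zip(weights, biases):
          S = W @ a + b      # instant state: linear combination of the inputs
          a = sigmoid(S)     # output: activation applied to the instant state
          outputs.append(a)
      return outputs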

1.2. output

activation function
a lot of choices
identity
\(X = f(x) = x\)
sigmoid
\(X = f(x) = \frac{1}{1 + e^{-x}}\)

1.3. error

output error
\(\frac{1}{2} \sum^m_{j=1}(t^k_j-X^k_j)^2\)
performance on data set D
the sum of the output error over all input entries
MLP error function
the sum of the output error over all input entries in the training stage. Since the data don't change during training, the error function is a function of the weights, which are what get optimized: \[ E = E(W) \] (a small sketch follows below)
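
As a sketch of this error function, reusing the forward helper and NumPy import from above (a dataset of (input, label) pairs is an assumption):

  def mlp_error(weights, biases, dataset):
      # E(W): sum of the per-entry output errors over the whole data set
      E = 0.0
      for x, t in dataset:
          X = forward(x, weights, biases)[-1]   # network output for this entry
          E += 0.5 * np.sum((t - X) ** 2)       # output error for this entry
      return E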

1.4. performance

1.5. training

1.5.1. gradient descent

As the error function is \(E(W)\), gradient descent lets us determine a proportional adjustment to all of the individual weights (a vector of values) which, if applied at the same time (multiplied by -1), brings the error function down the fastest.

The gradient of \(E(W)\), once determined, contains all the \(\Delta w\)s.

1.5.2. weight update

Following the idea of gradient descent, \(\Delta w = -C \frac{dE}{dw}\) (for each respective weight), where \(C\) is the learning rate.
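
In code this update is one line per layer (a sketch; the learning rate C and per-layer gradient matrices are assumptions):

  def update_weights(weights, gradients, C=0.1):
      # Delta w = -C * dE/dw, applied to every weight matrix
      return [W - C * dW for W, dW in zip(weights, gradients)]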

1.5.3. backpropagation

determining the derivative of a specific weight/bias using the chain rule, starting from the output layer and moving one layer back at a time.

using the chain rule
the derivative of \(f(g(x))\) with respect to \(x\) is \(f'(g(x)) \cdot g'(x)\). Since the error function is a recursive composition of linear combinations and activation functions, with no extra multiplicative or log-like terms, the derivative is pretty much just a product of intermediate computation variables, like a weight, the instant state of a neuron, etc.
one layer back at a time
this is because a neuron may contribute to the error via multiple paths, and those derivatives have to be aggregated.

_20240118_015658screenshot.png

the algorithm
Use \(\frac{dE}{da^{\text{level}}_{k}}\) as intermediate shorthand.
calculating \(\frac{dE}{da^{\text{level}}_{k}}\)

starting from the closest upstream \(\frac{dE}{da^{\text{level}}_{k}}\):

  • if there is no such upstream term, we are at the output layer, where it is simply \(label - output\)
  • otherwise, the formula is as follows:

\[ \Delta a^{l-1} = \frac{dE}{da^{l}} \frac{d\text{activation}}{d\text{instant state}} \frac{d\text{instant state}}{da^{l-1}} \] So it is the same as for a weight, except that the last term is now \(w_l\) instead of \(a^{l-1}\), since the two are multiplied together as one term of the linear combination that forms the instant state.

In addition, \(\frac{dE}{da^{\text{level}}_{k}}\) has to be summed over all possible paths: if a neuron in a hidden layer feeds its output to 2 neurons in the next layer, its \(\frac{dE}{da^{\text{level}}_{k}}\) is the \(\Delta a^{l-1}\) computed along both paths added together, since the formula above handles only one path at a time.

calculating weight

_20240118_020748screenshot.png The formula is: \[ \Delta w^l = \frac{dE}{da^{l}} \frac{d\text{activation}}{d\text{instant state}} \frac{d\text{instant state}}{dw^l} \] Of the 3 terms:

  • \(\frac{dE}{da^{l}}\) is already computed for hidden layers, and trivial for the last layer (\(label_i - a^{last}_i\))
  • \(\frac{d\text{activation}}{d\text{instant state}}\) is easy, as the activation function's derivative is easy to evaluate and the instant state is already computed during forward propagation.
  • \(\frac{d\text{instant state}}{dw^l}\) is easy, as the map from \(w_l\) to the instant state is a linear combination, so the derivative of the instant state with respect to \(w_l\) is the previous layer's output \(a^{l-1}\) that is coupled with \(w_l\).

Start from the closest upstream \(\frac{dE}{da^{\text{level}}_{k}}\) (the one from the layer directly above), multiply it by the activation function's derivative (which uses the instant state), and by the previous-layer input that is matched with the weight, \(a^{(level - 1)}\): since the instant state is computed as \(S = w_1a^{(level-1)}_1 + w_2a^{(level-1)}_2 +... + w_na^{(level-1)}_n\), all terms other than the \(w_ia^{(level-1)}_i\) we are looking at vanish when differentiating with respect to \(w_i\). A code sketch of this pass follows below.
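
A sketch of one backward pass in the same NumPy style, reusing the forward helper above (sigmoid activations and the squared error are assumed; here \(\frac{dE}{da}\) at the output is written as output - label, so the minus sign is carried by the -C in the weight update sketched earlier):

  def backward(weights, outputs, target):
      # outputs: per-layer activations as returned by forward()
      # returns dE/dw and dE/db for every layer, one layer back at a time
      w_grads = [None] * len(weights)
      b_grads = [None] * len(weights)
      dE_da = outputs[-1] - target                  # output layer: dE/da for 1/2 * sum (t - X)^2
      for l in reversed(range(len(weights))):
          a = outputs[l + 1]                        # this layer's sigmoid output
          dE_dS = dE_da * a * (1.0 - a)             # times d(activation)/d(instant state)
          w_grads[l] = np.outer(dE_dS, outputs[l])  # times d(instant state)/dw = a^{l-1}
          b_grads[l] = dE_dS                        # d(instant state)/db = 1
          dE_da = weights[l].T @ dE_dS              # aggregate over all paths into layer l-1
      return w_grads, b_grads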

2. continuous and differentiable activation function

People want this so that the gradient of the error function can be calculated.

Here are some choices:

  • generic sigmoid \[ f(S) = \frac{\alpha}{1 + e^{-\beta S + \gamma}} + \lambda \]
    • tuning
      • \(\alpha\beta\) determines the steepness (the maximum slope)
      • \(\beta\) determines how the curve is scaled along the x-axis
      • \(\gamma\) determines the shift along the x-axis
      • \(\alpha, \lambda\) determine the value range in y (outputs lie between \(\lambda\) and \(\alpha + \lambda\))
    • derivative \[ f'(S) = \frac{df}{dS} = \frac{\beta}{\alpha}(f(S)-\lambda)(\alpha + \lambda - f(S)) \]
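
This derivative can be checked with a short derivation (not in the original notes): writing \(\sigma = \frac{1}{1 + e^{-\beta S + \gamma}}\) so that \(f = \alpha\sigma + \lambda\), \[ f'(S) = \alpha\beta\sigma(1-\sigma) = \frac{\beta}{\alpha}(\alpha\sigma)(\alpha - \alpha\sigma) = \frac{\beta}{\alpha}(f(S)-\lambda)(\alpha + \lambda - f(S)) \]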

2.1. sigmoid

the generic sigmoid with \(\alpha = 1, \beta = 1, \gamma = 0, \lambda = 0\) \[ f(S) = \frac{1}{1 + e^{-S}} \] \[ f'(S) = f(S)(1 - f(S)) \]
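
As a quick numeric sanity check of this derivative (a NumPy sketch, not part of the notes themselves):

  import numpy as np

  def sigmoid(S):
      return 1.0 / (1.0 + np.exp(-S))

  # compare the analytic derivative f(S)(1 - f(S)) against central differences
  S = np.linspace(-5.0, 5.0, 11)
  h = 1e-6
  numeric = (sigmoid(S + h) - sigmoid(S - h)) / (2 * h)
  analytic = sigmoid(S) * (1.0 - sigmoid(S))
  assert np.allclose(numeric, analytic, atol=1e-8)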

Backlinks

a neural network architecture to achieve sequence similarity mapping (map sequence A to sequence B that is similar to sequence A)

It defines several neural network modules that are mostly based on the multilayer perceptron, and combines them sequentially (stacking) and in parallel (multiplexing), namely:

Author: Linfeng He

Created: 2024-04-03 Wed 20:21