transformer encoder

in the transformer encoder, several steps are performed:

  1. tokens are mapped to their embeddings (normally via a learned lookup table, equivalent to a fully connected layer with the identity activation function)
  2. the position of each token in the sentence is fused into its embedding by the positional encoder -> positioned embedding
  3. inter-token similarity/correlation is measured with a query neural net and a key neural net, using the positioned embeddings as input
    1. first, each positioned embedding is mapped to a query embedding and a key embedding by the respective neural nets
    2. then the key embeddings are queried with the query embedding (a dot product), which outputs correlation scores; these are passed through a softmax to output the percentage contribution/relation of each word in the sentence (including the token itself)
    3. this finally outputs a vector like [.4 .3 .3], one entry per token's influence on the token currently being computed: if we have passed in the second token's embedding, this vector means the first token contributes/represents .4 of the second token's meaning, the second token .3, and the third .3. This is the self-attention vector
  4. the self-attention vector is fused with the output of a value neural net
    1. another neural net, analogous to the query and key nets, maps the positioned embeddings to value embeddings
    2. the value embeddings (one per token in the sentence) are then mixed with the self-attention vector ([.4 .3 .3] -> .4 * valueEmbedding1 + .3 * valueEmbedding2 + .3 * valueEmbedding3, where each valueEmbeddingN is a vector) to output the self-attention values (the self-attention embedding)
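Step 4.2 above is just a weighted sum. A tiny worked example in plain Python, using the [.4 .3 .3] vector from the notes and made-up 2-dimensional value embeddings:

```python
# hypothetical 2-dimensional value embeddings for a three-token sentence
value_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
attention = [0.4, 0.3, 0.3]  # the [.4 .3 .3] self-attention vector

# step 4.2: mix the value embeddings by the attention weights,
# dimension by dimension: .4 * v1 + .3 * v2 + .3 * v3
self_attention_value = [
    sum(w * v[dim] for w, v in zip(attention, value_embeddings))
    for dim in range(len(value_embeddings[0]))
]
# first dimension: .4*1.0 + .3*0.0 + .3*2.0 = 1.0
```

So each output dimension is a convex combination of the corresponding value-embedding dimensions, weighted by how much each token contributes.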
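The whole pipeline (steps 1 to 4) can be sketched as a minimal single-head self-attention in NumPy. This is an illustrative sketch, not a full encoder: the sinusoidal positional encoding follows the standard "Attention Is All You Need" formula, the projections are single linear layers, and all matrix names here are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: turns scores into percentages summing to 1
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    # standard sinusoidal positional encoding (step 2)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

def self_attention(token_ids, embed_table, W_q, W_k, W_v):
    # step 1: embedding lookup (a linear map with identity activation)
    x = embed_table[token_ids]
    # step 2: fuse in position -> positioned embeddings
    x = x + sinusoidal_positions(len(token_ids), embed_table.shape[1])
    # step 3.1: query, key (and step 4.1: value) projections
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # steps 3.2-3.3: query the keys, softmax into self-attention vectors
    weights = softmax(q @ k.T / np.sqrt(k.shape[1]))
    # step 4.2: mix value embeddings by the attention weights
    return weights @ v, weights
```

Each row of `weights` is one token's self-attention vector (like [.4 .3 .3]), so it sums to 1; the matching row of the first return value is that token's self-attention embedding.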

Author: Linfeng He

Created: 2024-04-03 Wed 23:17