decoder-only transformer
a transformer built with only the transformer decoder (no encoder).
The key differences between it and an encoder-decoder transformer are:
- in an encoder-decoder transformer:
the encoder takes a whole sentence as input and produces one contextualized embedding per token; the full set of those embeddings is then handed to the decoder (via cross-attention).
for example, for "I eat apple and Kay eats pear", "I"'s meaning is calculated (in the form of a new embedding) from all the other words' position-aware embeddings (raw embedding + positional encoding, built with sin and cos and unique for each position; see the first sketch after this list), and so are all the others. The decoder then attends to all of these while generating output tokens one by one: 'I', 'eat', 'apple', ..., 'pear', '<end-of-sentence>'.
- the decoder takes one embedding and outputs another; it is trained so that the generated embedding is semantically the “next” one, so given “I”, it generates “eat” (the second sketch after this list shows this objective).
- in a decoder-only transformer:
- tokens are fed into the decoder directly with their raw embeddings (there is no separate encoder pass).
- a token's attention takes in scope only the tokens that come before it (and itself), i.e. a causal mask; the last sketch below shows this.
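A minimal NumPy sketch of the sin/cos positional encoding mentioned above, following the formula from "Attention Is All You Need"; the function name and shapes are illustrative choices, not from the original notes:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix; row p is the encoding for position p."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # sin on even dimensions
    pe[:, 1::2] = np.cos(angles)                     # cos on odd dimensions
    return pe

# the encoder input for each token is then: raw embedding + its position's row
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Each row of this matrix differs from every other row, which is why the encoding is unique per position rather than per word.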
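A toy sketch of the "predict the next token" training objective; it assumes the decoder's outputs have already been projected to vocabulary logits, and the function and variable names here are made up for illustration:

```python
import numpy as np

def next_token_loss(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """logits: (seq_len, vocab_size); token_ids: (seq_len,).
    Position t is trained to predict the token at position t + 1."""
    preds = logits[:-1]                              # predictions at positions 0..n-2
    targets = token_ids[1:]                          # the "next" token for each of them
    m = preds.max(axis=-1, keepdims=True)            # for numerical stability
    log_probs = preds - m - np.log(np.exp(preds - m).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# e.g. for the ids of "I eat apple": given 'I' the target is 'eat',
# given 'eat' the target is 'apple', and so on.
```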
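Finally, a minimal sketch of the causal ("only earlier tokens and self") attention pattern, assuming a single head and identity Q/K/V projections to keep it short:

```python
import numpy as np

def causal_self_attention(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model). Each position attends only to itself and earlier positions."""
    seq_len, d_model = x.shape
    q, k, v = x, x, x                                # identity projections for brevity
    scores = q @ k.T / np.sqrt(d_model)              # (seq_len, seq_len) similarity
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf                         # hide every later position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # row i mixes only positions 0..i
```

Row i of `weights` is zero for every column j > i, which is exactly the "tokens that come before it (and itself)" rule in the last bullet.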