Shafiullah, Nur Muhammad Mahi and Cui, Zichen Jeff and Altanzaya, Ariuntuya and Pinto, Lerrel ::: Behavior Transformers: Cloning \(k\) Modes with One Stone
Table of Contents
- 1. What they did in one sentence
- 2. What small tweaks they made to improve the result
- 3. Main ideas
- 4. My questions
- 4.1. why is action discretization an integral part of this
- 4.2. how can residual action be located
- 4.3. what is online rollout of a behaviour policy
- 4.4. why MT-Loss for the residuals
- 4.5. how are residual actions trained
- 4.6. why not have one head for each k-means center
- 4.7. is focal loss a generally better loss function than cross entropy
1. What they did in one sentence
They fed observations into a transformer seq-to-seq model whose input vocabulary is observations and whose output vocabulary is (discretized) actions.
2. What small tweaks they made to improve the result
- binning and unbinning actions - improves multimodality, a lot
- binning - k-means with k bins over all continuous actions
- in training, simultaneously trained action residuals with another decoder head
- historical context (window) - improves everything a bit
3. Main ideas
3.1. Use transformer to do the job
3.2. Discretize action: action -> bins of actions
a transformer is a seq-to-seq model, while imitation maps observation to action, so actions have to be discretized before they can be fed into the decoder
3.3. Action Factorization: center + residual
center = nearest k-means center
residual = action - center
3.4. Action correction: bins of actions -> actions
an extra transformer decoder head that offsets the discretized action centers
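A minimal numpy sketch of the bin-and-offset factorization (3.3) and its inverse (3.4). This is my own sketch, not the paper's code: the `kmeans`, `factorize`, and `reconstruct` names are made up, and the real offset comes from a learned decoder head rather than the stored residual.

```python
import numpy as np

def kmeans(actions, k, iters=50, seed=0):
    """Plain k-means over continuous actions; returns the k bin centers."""
    rng = np.random.default_rng(seed)
    centers = actions[rng.choice(len(actions), k, replace=False)]
    for _ in range(iters):
        # assign each action to its nearest center
        labels = np.argmin(((actions[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute each center as its cluster mean (keep old center if empty)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = actions[labels == j].mean(axis=0)
    return centers

def factorize(action, centers):
    """action -> (bin index, residual) with residual = action - center."""
    b = int(np.argmin(((centers - action) ** 2).sum(-1)))
    return b, action - centers[b]

def reconstruct(b, residual, centers):
    """bin index + residual -> continuous action (lossless inverse)."""
    return centers[b] + residual

acts = np.random.default_rng(1).normal(size=(200, 2))
C = kmeans(acts, k=4)
b, r = factorize(acts[0], C)
a_rec = reconstruct(b, r, C)  # exactly recovers acts[0]
```

Factorize then reconstruct is lossless, which is why the discretization costs nothing as long as the offset head can predict the residual.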
3.5. focal loss
cross entropy: L(p) = -log(p)
focal loss: L(p) = -(1-p)^γ log(p) -> focal loss stays close to cross entropy when p is small (hard examples) and is heavily down-weighted when p is large (easy examples)
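The two losses side by side, to make the down-weighting concrete (a sketch of the standard formulas with γ = 2, not the paper's code):

```python
import numpy as np

def cross_entropy(p):
    # standard CE on the probability assigned to the true class
    return -np.log(p)

def focal_loss(p, gamma=2.0):
    # focal loss scales CE by (1-p)^gamma, down-weighting easy (high-p) examples
    return -((1 - p) ** gamma) * np.log(p)

# easy example: (1-0.9)^2 = 0.01 -> focal loss is 100x smaller than CE
ratio_easy = focal_loss(0.9) / cross_entropy(0.9)
# hard example: (1-0.1)^2 = 0.81 -> focal loss is close to CE
ratio_hard = focal_loss(0.1) / cross_entropy(0.1)
```

So the gradient signal concentrates on bins the model still gets wrong, which matters when most probability mass sits in a few easy bins.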
3.6. components
- binning actions
- offset to reconstruct
- use historical context (window)
- use attention-based MinGPT trunk (transformer)
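A toy sketch of how the components combine at decode time. The trunk is faked here with random linear maps (real BeT uses a MinGPT transformer over a window of observations); `W_bin`, `W_off`, and `act` are my own names, purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, obs_dim, act_dim = 4, 3, 2
centers = rng.normal(size=(k, act_dim))          # frozen k-means bin centers
W_bin = rng.normal(size=(obs_dim, k))            # stand-in for the bin-logit head
W_off = rng.normal(size=(obs_dim, k * act_dim))  # stand-in for the offset head

def act(obs):
    """Pick the most likely bin, then add that bin's predicted offset."""
    logits = obs @ W_bin                          # (k,) bin scores
    offsets = (obs @ W_off).reshape(k, act_dim)   # one offset per bin
    b = int(np.argmax(logits))
    return centers[b] + offsets[b]
```

Note the offset head predicts one offset per bin (a k x act_dim matrix) and only the chosen bin's row is used, which is what question 4.6 below is about.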
4. My questions
4.1. why is action discretization an integral part of this
as the residual action (which is continuous) is still fed into the transformer.
4.2. how can residual action be located
given that only the bin index is predicted
4.3. what is online rollout of a behaviour policy
4.4. why MT-Loss for the residuals
what does the loss function do
4.5. how are residual actions trained
4.6. why not have one head for each k-means center
then you’d have good old vectors instead of a matrix
4.7. is focal loss a generally better loss function than cross entropy
or is it a good loss function specifically in this case? Why use it over cross entropy?