Shafiullah, Nur Muhammad Mahi and Cui, Zichen Jeff and Altanzaya, Ariuntuya and Pinto, Lerrel ::: Behavior Transformers: Cloning \(k\) Modes with One Stone
Table of Contents
- 1. What they did in one sentence
- 2. What small tweaks they made to improve the result
- 3. Main ideas
- 4. My questions
- 4.1. why is action discretization an integral part of this
- 4.2. how can residual action be located
- 4.3. what is online rollout of a behaviour policy
- 4.4. why MT-Loss for the residuals
- 4.5. how are residual actions trained
- 4.6. why not have one head for each k-means center
- 4.7. is focal loss a generally better loss function than cross entropy
1. What they did in one sentence
They fed observations into a transformer seq-to-seq model whose input vocabulary is observations and whose output vocabulary is (discretized) actions.
2. What small tweaks they made to improve the result
- binning and unbinning actions - improves multimodality, a lot
- binning - k-means with k bins over all continuous actions
- in training, simultaneously trained action residuals with another decoder head
- historical context (window) - improves everything a bit
3. Main ideas
3.1. Use transformer to do the job
3.2. Discretize action: action -> bins of actions
a transformer is a seq-to-seq model, while imitation maps observation to action, so actions have to be discretized before they can be fed into the decoder
3.3. Action Factorization: center + residual
center = nearest k-means center
residual = action - center
3.4. Action correction: bins of actions -> actions
an extra transformer decoder head that offsets the discretized action centers
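A minimal numpy sketch of the bin-and-offset factorization (3.3) and its inverse (3.4). This is my own sketch, not the paper's code: the `kmeans`, `factorize`, and `reconstruct` names are made up, and the real offset comes from a learned decoder head rather than the stored residual.

```python
import numpy as np

def kmeans(actions, k, iters=50, seed=0):
    """Plain k-means over continuous actions; returns the k bin centers."""
    rng = np.random.default_rng(seed)
    centers = actions[rng.choice(len(actions), k, replace=False)]
    for _ in range(iters):
        # assign each action to its nearest center
        labels = np.argmin(((actions[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute each center as its cluster mean (keep old center if empty)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = actions[labels == j].mean(axis=0)
    return centers

def factorize(action, centers):
    """action -> (bin index, residual) with residual = action - center."""
    b = int(np.argmin(((centers - action) ** 2).sum(-1)))
    return b, action - centers[b]

def reconstruct(b, residual, centers):
    """bin index + residual -> continuous action (lossless inverse)."""
    return centers[b] + residual

acts = np.random.default_rng(1).normal(size=(200, 2))
C = kmeans(acts, k=4)
b, r = factorize(acts[0], C)
a_rec = reconstruct(b, r, C)  # exactly recovers acts[0]
```

Factorize then reconstruct is lossless, which is why the discretization costs nothing as long as the offset head can predict the residual.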
3.5. focal loss
cross entropy: L(p) = -log(p)
focal loss: L(p) = -(1-p)^γ log(p) -> focal loss stays close to cross entropy when p is small (hard examples) and is heavily down-weighted when p is large (easy examples)
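The two losses side by side, to make the down-weighting concrete (a sketch of the standard formulas with γ = 2, not the paper's code):

```python
import numpy as np

def cross_entropy(p):
    # standard CE on the probability assigned to the true class
    return -np.log(p)

def focal_loss(p, gamma=2.0):
    # focal loss scales CE by (1-p)^gamma, down-weighting easy (high-p) examples
    return -((1 - p) ** gamma) * np.log(p)

# easy example: (1-0.9)^2 = 0.01 -> focal loss is 100x smaller than CE
ratio_easy = focal_loss(0.9) / cross_entropy(0.9)
# hard example: (1-0.1)^2 = 0.81 -> focal loss is close to CE
ratio_hard = focal_loss(0.1) / cross_entropy(0.1)
```

So the gradient signal concentrates on bins the model still gets wrong, which matters when most probability mass sits in a few easy bins.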
3.6. components
- binning actions
- offset to reconstruct
- use historical context (window)
- use attention-based MinGPT trunk (transformer)
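A toy sketch of how the components combine at decode time. The trunk is faked here with random linear maps (real BeT uses a MinGPT transformer over a window of observations); `W_bin`, `W_off`, and `act` are my own names, purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, obs_dim, act_dim = 4, 3, 2
centers = rng.normal(size=(k, act_dim))          # frozen k-means bin centers
W_bin = rng.normal(size=(obs_dim, k))            # stand-in for the bin-logit head
W_off = rng.normal(size=(obs_dim, k * act_dim))  # stand-in for the offset head

def act(obs):
    """Pick the most likely bin, then add that bin's predicted offset."""
    logits = obs @ W_bin                          # (k,) bin scores
    offsets = (obs @ W_off).reshape(k, act_dim)   # one offset per bin
    b = int(np.argmax(logits))
    return centers[b] + offsets[b]
```

Note the offset head predicts one offset per bin (a k x act_dim matrix) and only the chosen bin's row is used, which is what question 4.6 below is about.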
4. My questions
4.1. why is action discretization an integral part of this
as the residual action (which is continuous) is still fed into the transformer.
4.2. how can residual action be located
given that only the bin index is predicted
4.3. what is online rollout of a behaviour policy
4.4. why MT-Loss for the residuals
what does the loss function do
4.5. how are residual actions trained
4.6. why not have one head for each k-means center
then you’d have good old vectors instead of a matrix
4.7. is focal loss a generally better loss function than cross entropy
or is it a good loss function specifically in this case? Why use it over cross entropy?