Touati, Ahmed and Ollivier, Yann ::: Learning One Representation to Optimize All Rewards
Learn a representation of policies from any start state to any goal state. The idea is that each such policy maximizes the chance of reaching its goal state. Then, when a reward is combined with this representation, the policies that lead to desirable states are emphasized in the Q-function computed from the representations. An intermediate representation (the policy parameter) carries this emphasis.
1. core idea
find, somehow, F and B with respect to a Q-function on the states and actions of the problem. F gives you a parameterized policy for every \(z \in \mathbf{R}^d\), and B can be combined with the reward to give you the index/parameter \(z\) that picks the exact optimal policy out of the many policies derived from \(F\).
Thus, F represents all of the futures from one state (to another), and B represents a way of getting there.
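Spelled out (my reading of the paper's main relations; \(M^{\pi_z}\) is the successor measure and \(\rho\) the data distribution, notation may differ slightly from the paper):
\[
M^{\pi_z}(s_0,a_0,X) \;\approx\; \int_{X} F(s_0,a_0,z)^\top B(s',a')\,\rho(ds'\,da'),
\qquad
\pi_z(s) \;=\; \operatorname{arg\,max}_a F(s,a,z)^\top z
\]
\[
z_r \;=\; \mathbb{E}_{(s,a)\sim\rho}\big[r(s,a)\,B(s,a)\big],
\qquad
Q^{\pi_{z_r}}(s,a) \;\approx\; F(s,a,z_r)^\top z_r
\]
Once \(z_r\) is computed from the reward, \(\pi_{z_r}\) is (approximately) optimal for that reward, with no further planning or learning.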
2. In practice
2.1. getting F and B
They are learnt together, as neural networks, from randomly sampled states and transitions of a dataset. F receives 3 inputs (state, action, and \(z\)) and returns a \(d\)-dimensional vector; B receives 2 inputs (state and action) and returns a vector of the same dimension \(d\).
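A minimal shape sketch of how I picture the two networks (PyTorch; FNet, BNet, d, n_actions and state_dim are my own illustrative names and values, assuming a flat state vector and one-hot discrete actions; the paper's actual architecture, preprocessing and normalization differ):
#+begin_src python
import torch
import torch.nn as nn

d = 64          # representation dimension: z, F-output and B-output all live in R^d
n_actions = 4   # illustrative number of discrete actions
state_dim = 17  # illustrative state dimension

class FNet(nn.Module):
    """F(s, a, z) -> R^d  (3 inputs, d-dimensional output)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions + d, 256), nn.ReLU(),
            nn.Linear(256, d))
    def forward(self, s, a_onehot, z):
        return self.net(torch.cat([s, a_onehot, z], dim=-1))

class BNet(nn.Module):
    """B(s, a) -> R^d  (2 inputs, d-dimensional output)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, 256), nn.ReLU(),
            nn.Linear(256, d))
    def forward(self, s, a_onehot):
        return self.net(torch.cat([s, a_onehot], dim=-1))
#+end_src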
2.2. Fitting in reward -> \(z_r\)
\(z_r\) is a \(d\)-dimensional vector: the expectation of \(r(s,a)B(s,a)\) over the states and actions used in training (i.e. under the exploration/data distribution).
Semantically, it is the policy for getting to each state, weighted by the reward of that individual state.
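A hedged sketch of that computation, reusing the hypothetical FNet/BNet shapes above (a Monte Carlo estimate of \(z_r\) over a replay buffer, then greedy action selection; estimate_z_r and greedy_action are my own helper names, and details such as state-only rewards or normalization of z are glossed over):
#+begin_src python
import torch
import torch.nn.functional as nnF

def estimate_z_r(b_net, states, actions_onehot, rewards):
    """z_r = E[ r(s,a) * B(s,a) ] over (s, a, r) samples from the dataset."""
    with torch.no_grad():
        b = b_net(states, actions_onehot)               # (N, d)
        return (rewards.unsqueeze(-1) * b).mean(dim=0)  # (d,)

def greedy_action(f_net, state, z_r, n_actions):
    """pi_{z_r}(s) = argmax_a F(s, a, z_r)^T z_r."""
    with torch.no_grad():
        q = torch.stack([
            f_net(state, nnF.one_hot(torch.tensor(a), n_actions).float(), z_r) @ z_r
            for a in range(n_actions)])
        return int(q.argmax())
#+end_src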
3. Comment
This paper is full of resources around the Forward-Backward framework:
- lots of propositions and remarks, really discussing this framework in depth
- lots of math, definitions and remarks on implementation
Backlinks
::: Fast Imitation via Behavior Foundation Models
a proof of concept using successor measures (namely the forward-backward framework)
They basically first learn a set [F, B, Cov B, \(\pi_z\)] (where Cov B is the expectation of \(B(s)B(s)^T\) over the distribution of time spent in each state s when following \(\pi_z\), and the rest are computed by the forward-backward framework), with an algorithm from (NO_ITEM_DATA:touatiDoesZeroShotReinforcement2023)
::: Fast Imitation via Behavior Foundation Models
(Method)
using a behaviour foundation model, with lots of tweaks and techniques:
- FB (forward-backward) framework - FB-IL Touati, Ahmed and Ollivier, Yann ::: Learning One Representation to Optimize All Rewards
Here’s a script to insert multiple org-roam nodes
(defun hermanhel-strings-to-hash (strings)
  "Convert a list of STRINGS to a hash table with the strings as keys."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (str strings)
      (puthash str t hash))
    hash))

(defun hermanhel-org-roam-insert-multiple-nodes-as-list ()
  "Select multiple org-roam nodes and insert them as a plain list of roam links."
  (interactive)
  ;; `let*' so that `candidates' is bound when computing `selected-nodes'.
  (let* ((candidates (hermanhel-strings-to-hash (org-roam--get-titles)))
         (selected-nodes (citar--select-multiple "References: " candidates)))
    (dolist (title selected-nodes)
      (insert "+ " "[[roam:" title "]]" "\n"))))