Touati, Ahmed and Ollivier, Yann ::: Learning One Representation to Optimize All Rewards

Learn a representation of policies for going from any start state to any goal state. The idea is that each such policy maximizes the chance of reaching its goal state. When a reward is then combined with this representation, the policies that lead to desirable states are emphasized in the Q-function computed from the representations. An intermediate representation (the policy parameter z) carries this emphasis.

1. Core idea

Find, somehow, F and B for the states and actions of the problem, in relation to a Q-function. F gives a policy \(\pi_z\) for every parameter z in \(\mathbf{R}^d\), and B can be combined with the reward to give the index/parameter z that picks out the exact optimal policy among the many policies derived from F.

Thus, F represents all of the futures reachable from a state, and B represents the ways of getting to a state.
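
For reference, the core idea can be written out as follows. This is a sketch in the notation of these notes; the symbols \(M^{\pi_z}\) (the successor measure of the policy \(\pi_z\)) and \(\rho\) (the distribution of state-action pairs in the training data) do not appear above and are introduced here as assumptions.

\[ M^{\pi_z}(s,a,\mathrm{d}s',\mathrm{d}a') \approx F(s,a,z)^{\top} B(s',a')\,\rho(\mathrm{d}s',\mathrm{d}a'), \qquad \pi_z(s) = \arg\max_{a} F(s,a,z)^{\top} z \]

\[ z_r = \mathbf{E}_{(s,a)\sim\rho}\big[r(s,a)\,B(s,a)\big] \quad\Longrightarrow\quad Q_r^{\star}(s,a) \approx F(s,a,z_r)^{\top} z_r \]

The first line is the reward-free part (learn F and B once); the second line is where a reward enters, through \(z_r\) only.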

2. In practice

2.1. Getting F and B

They are learnt together, as neural networks, from states and transitions sampled at random from a dataset. F receives three inputs (state, action, and policy parameter z) and returns a vector in \(\mathbf{R}^d\). B receives two inputs (state and action) and returns a vector of the same dimension d.

2.2. Fitting in reward -> \(z_r\)

\(z_r\) is a vector in \(\mathbf{R}^d\): the expectation of \(r(s,a)B(s,a)\) over the states s and actions a used in training (or those visited by the exploration policy).

Its semantics: the ways of getting to each state, weighted by the reward of that individual state.
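
The reward-fitting step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: `F` and `B` are random, untrained stand-ins for the learned networks, and the toy sizes, the `reward` function, and the uniform sampling of (s, a) pairs are all hypothetical. With untrained F and B the resulting policy is meaningless; the point is the shape of the computation.

```python
import random

random.seed(0)

# Toy sizes (assumptions): 4 states, 2 actions, representation dimension D = 3.
N_STATES, N_ACTIONS, D = 4, 2, 3
PAIRS = [(s, a) for s in range(N_STATES) for a in range(N_ACTIONS)]

def random_vector(dim):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

# Hypothetical stand-ins for the trained networks (random, NOT learned):
# B(s, a) is a lookup table of vectors in R^D,
# F(s, a, z) is a fixed random linear map of z, also valued in R^D.
B_TABLE = {pair: random_vector(D) for pair in PAIRS}
F_WEIGHTS = {pair: [random_vector(D) for _ in range(D)] for pair in PAIRS}

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def B(s, a):
    return B_TABLE[(s, a)]

def F(s, a, z):
    return [dot(row, z) for row in F_WEIGHTS[(s, a)]]

def estimate_z_r(samples, reward):
    """Monte Carlo estimate of z_r = E[r(s, a) B(s, a)] over sampled (s, a)."""
    z = [0.0] * D
    for s, a in samples:
        r = reward(s, a)
        b = B(s, a)
        for i in range(D):
            z[i] += r * b[i]
    return [zi / len(samples) for zi in z]

def greedy_action(s, z):
    """pi_z(s) = argmax_a F(s, a, z)^T z."""
    return max(range(N_ACTIONS), key=lambda a: dot(F(s, a, z), z))

def reward(s, a):
    # Hypothetical reward: 1 in state 3, 0 elsewhere.
    return 1.0 if s == 3 else 0.0

# Stand-in for "the s and a used in training": uniform random pairs.
samples = [random.choice(PAIRS) for _ in range(1000)]
z_r = estimate_z_r(samples, reward)
policy = [greedy_action(s, z_r) for s in range(N_STATES)]
```

Note that the reward only enters through `z_r`: switching to a different reward means recomputing one D-dimensional vector, while F and B stay fixed.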

3. Comment

This paper is full of resources on the forward-backward framework:

  • lots of propositions and remarks that really discuss the framework
  • lots of math, definitions, and remarks on implementation

Backlinks

::: Fast Imitation via Behavior Foundation Models

a proof of concept using successor measures (namely the forward-backward framework)

They basically first learn a set [F, B, Cov B, \(\pi_z\)] with an algorithm in (NO_ITEM_DATA:touatiDoesZeroShotReinforcement2023). Cov B is the expectation of \(B(s)B(s)^T\) over the distribution of time spent in each state s when performing \(\pi_z\); the other components are computed by the forward-backward framework.

::: Fast Imitation via Behavior Foundation Models

(Method)

using a behaviour foundation model, with lots of tweaks and techniques:

Here’s a script to insert multiple org-roam nodes

(defun hermanhel-strings-to-hash (strings)
  "Convert a list of STRINGS to a hash table with the strings as keys."
  (let ((hash (make-hash-table :test 'equal)))
    (dolist (str strings)
      (puthash str t hash))
    hash))

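(defun hermanhel-org-roam-insert-multiple-nodes-as-list ()
  "Prompt for several org-roam node titles and insert them as a list of roam links."
  (interactive)
  ;; `let*' rather than `let': `selected-nodes' depends on `candidates'.
  (let* ((candidates (hermanhel-strings-to-hash (org-roam--get-titles)))
         (selected-nodes (citar--select-multiple "References: " candidates)))
    (dolist (title selected-nodes)
      (insert "+ " "[[roam:" title "]]" "\n"))))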

Author: Linfeng He

Created: 2024-04-03 Wed 19:37