Q-function
Q-learning is a process for learning the Q-function via the update rule below (a minimal code sketch follows the list): \[ Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha[R_{t+1} + \gamma \max_aQ(S_{t+1},a) - Q(S_t,A_t)] \]
- initialization
- The initial values are generated arbitrarily, except for the terminal state, whose value is 0.
- parameters
- \(\alpha\)
- step size, \(\in (0,1]\)
- \(\gamma\)
- discount factor, \(\in [0,1]\)
- bootstrapping
- it is a bootstrapping method, as an existing estimate from the system (the Q-function itself) is used in the update target.
1. reference
https://richard-warren.github.io/blog/rl_intro_3/ [1] chapter 6.5
Backlinks
Touati, Ahmed and Ollivier, Yann ::: Learning One Representation to Optimize All Rewards
(core idea)
find, somehow, F and B with respect to a Q-function on the states and actions of the problem. F gives you a policy parameterized by each \(z \in \mathbb{R}^d\), and B can be combined with the reward to give you the index/parameter \(z\) that picks out the exact optimal policy from the many policies derived from \(F\).
Thus, F represents all of the futures from one state (to another), and B represents a way of getting there (see the sketch below).