policy
In imitation learning and reinforcement learning, a policy is a map from states to actions: it tells the agent which action \(a\) to perform in state \(s\). \[ \pi(s) = a \quad\text{or}\quad \pi(\cdot \mid s) = \{P(a_1) = .1,\ P(a_2) = .2,\ P(a_3) = .01,\ \dots,\ P(a_n) = .09\} \] The mapping can also be probabilistic, i.e. it maps \(s\) to a probability distribution over the available actions.
In the tabular case, a policy can be stored as a matrix indexed by states and actions.
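A minimal sketch of this tabular view (the toy sizes and probabilities below are invented for illustration, not from the note): a stochastic policy stored as a \(|S| \times |A|\) matrix of action probabilities, and a deterministic policy obtained by taking the most likely action in each row.

```python
import numpy as np

# Hypothetical toy problem: 3 states, 2 actions.
n_states, n_actions = 3, 2

# Stochastic policy: row s holds P(a | s); each row sums to 1.
pi = np.array([
    [0.9, 0.1],   # in state 0, mostly pick action 0
    [0.5, 0.5],   # in state 1, uniform over actions
    [0.2, 0.8],   # in state 2, mostly pick action 1
])

def sample_action(state: int) -> int:
    """Draw an action from the distribution pi(. | state)."""
    return int(np.random.choice(n_actions, p=pi[state]))

# Deterministic policy: collapse each row to its most likely action.
deterministic_pi = pi.argmax(axis=1)

print(sample_action(1), deterministic_pi)
```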
Backlinks
Touati, Ahmed and Ollivier, Yann ::: Learning One Representation to Optimize All Rewards
Learn a representation of policies from any start state to any goal state; the idea is that such a policy maximizes the chance of reaching that goal state. When a reward is then used along with this representation, the policies that lead to desirable states are emphasized in the Q-function computed from the representation. An intermediate representation (the policy parameter) carries this emphasis.
temporal-difference learning
a way of evaluating a policy: use experience (a sequence of rewards \(\{R_{t+1}, R_{t+2}, \dots\}\)) to predict/estimate the value function.
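As a rough sketch of the idea (the transitions, step size, and discount below are made up for illustration), TD(0) nudges the value estimate of each visited state toward the one-step target \(R_{t+1} + \gamma V(S_{t+1})\):

```python
# TD(0) policy evaluation from a stream of experience.
# The episode data here is hypothetical; only the update rule matters.
gamma = 0.9    # discount factor
alpha = 0.1    # step size
V = {s: 0.0 for s in ["s0", "s1", "s2"]}   # value estimates under the policy

# Experience generated by following the policy: (state, reward, next_state)
episode = [("s0", 0.0, "s1"), ("s1", 0.0, "s2"), ("s2", 1.0, "s0")]

for s, r, s_next in episode:
    td_target = r + gamma * V[s_next]      # bootstrapped estimate of the return
    V[s] += alpha * (td_target - V[s])     # move V(s) toward the target

print(V)
```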
A Markov decision process is a 4-tuple:
- \(S\) - set of states, the state space
- \(A\) - set of actions, the action space
- \(P(s'|s,a)\) - probability that action \(a\) in state \(s\) leads to state \(s'\) in the next time step
- \(R_a(s,s')\) - immediate reward of transitioning from state \(s\) to state \(s'\) due to action \(a\)
An MDP is used to represent a sequential decision problem, and that representation is then used to optimize a policy for the decision-maker.
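A minimal sketch of the 4-tuple written out as data (the two-state problem, state/action names, and reward values below are invented for illustration): states \(S\), actions \(A\), transition probabilities \(P(s'\mid s,a)\), and rewards \(R_a(s,s')\).

```python
# A tiny hypothetical MDP written out as the 4-tuple (S, A, P, R).
S = ["low", "high"]                  # state space
A = ["wait", "work"]                 # action space

# P[(s, a)] maps next state s' -> P(s' | s, a)
P = {
    ("low", "wait"):  {"low": 1.0},
    ("low", "work"):  {"low": 0.3, "high": 0.7},
    ("high", "wait"): {"high": 0.8, "low": 0.2},
    ("high", "work"): {"high": 1.0},
}

# R[(s, a, s')] is the immediate reward R_a(s, s')
R = {
    ("low", "wait", "low"): 0.0,
    ("low", "work", "low"): -1.0,
    ("low", "work", "high"): 5.0,
    ("high", "wait", "high"): 1.0,
    ("high", "wait", "low"): 0.0,
    ("high", "work", "high"): 2.0,
}

def expected_reward(s, a):
    """E[R | s, a] = sum over s' of P(s'|s,a) * R_a(s, s')."""
    return sum(p * R[(s, a, s_next)] for s_next, p in P[(s, a)].items())

print(expected_reward("low", "work"))   # 0.3*(-1.0) + 0.7*5.0 = 3.2
```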