My first impression of RL comes from AlphaGo, where policy and value networks are used. I found this a valuable idea: long-term reward can be encoded (in the value network) and used to guide behavior, even at the cost of sacrifices in the short term.
A typical formulation to train the policy network is given in AlphaGo's RL stage:

Δρ ∝ (∂ log p_ρ(a_t | s_t) / ∂ρ) · (z_t − v(s_t)),

where ρ is the policy network's weights, s is the state, v is the state value (used as a baseline), and z is the ground-truth game outcome. The value network itself is trained to minimize the MSE between v and z. At first I wasn't sure why the term is (z − v) rather than (v − z), but it follows from the sign convention: differentiating ½(z − v)² with respect to the value weights gives −(z − v) ∂v/∂θ, and gradient descent negates that, leaving (z − v) in the update.
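The sign argument above can be checked numerically. This is a minimal sketch, not AlphaGo's actual training code: the value network is collapsed into a single scalar parameter v, trained by gradient descent on the squared error against the outcome z.

```python
# Toy stand-in for the value network: one scalar parameter v,
# trained to minimize L = 0.5 * (z - v)^2 by gradient descent.
def value_update(v, z, lr=0.1):
    # dL/dv = -(z - v); gradient *descent* flips the sign,
    # so the step is v <- v + lr * (z - v), hence (z - v), not (v - z).
    return v + lr * (z - v)

v, z = 0.0, 1.0  # initial value estimate and game outcome
for _ in range(100):
    v = value_update(v, z)
print(round(v, 4))  # v has moved toward z
```

If the update used (v − z) instead, the same loop would push v away from z, which is how I convinced myself the paper's sign is correct.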
Sutton's book "Reinforcement Learning: An Introduction" mentions that policy-based methods are related to evolutionary computation, where no value function appears. But evolutionary computation requires the search space to be sufficiently small, and it ignores the agent's interaction with its environment (the state/sensor information gathered while acting).
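To make the contrast concrete, here is a hedged sketch of evolutionary-style policy search on a toy 2-armed bandit: the policy parameter is perturbed and a mutation is kept only if the evaluated return improves. No value function (and no per-step state information) is used anywhere. The setup is illustrative, not from AlphaGo or Sutton's book.

```python
import random

random.seed(0)

def expected_return(p):
    # arm 0 pays 0, arm 1 pays 1, so the expected return is just p
    return p * 1.0

p = 0.5  # policy parameter: probability of pulling arm 1
for _ in range(200):
    # mutate, clip to [0, 1], keep only if the whole-episode return improves
    candidate = min(1.0, max(0.0, p + random.gauss(0, 0.1)))
    if expected_return(candidate) > expected_return(p):
        p = candidate  # no value estimate involved at any point

print(round(p, 2))  # p climbs toward the better arm
```

The search treats each policy as a black box scored by total return, which is exactly why it scales poorly: it cannot exploit anything observed during an episode.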
Two things still confuse me right now:
- Why is the sensor/agent interaction important?
- RL usually relies on a search tree (Chapter 1 of Sutton's book), and it's not always feasible to build a search tree for every application; e.g., given a 3D image and a model, the search tree for fitting the model to the image becomes extremely large.
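A quick back-of-the-envelope supports the second worry: a full tree with branching factor b and depth d has (b^(d+1) − 1) / (b − 1) nodes, which explodes fast. The numbers below are illustrative, not taken from any specific model-fitting problem.

```python
def tree_size(branching, depth):
    # total nodes in a full tree: sum of b**k for k = 0..depth
    return (branching ** (depth + 1) - 1) // (branching - 1)

# e.g. discretizing a model-fitting decision into only 10 choices
# per step, over just 6 steps, already gives over a million nodes
print(tree_size(10, 6))
```

Even a coarse discretization of a continuous fitting problem blows past what exhaustive tree search can handle, which is presumably why such problems need function approximation instead.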