On reinforcement learning


My first impression of RL comes from AlphaGo,

[figure from the AlphaGo paper: the policy and value networks]

where policy and value networks are used together. I realized this is a valuable idea: the long-term outcome can be expressed explicitly and used to guide behavior, even at the price of short-term sacrifice.
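The short-term-sacrifice intuition can be made concrete with discounted returns. A toy sketch (my own example, not from the AlphaGo paper; the reward sequences are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over a trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

greedy  = [1.0, 0.0, 0.0, 0.0]   # grab a small reward now, nothing after
patient = [-0.5, 0.0, 0.0, 5.0]  # pay a cost now, larger payoff later

print(discounted_return(greedy))   # 1.0
print(discounted_return(patient))  # ~4.35, so the patient trajectory wins
```

With any discount factor close to 1, the trajectory that suffers early but wins late has the higher return, which is exactly what a value function lets the agent see.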

A typical formulation to train the policy network in AlphaGo is the REINFORCE-style update

Δρ ∝ ∂log p_ρ(a|s)/∂ρ · (z − v(s)),

where ρ are the policy weights, s is the state, v(s) is the predicted state value, and z is the ground-truth game outcome; the value network itself is trained by minimizing the MSE (z − v(s))². (I was initially unsure why the term is z − v rather than v − z: for the squared loss the sign is irrelevant, since (z − v)² = (v − z)², while in the policy update z − v is the advantage, positive exactly when the outcome beats the baseline's prediction.)
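The update can be sketched for a toy linear-softmax policy. This is generic REINFORCE with a learned baseline, matching the ρ, s, v, z notation above; it is not the actual AlphaGo training code, and the one-state bandit setup and all numbers are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
rho = np.zeros((n_actions, n_features))  # policy weights (rho in the post)
w_v = np.zeros(n_features)               # baseline/value weights

def policy(s):
    """Softmax over the linear scores rho @ s."""
    logits = rho @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def update(s, a, z, lr=0.1):
    """One REINFORCE-with-baseline step on outcome z."""
    global rho, w_v
    p = policy(s)
    v = w_v @ s                          # baseline v(s)
    # grad of log p_rho(a|s) for a linear softmax: (one_hot(a) - p) outer s
    grad_log = (np.eye(n_actions)[a] - p)[:, None] * s[None, :]
    rho += lr * (z - v) * grad_log       # policy step, weighted by advantage z - v
    w_v += lr * (z - v) * s              # gradient step on the MSE (z - v)^2

# One-state problem where only action 0 ever pays off.
s = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(2000):
    a = rng.choice(n_actions, p=policy(s))
    z = 1.0 if a == 0 else 0.0
    update(s, a, z)
# policy(s) now puts most of its mass on action 0
```

Note that the baseline only shifts the learning signal; subtracting v(s) reduces the variance of the update without changing which actions are reinforced on average.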

Sutton's book "Reinforcement Learning: An Introduction" mentions that policy-based methods are related to evolutionary computation, in which no value function appears at all. But evolutionary methods require the search space to be sufficiently small, and they make no use of the agent's moment-to-moment interaction with the environment: they only see whole-episode outcomes.
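To see the contrast, here is a minimal (1+1) evolution strategy: it mutates the parameters and selects purely on the whole-episode return, with no value function and no per-step sensory feedback. The quadratic objective is a stand-in of my own, not anything from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(theta):
    """Stand-in for a whole-episode return; maximized at theta = 1."""
    return -np.sum((theta - 1.0) ** 2)

theta = np.zeros(5)
best = episode_return(theta)
for _ in range(500):
    cand = theta + 0.1 * rng.standard_normal(5)  # mutate the parameters
    r = episode_return(cand)
    if r > best:                                 # selection on total return only
        theta, best = cand, r
# theta has climbed close to the optimum at all-ones
```

Because selection only looks at the final score, no credit is assigned to individual actions within an episode, which is the inefficiency Sutton's comparison points at.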

Two things still confuse me right now:

  1. Why is the sensor/agent interaction so important?
  2. RL is usually presented with a search tree (Chapter 1 of Sutton's book), and it's not always possible to build a search tree for every application. For example, if you have a 3D image and a model, the search tree for fitting the model to the image becomes extremely large.
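The blow-up is easy to quantify: an exhaustive tree has roughly b^d leaves for branching factor b and depth d. Using the commonly quoted rough figures for Go (b ≈ 250, d ≈ 150):

```python
def tree_size(branching, depth):
    """Leaves of an exhaustive search tree with uniform branching."""
    return branching ** depth

print(tree_size(3, 5))      # a tiny toy game: 243 leaves
print(tree_size(250, 150))  # rough Go estimate: more than 10**300 leaves
```

Anything continuous, like pose parameters for fitting a model to a 3D image, is even worse, since the "branching factor" is effectively infinite before discretization.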
