I am currently using Q-Learning to try to teach a bot how to move in a room filled with walls/obstacles. It must start in any place in the room and get to the goal state (this might be the tile that has a door, for example). Currently, when it wants to move to another tile, it goes to that tile, but I was thinking that in the future I might add a random chance of ending up on a different tile instead. It can only move up, down, left and right. Reaching the goal state yields +100 and all other actions yield 0.
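To make the setup concrete, here is a minimal sketch of how I model the room (the size, wall positions and goal tile below are placeholder values, not my real map):

```python
import random

# Placeholder room: size, obstacle tiles and goal tile are assumptions for illustration.
WIDTH, HEIGHT = 5, 5
WALLS = {(1, 1), (1, 2), (2, 2)}
GOAL = (4, 4)                      # e.g. the tile that has the door
ACTIONS = {'up': (0, -1), 'down': (0, 1), 'left': (-1, 0), 'right': (1, 0)}

def step(state, action):
    """Move deterministically; bumping into a wall or the border leaves the state unchanged.
    Reaching the goal yields +100, everything else yields 0."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in WALLS or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
        nxt = state
    return nxt, (100 if nxt == GOAL else 0)

def random_start():
    """The bot may start on any free, non-goal tile."""
    free = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)
            if (x, y) not in WALLS and (x, y) != GOAL]
    return random.choice(free)
```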
I am using the algorithm found here, which can be seen in the image below.
Now, regarding this, I have some questions:
- When using Q-Learning, a bit like with Neural Networks, must I make a distinction between a learning phase and a using phase? I mean, it seems that what they show in the first picture is a learning one and in the second picture a using one.
- I read somewhere that it'd take an infinite number of steps to reach the optimal Q-values table. Is that true? I'd say that isn't true, but I must be missing something here.
I've also heard about TD (Temporal Differences), which seems to be represented by the following expression:

Q(s, a) = Q(s, a) + alpha * [R(s, a) + gamma * max { Q(s', a') } - Q(s, a)]

which, for alpha = 1, just seems to be the one shown first in the picture. What difference does gamma make here?
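To make sure I read the expression right, this is how I'd write that update in code (a sketch with a dict-based Q table; the alpha and gamma values are just examples):

```python
from collections import defaultdict

ACTIONS = ['up', 'down', 'left', 'right']
ALPHA = 0.1              # learning rate (example value)
GAMMA = 0.9              # discount factor (example value)
Q = defaultdict(float)   # Q[(state, action)] defaults to 0

def td_update(state, action, reward, next_state):
    """One-step Q-learning / TD(0) update:
    Q(s, a) <- Q(s, a) + alpha * [R(s, a) + gamma * max_a' Q(s', a') - Q(s, a)]"""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```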
- I have run into some complications when I try a very big room (300x200 pixels, for example). As it essentially moves randomly, if the room is very big it will take a lot of time to get from the first state to the goal state. What methods can I use to speed it up? I thought of maybe having a table of trues and falses recording whether, in the current episode, I have already been in each state. If I have, I'd discard that move; if not, I'd go there. If I had already been in all the neighbouring states, I'd go to a random one. This way it'd be just like what I am doing now, except that I'd repeat states less often than I currently do (see the first sketch after this list).
- I'd like to try something other than my lookup table for Q-values, so I was thinking of using Neural Networks with back-propagation for this. I will probably try having a Neural Network for each action (up, down, left, right), as it seems that's what yields the best results (see the second sketch after this list). Are there any other methods (besides SVM, which seems way too hard to implement myself) that I could use and implement that would give me a good Q-value function approximation?
- Do you think Genetic Algorithms would yield good results in this situation, using the Q-values matrix as the basis for them? How could I test my fitness function? It gives me the impression that GAs are generally used for things that are much more random/complex. If we look carefully we will notice that the Q-values follow a clear trend - higher Q-values near the goal and lower ones the farther away you are from it. Trying to reach that conclusion by GA would probably take way too long?
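Here is a rough sketch of the "have I been there this episode" idea from the big-room question above (it assumes a deterministic step(state, action) -> (next_state, reward) function like the one sketched at the top, so I can peek at where an action would lead):

```python
import random

def pick_action(state, visited, step, actions):
    """Prefer actions leading to tiles not yet visited in this episode;
    if every neighbouring tile was already visited, fall back to a random action."""
    fresh = [a for a in actions if step(state, a)[0] not in visited]
    return random.choice(fresh if fresh else actions)

# Inside an episode I'd keep a set of visited tiles, call pick_action instead of
# choosing uniformly at random, and still apply the normal Q update after each move.
```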
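And this is roughly what I had in mind for the "one network per action" idea (a small sketch with NumPy; the layer size, learning rate and the (x, y) feature scaling are assumptions I'd have to tune, not something I know works well):

```python
import numpy as np

class TinyQNet:
    """A minimal one-hidden-layer network trained by back-propagation
    on the squared error between its output and a TD target."""
    def __init__(self, n_inputs=2, n_hidden=8, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def forward(self, x):
        self.x = np.asarray(x, dtype=float)
        self.h = np.tanh(self.W1 @ self.x + self.b1)
        return float(self.W2 @ self.h + self.b2)

    def train(self, x, target):
        """One gradient step on 0.5 * (Q(x) - target)^2."""
        err = self.forward(x) - target                 # dLoss/dOutput
        grad_h = err * self.W2 * (1.0 - self.h ** 2)   # back through tanh
        self.W2 -= self.lr * err * self.h
        self.b2 -= self.lr * err
        self.W1 -= self.lr * np.outer(grad_h, self.x)
        self.b1 -= self.lr * grad_h

# One net per action, as described above; features are the tile coordinates
# scaled into [0, 1] (assuming a 5x5 placeholder room).
nets = {a: TinyQNet(seed=i) for i, a in enumerate(['up', 'down', 'left', 'right'])}

def q_value(state, action):
    return nets[action].forward([state[0] / 5.0, state[1] / 5.0])

# For a transition (s, a, r, s') the TD target would be
#   target = r + GAMMA * max(q_value(s2, a2) for a2 in nets)
# and then nets[a].train([s[0] / 5.0, s[1] / 5.0], target).
```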
I'm not an expert on the topic, but I'll take a crack at responding directly to your many questions.
[BTW, I should get multiple +reps for each question!... Just kidding; if I were in it "for the SO rep", I'd steer clear of a post which will get a grand total of 20 views, with half of these visitors already having a rough idea of the concepts at hand.]
1) Q-Learning a two-phase thing?
Yes, Q-Learning implies two phases, a learning phase and an action phase. As with many automated learning algorithms it is possible to "keep on learning" while in the action phase.
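In practice the two phases often differ only in how actions are picked: lots of exploration while learning, (almost) greedy action selection afterwards. A minimal sketch, assuming a dict-style Q table (the epsilon values are placeholders):

```python
import random

def choose_action(Q, state, actions, epsilon):
    """Epsilon-greedy: a random action with probability epsilon, otherwise the best known one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# learning phase: e.g. epsilon = 0.3, and apply the Q update after every move
# action phase:   e.g. epsilon = 0.0 (or a small value if you keep on learning)
```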
2) Infinite number of steps for an optimal Q matrix?

I'm not sure where the statement requiring an infinite number of learning cycles to learn an optimal Q matrix comes from. To be sure (and unless the alpha and gamma factors are incorrect), the algorithm converges, if only at a possibly very slow rate. This prompts me to skip ahead and comment on your idea of a 300x200 game space, and well... YES!, for such a space, and given the reward model, it will take what seems like infinity to get an "optimal" Q table. Now, it may be possible that mathematically the algorithm never reaches the optimal nirvana, but for practical purposes, working on the asymptote is just good enough.
3) Role of gamma in TD model
This indicates the importance of deferring rewards on a path (here, with your model, literally) towards higher rewards. This generally prevents the algorithm from getting stuck in local maxima of the solution space, at the cost of making learning even slower...
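To see why, note that with a single +100 reward at the goal the learned value of a state d steps away converges to roughly 100 * gamma^(d-1), so gamma controls how far the goal "shines back" into the room. A quick illustration (the gamma values and distances are arbitrary):

```python
# Approximate optimal value of a state d steps from the goal: 100 * gamma**(d - 1)
for gamma in (0.5, 0.9, 0.99):
    print(gamma, [round(100 * gamma ** (d - 1), 1) for d in (1, 5, 10, 20)])
```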
4) Suggestions to help with learning a big maze
At the risk of betraying the nature of Q-Learning, you can start the robot at increasingly greater distances from the goal. This will help it improve the Q matrix in the area of the states which surround the goal first, and then leverage this partially learned Q matrix, with the initial state taken randomly within an increasing radius from the goal.
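A minimal sketch of that idea, assuming the goal position, walls and room size are known to the training loop (the radius schedule is a placeholder to tune):

```python
import random

def start_near_goal(goal, radius, width, height, walls):
    """Pick a random free tile within Manhattan distance `radius` of the goal
    (assumes at least one free tile exists within that radius)."""
    candidates = [(x, y) for x in range(width) for y in range(height)
                  if (x, y) not in walls and (x, y) != goal
                  and abs(x - goal[0]) + abs(y - goal[1]) <= radius]
    return random.choice(candidates)

# e.g. grow the radius as training progresses:
#   radius = 1 + episode // 500
```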
Another, riskier, approach (and indeed one that may further belie the true nature of Q-Learning) would be to change the R matrix to provide increasingly high rewards at randomly placed locations at decreasing distances from the goal. The downside to this approach is that it may introduce many local maxima into the solution space, where the algorithm may get stuck if the learning rate and other factors are not tweaked properly.
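A sketch of such a shaped R matrix, here as a simplified variant where the bonus depends only on the distance to the goal rather than being placed at random spots as described (the bonus scale, and whether shaping helps at all in your case, is something to experiment with):

```python
def shaped_reward(next_state, goal, goal_reward=100, bonus_scale=1.0):
    """Keep the big reward at the goal; elsewhere give a small bonus that
    grows as the Manhattan distance to the goal shrinks."""
    if next_state == goal:
        return goal_reward
    dist = abs(next_state[0] - goal[0]) + abs(next_state[1] - goal[1])
    return bonus_scale / (1 + dist)
```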
Both of these approaches, in particular the latter, can be interpreted as you (the designer) "wiring in" a solution. Others will say that this is merely a way of introducing a dash of DP into the mix...
5) Neural Net (NN) 6) Genetic Algorithm (GA)
No opinion about adding NN or GA into the mix.
I probably made enough of a fool of myself with some of the less-than-mathematically-accurate statements above. ;-)
- You should try to change the alpha and gamma values. They are important parameters.
- Try more episodes.
- Change the exploration rate. Too much exploration is not good, and not enough exploration is not good either.