Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 8 years ago.
开发者_StackOverflow中文版 Improve this questionalt text http://img693.imageshack.us/img693/724/markov.png
I'm a bit confused about some points here:
- What does it mean to say that it will be successful 70% of the time he tries a given action? Does it mean that every time he tries to perform an action A, it will 70% of the time do that action A and the other 30% do the action that leads to the same state, or just that it is as if he always did the action A, but just 30% of the times he just doesn't does it? I hope I am making myself clear :(
- How is it possible to have several consecutive states with the same utility? In theory the utility shouldn't always decrease, the farther you are from states with a reward?
- Knowing only the info I gave above, is it possible to infer what is the discount factor(gamma)? If yes, how?
- Is it possible to calculate the Reward for the states? How?
There is a pattern to dealing with most MDP problems, but I think you've probably omitted some information from the problem description, most likely it has to do with the state you're trying to reach, or the way an episode ends (what happens if you run off the edge of the grid). I've done my best to answer your questions, but I've appended a primer on the process I use to deal with these types of problems.
Firstly utility is a fairly abstract measure of how much you want to be in a given state. It's definitely possible to have two states with equal utility, even when you measure utility with simple heuristics (Euclidean or Manhattan distance). In this case, I assume that the utility value and reward are interchangeable.
In the long term, the objective in these types of problems tends to be, how do you maximise your expected (long term) reward? The learning rate, gamma, controls how much emphasis you place on the current state versus where you would like to end up - effectively you can think of gamma as a spectrum going from, 'do the thing the benefits me most in this timestep' to at the other extreme 'explore all my options, and go back to the best one'. Sutton and Barto in there book on reinforcement learning have some really nice explanations of how this works.
Before you get started, go back through the question and make sure that you can confidently answer the following questions.
- What is a state? How many states are there?
- What is an action? How many actions are there?
- If you start in state u, and you apply an action a, what is the probability of reaching a new state v?
So the answers to the questions?
- A state is a vector (x,y). The grid is 5 by 5, so there are 25 states.
- There are four possible actions, {E,N,S,W}
- The probability of successfully reaching an adjacent state after applying a suitable action is 0.7, the probability of not moving (staying in the same state is 0.3). Assuming (0,0) is the top left cell and (4,4) is the bottom right cell, the following table shows a small subset of all of the possible transitions.
Start State Action Final State Probability --------------------------------------------------- (0,0) E (0,0) 0.3 (0,0) E (1,0) 0.7 (0,0) E (2,0) 0 ... (0,0) E (0,1) 0 ... (0,0) E (4,4) 0 (0,0) N (0,0) 0.3 ... (4,4) W (3,4) 0.7 (4,4) W (4,4) 0.3
How can we check that this makes sense for this problem?
- Check that the table has an appropriate number of entries. On a 5 by 5 grid there are 25 states and 4 actions, so the table should have 100 entries.
- Check to make sure that for an start state / action pair, only two entries have non-zero probability of occuring.
Edit. answering the request for the transition probabilities to the target state. The notation below assumes
- v is the final state
- u is the source state
- a is the action, where it isn't mentioned, it's implied that the action applied isn't relevant.
P( v=(3,3) | u =(2,3), a=E ) = 0.7 P( v=(3,3) | u =(4,3), a=W ) = 0.7 P( v=(3,3) | u =(3,2), a=N ) = 0.7 P( v=(3,3) | u =(3,4), a=S ) = 0.7 P( v=(3,3) | u =(3,3) ) = 0.3
ad.1) probably it is not that robot has always to move -- i.e. those 30% are "ah, now I rest a bit" or "there was no power to move at all".
I've formulated this problem as a Finite-Horizon Markov Decision Process and solved it via Policy Iteration. To the right of each iteration, there is a color-coded grid representation of the recommended actions for each state as well as the original reward grid/matrix.
Review the final policy/strategy at Stage 4. Does it agree with your intuition?
精彩评论