mdp - When to use Policy Iteration instead of Value Iteration
I'm studying dynamic programming solutions to Markov decision processes. I feel I have a decent grip on VI and PI, and the motivation for PI is pretty clear to me (converging on the correct state utilities seems like unnecessary work when all we need is the correct policy). However, none of my experiments show PI in a favourable light in terms of runtime; it consistently seems to take longer, regardless of the size of the state space or the discount factor.
This could be due to the implementation (I'm using the BURLAP library) or to poor experimentation on my part, but the trends don't seem to show any benefit. It should be noted that the BURLAP implementation of PI is really "modified policy iteration", which runs a limited VI variant at each iteration. My question: do you know of any situations, theoretical or practical, in which (modified) PI should outperform VI?
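For concreteness, here is a rough tabular sketch of the two algorithms I am comparing, written in plain numpy rather than against BURLAP's API; the toy MDP, the evaluation depth k, and all of the names are illustrative assumptions, not anything from my actual experiments.

import numpy as np

# A small hypothetical 3-state, 2-action MDP (purely illustrative, not from BURLAP):
# P[a, s, s'] are transition probabilities and R[s, a] are expected rewards.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.8, 0.0, 0.2]],   # action 1
])
R = np.array([[0.0, 1.0], [0.0, 2.0], [5.0, 0.0]])

def value_iteration(P, R, gamma, tol=1e-6):
    # Repeated Bellman optimality backups; also count how many sweeps were needed.
    V = np.zeros(R.shape[0])
    sweeps = 0
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)        # Q[s, a]
        V_new = Q.max(axis=1)
        sweeps += 1
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1), sweeps
        V = V_new

def modified_policy_iteration(P, R, gamma, k=10, tol=1e-6):
    # Alternate k partial evaluation backups with one greedy improvement step.
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    sweeps = 0
    while True:
        # Partial evaluation: only k backups under the fixed policy, rather than
        # evaluating it to convergence as full policy iteration would.
        P_pi = P[policy, np.arange(n_states)]               # P_pi[s, s']
        R_pi = R[np.arange(n_states), policy]
        for _ in range(k):
            V = R_pi + gamma * P_pi @ V
            sweeps += 1
        # Improvement: act greedily with respect to the current value estimate.
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy) and np.max(np.abs(Q.max(axis=1) - V)) < tol:
            return V, policy, sweeps
        policy = new_policy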
It turns out that policy iteration, and modified policy iteration, can outperform value iteration when the discount factor (gamma) is high.
See Kaelbling, Littman and Moore (1996), "Reinforcement Learning: A Survey": http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a.pdf
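One way to see this is to compare, on a toy problem, how the work each solver does scales with gamma. The sketch below reuses the toy MDP and the value_iteration function from the question above and adds full policy iteration with an exact (linear-solve) evaluation; everything here is illustrative, not BURLAP's API. Value iteration needs roughly log(1/tol) / (1 - gamma) sweeps to converge, so its sweep count grows sharply as gamma approaches 1, whereas policy iteration on a small problem like this typically stabilises after only a handful of improvement steps.

# Full policy iteration with exact evaluation (illustrative; reuses P, R and
# value_iteration from the sketch in the question).
def policy_iteration(P, R, gamma):
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)
    steps = 0
    while True:
        # Exact evaluation of the current policy: solve (I - gamma * P_pi) V = R_pi.
        P_pi = P[policy, np.arange(n_states)]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedy improvement; stop once the policy no longer changes.
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        steps += 1
        if np.array_equal(new_policy, policy):
            return V, policy, steps
        policy = new_policy

for gamma in (0.5, 0.9, 0.99, 0.999):
    _, _, vi_sweeps = value_iteration(P, R, gamma)
    _, _, pi_steps = policy_iteration(P, R, gamma)
    print(f"gamma={gamma}: VI sweeps = {vi_sweeps}, PI improvement steps = {pi_steps}")

The catch is that each exact evaluation costs a linear solve, roughly O(|S|^3) for a tabular problem, which is the usual motivation for modified PI replacing it with a bounded number of cheap backups.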