Dynamic Programming
Introduction
Dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).
We assume the environment is a finite MDP whose dynamics are given by a set of probabilities p(s′, r | s, a) for all s ∈ 𝒮, a ∈ 𝒜(s), r ∈ ℛ, and s′ ∈ 𝒮⁺ (𝒮 plus a terminal state, if the task is episodic).
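As a rough illustration of what such a model can look like in code, the sketch below stores p(s′, r | s, a) for a tiny MDP as a dictionary and checks that the probabilities for each state-action pair sum to 1. The states, actions, rewards and the helper `is_valid_dynamics` are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical two-state MDP (an assumption, not taken from the notes).
# dynamics[(s, a)] is a list of (next_state, reward, probability) triples,
# i.e. a tabular encoding of p(s', r | s, a). "T" denotes a terminal state.
dynamics = {
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"): [("B", 1.0, 0.8), ("A", 0.0, 0.2)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"): [("T", 5.0, 0.9), ("A", -1.0, 0.1)],
}


def is_valid_dynamics(dynamics):
    """Return True if, for every (s, a), the outcome probabilities sum to 1."""
    return all(
        math.isclose(sum(p for _, _, p in outcomes), 1.0)
        for outcomes in dynamics.values()
    )
```

A "perfect model" in the sense used here means that a table like this is fully known to the algorithm in advance.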
The key idea of DP is the use of value functions to organize and structure the search for good policies.
Once we have found the optimal value functions, which satisfy the Bellman optimality equations, we can easily obtain optimal policies.
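To make this step concrete, here is a minimal sketch under stated assumptions. It approximates the optimal state-value function with value iteration (one DP method, used here only as an example) and then reads off a policy by acting greedily with respect to it. The toy MDP, the discount factor `GAMMA` and the function names are hypothetical and not part of the notes.

```python
# Hypothetical MDP: dynamics[(s, a)] -> list of (next_state, reward, probability).
# "T" is a terminal state; it has no entry and its value is taken to be 0.
dynamics = {
    ("A", "stay"): [("A", 0.0, 1.0)],
    ("A", "go"): [("B", 1.0, 0.8), ("A", 0.0, 0.2)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"): [("T", 5.0, 0.9), ("A", -1.0, 0.1)],
}
GAMMA = 0.9  # assumed discount factor


def actions_in(s, dynamics):
    """Actions available in state s."""
    return [a for (state, a) in dynamics if state == s]


def q_value(s, a, v, dynamics, gamma):
    """Expected return of taking a in s and then following the values in v."""
    return sum(p * (r + gamma * v.get(s2, 0.0)) for s2, r, p in dynamics[(s, a)])


def value_iteration(dynamics, gamma, tol=1e-10):
    """Sweep the Bellman optimality backup until the largest change is below tol."""
    states = sorted({s for s, _ in dynamics})
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, v, dynamics, gamma) for a in actions_in(s, dynamics))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


def greedy_policy(v, dynamics, gamma):
    """In each state, pick the action with the highest one-step lookahead value."""
    return {
        s: max(actions_in(s, dynamics), key=lambda a: q_value(s, a, v, dynamics, gamma))
        for s in v
    }


v_star = value_iteration(dynamics, GAMMA)
policy = greedy_policy(v_star, dynamics, GAMMA)
print(policy)
```

The one-step lookahead in `greedy_policy` is what makes the policy easy to obtain: given the optimal values, no further search beyond the immediate successor states is needed.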
