Explain how temporal difference learning allows the Bellman equation to be solve ...

Viewed 77 times · 2023-4-26 16:54 | Personal category: Papers

The Bellman equation is a way to estimate the value of the current Markov state in reinforcement learning. It states that the value of a state equals the reward obtained at the immediate next state plus the temporally discounted value of all future states. Once a complete state-value function has been obtained, a greedy agent should be able to maximize the reward it receives.
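For reference, this relationship is conventionally written as the Bellman expectation equation. The notation below is a standard sketch rather than anything given in the original text: pi is the policy, P the transition probabilities, R the reward, and gamma the discount factor.

```latex
% Bellman expectation equation for the state-value function V^\pi
% (conventional notation, assumed rather than taken from the post).
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
             \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]
```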

The problem with using the Bellman equation to estimate the state-value function is that it is recursively dependent on the policy. For example, consider a grid-world scenario in which there is a state adjacent to the reward state. A model-free agent using an epsilon-greedy policy, having no knowledge of the environment yet, would have a 25% chance of receiving the reward, as it is equally likely to move in any of the four directions. Assuming the reward is 1, the value of that state would, according to the Bellman equation, be roughly 0.25 because of this distribution of possible actions. However, knowledge of the reward changes the agent's behaviour, so that it can now reach the reward state with probability (1 – epsilon), changing the value of that next-to-reward state in the process. Therefore, when the Bellman equation is used in practice, there is typically a cycle of improving the policy and then recalculating the state-value function. This alternation of policy evaluation and policy improvement is known as policy iteration, a form of dynamic programming.
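As a rough numerical sketch of the grid-world example above, one Bellman backup for the next-to-reward state could be written as follows. The neighbour values, the reward of 1 and the discount factor are assumptions made for illustration, not taken from any particular task.

```python
# Hypothetical one-step Bellman backup for the state next to the reward.
gamma = 0.9    # discount factor (assumed)
epsilon = 0.1  # exploration rate of the epsilon-greedy policy (assumed)

# Rewards and current value estimates of the four successor states; the goal
# is terminal, and the other three neighbours are assumed to be worth 0 so far.
rewards = {"goal": 1.0, "left": 0.0, "up": 0.0, "down": 0.0}
values  = {"goal": 0.0, "left": 0.0, "up": 0.0, "down": 0.0}

def backup(action_probs):
    """One Bellman backup: expected reward plus discounted successor value."""
    return sum(p * (rewards[s] + gamma * values[s])
               for s, p in action_probs.items())

# Uninformed agent: all four moves equally likely.
random_policy = {s: 0.25 for s in rewards}
print(backup(random_policy))   # 0.25

# Agent that has learned where the reward is: it moves to the goal with
# probability (1 - epsilon) and explores otherwise, raising the value.
greedy_policy = {"goal": 1 - epsilon,
                 "left": epsilon / 3, "up": epsilon / 3, "down": epsilon / 3}
print(backup(greedy_policy))   # 0.9
```

The jump from 0.25 to roughly (1 – epsilon) once the policy improves is exactly why evaluation and improvement have to alternate until they stop changing each other.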

Dynamic programming will eventually converge on the optimal policy and its corresponding state-value function, but the process can be simplified by temporal difference (TD) learning. TD learning refers to the procedure of computing a TD prediction error to update the state-value function, without explicitly calculating the state-value function through the Bellman equation. The TD prediction error is given by the reward received on the transition to the next state (under the current policy), plus the temporally discounted value of that next state, minus the current value estimate of the present state. This prediction error is multiplied by a learning-rate parameter and then simply added to the value estimate of the present state. When the policy does not change, this prediction-error computation leads to the same result as the value estimation of the Bellman equation. This is because the target of the value update in TD learning includes both the immediate reward term (which is part of the Bellman equation) and the discounted future-value term. Even though TD learning only considers the value of the immediate next state, over many updates the value of the next state comes to incorporate the value of the state after it, and so on. In practice, however, the value update will change the policy, and the estimate produced by TD learning only converges to the result of the Bellman equation once an optimal (greedy) policy has been found. By considering only the immediate next state, the value estimate of TD learning is technically always slightly off, but this comes with the benefit of avoiding a lengthy recalculation of the values of all future states after each update.
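A minimal tabular TD(0) sketch of the update described above might look like the following. The env.reset()/env.step() interface and the parameter values are assumptions for illustration, not something specified in the text.

```python
from collections import defaultdict

def td0(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0) value estimation under a fixed policy (illustrative)."""
    V = defaultdict(float)  # state-value estimates, initialised to 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD prediction error: reward plus the discounted value of the
            # immediate next state, minus the current estimate of this state.
            target = reward + (0.0 if done else gamma * V[next_state])
            td_error = target - V[state]
            # The error, scaled by the learning rate, is simply added on.
            V[state] += alpha * td_error
            state = next_state
    return V
```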

There is evidence suggesting that biological neural systems use a mechanism similar to TD learning. Specifically, they compute the TD prediction error using midbrain dopamine neurons. The midbrain dopamine neurons show activation only when there is an unexpected reward. When the reward is expected, the dopamine response instead becomes contingent on the unexpected occurrence of the predictor of that reward. This is congruent with the TD prediction error. The activation for unexpected reward corresponds to the immediate reward term in the TD prediction error computation, while the activation for the unexpected predictor of the reward corresponds to the next-state-value term in the computation.

Current evidence also suggests that TD learning is implemented in biological neural systems in the form of the actor-critic model. The actor-critic model is one of many ways TD learning can be implemented, and it may be the one used by biological neural systems because of its ease of computation. The actor-critic model maintains two representations: a state-value function (the critic) and a policy (the actor). The TD prediction error updates both the state-value function and the policy, but only the state-value function is used to compute the prediction error. The policy therefore does not directly depend on the state-value function, and there is no need for the system to derive a new policy from the updated state-value function after every cycle. The implementation of the actor-critic model in the nervous system is suggested by the separate neural mechanisms of Pavlovian conditioning and instrumental conditioning. It was found that a prediction error from instrumental conditioning leads to blood-oxygen-level-dependent (BOLD) responses in both the dorsal and ventral striatum, while the effect of a Pavlovian prediction error is limited to the ventral striatum. This is interpreted as evidence that the dorsal striatum, being more involved in motor execution, maintains the policy, while the ventral striatum maintains the state-value function. As predicted, a TD prediction error arising from the agent's own actions changes both the state-value function and the policy, whereas information about the world that is not contingent on the agent's actions is recorded only by the state-value function.
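A tabular actor-critic update along these lines could be sketched as follows, assuming integer-coded states and actions, array-based tables, and a preference-based actor; these choices are illustrative assumptions, not the specific model discussed above. The point is simply that one TD error, computed from the critic's values alone, drives both updates.

```python
import numpy as np

def actor_critic_step(V, prefs, s, a, r, s_next, done,
                      alpha_v=0.1, alpha_pi=0.1, gamma=0.9):
    """One actor-critic update after observing a single transition."""
    # TD prediction error, computed from the critic's values only.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha_v * td_error          # critic update (state-value function)
    prefs[s, a] += alpha_pi * td_error  # actor update (policy preferences)
    return td_error

def softmax_policy(prefs, s):
    """Action probabilities derived from the actor's preferences for state s."""
    z = np.exp(prefs[s] - prefs[s].max())
    return z / z.sum()
```

Because only V appears inside the TD error, the policy never has to be re-derived from the value function; the same error signal just nudges both tables, mirroring the division of labour attributed to the ventral and dorsal striatum.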

In conclusion, TD learning allows an approximate solution of the Bellman equation to be reached by considering only the immediate next state rather than all future states. The backing-up of temporally discounted state values allows the value of a state in TD learning to be indirectly influenced by all future states, letting it arrive at a result similar to that of the Bellman equation. The evidence that TD learning is implemented in neural systems comes from dopamine activity that resembles the TD prediction error, and from the division between the dorsal and ventral striatum that resembles the actor-critic architecture, one implementation of TD learning.

