Monte Carlo and Temporal Difference Methods in Reinforcement Learning
An interactive article introducing the Monte Carlo and Temporal Difference approaches on a grid world, and the differences between them.
I. Introduction
Learning through interaction with the environment is a natural form of intelligence observed in both humans and animals, who perceive the environment’s state, take actions based on their perception and understanding, and learn from the feedback they receive. Reinforcement learning (RL) [1] is a computational approach based on this learning mechanism. RL has recently shown impressive results in various domains, such as Go [2], StarCraft [3], and protein folding [4].
RL comprises four main elements:
- Policy: This is the probability distribution over actions in a given state; it determines the agent’s behavior.
- Reward: This is a signal given to an agent based on its actions in an environment. The reward reinforces certain behaviors, discourages others, and defines the goal of an RL problem.
- Value function: This is the expected total reward an agent will receive in the future, starting from a specific state. It indicates how good a state is in the long run.
- Environment model: This model makes predictions of the environment’s behavior, thereby enabling inference about how the environment will behave.
RL is a machine learning technique that allows an agent to learn decision-making by interacting with the environment through trial and error. In each step, the agent’s policy decides on an action based on the current state. The environment then responds with the next state and a reward. The episode terminates if the agent reaches the terminal state. To improve its decision-making, the agent’s policy can be updated based on the value function that estimates the value of a state or an action. However, accurately predicting the future rewards without visiting all the possible states can be challenging.
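To make this interaction loop concrete, here is a minimal Python sketch. The `env.reset()`/`env.step()` interface and the `random_policy` helper are hypothetical stand-ins for illustration, not the article’s actual environment.

```python
# Minimal sketch of the agent-environment loop (assumed Gym-style interface).
# `env` and `random_policy` are hypothetical stand-ins for illustration.
import random

def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of (state, action, reward) tuples."""
    trajectory = []
    state = env.reset()                               # environment provides the initial state
    for _ in range(max_steps):
        action = policy(state)                        # the policy chooses an action for the current state
        next_state, reward, done = env.step(action)   # environment responds with next state and reward
        trajectory.append((state, action, reward))
        if done:                                      # terminal state reached: the episode ends
            break
        state = next_state
    return trajectory

def random_policy(state, actions=("up", "down", "left", "right")):
    """A uniform random policy over the four grid actions (ignores the state)."""
    return random.choice(actions)
```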
RL algorithms are categorized into two types: model-free and model-based. Model-free algorithms estimate the value function from multiple trials and update the policy based on these estimates. In contrast, model-based algorithms plan using a model of the environment [1].
This article focuses on two fundamental model-free algorithms: Monte Carlo (MC) and Temporal Difference (TD). These algorithms can effectively solve various RL problems. Agents are trained and evaluated in a grid-based environment, where the agent starts from the upper left grid and aims to reach the lower right grid.
II. Markov Decision Process
RL is a widely used machine learning methodology that focuses on solving sequential decision-making problems, in which the current decisions affect the future outcomes. The Markov decision process (MDP) [11] is one of the most popular methods of formalizing RL, providing a useful framework for solving sequential decision-making problems.
The MDP is defined by states, actions, transition probabilities, and reward function. Understanding the MDP is crucial considering that the goal of RL is to find an optimal policy that maximizes the total rewards or evaluates the value of a set of actions given the MDP’s rules and constraints. The MDP extends the concept of a Markov process, a stochastic process that only comprises states and transition probabilities.
Markov Process
A Markov process is a stochastic process that models a system's evolution over time. The transition probability of a first-order Markov process is defined by conditional probabilities that consider only the most recent state. This characteristic is known as the Markov property: the future state of a system depends only on the current state, regardless of any earlier states.
$$P\left[s_{t+1} \mid s_t\right]= P\left[s_{t+1} \mid s_1, s_2, \ldots, s_t\right]$$
Figure 3 depicts an example [5] of a Markov process.
A child can be in five possible sleep states: \(s_0\) (lying), \(s_1\) (awake), \(s_2\) (drowsy), \(s_3\) (light sleep), and \(s_4\) (deep sleep). Each state lasts for a unit of time (e.g., 1 min) and then transitions to the next state according to certain probabilities. Assuming the Markov process always starts with the child in \(s_0\) (lying), the child remains in this state for 1 min. It then transitions to \(s_1\) (awake) with a 40% probability or to \(s_2\) (drowsy) with a 60% probability. This process continues until the child reaches \(s_4\) (deep sleep), which ends the Markov process.
A Markov process is composed of the following components:
- State space (\(S\)): This represents all possible states that a system can take. In our example, the states are \(s_0\) (lying), \(s_1\) (awake), \(s_2\) (drowsy), \(s_3\) (light sleep), and \(s_4\) (deep sleep), represented as \(S = \{s_0, s_1, s_2, s_3, s_4\}\).
- Transition probability (\(P_{ss'}\)): This is the probability of transitioning from state \(s\) to another state \(s'\) at step \(t\) represented as \(P_{ss'} = P(S_{t+1} = s' \mid S_{t} = s)\). The transition probability table is provided as follows:
\[ \begin{array}{c c} & \begin{array}{ccccc} s_0 & s_1 & s_2 & s_3 & s_4 \end{array} \\ P_{ss'} = \begin{array}{c} s_0 \\ s_1 \\ s_2 \\ s_3 \\ s_4 \end{array} & \left[ \begin{array}{ccccc} 0 & 0.4 & 0.6 & 0 & 0 \\ 0.1 & 0.9 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.7 & 0.3 \\ 0 & 0 & 0 & 0 & 1.0 \\ 0 & 0 & 0 & 0 & 1.0 \\ \end{array} \right] \end{array} \]
The transition probability from $$s_4$$ to $$s_4$$ is denoted as 1 in the table. However, no state transition occurs after reaching $$s_4$$ because it is a terminal state; thus, the self-edge on $$s_4$$ is not shown in Figure 3.
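As a quick illustration, the child-sleep Markov process can be simulated directly from this transition table. The state names and probabilities below are taken from the example; the sampling code itself is only an illustrative sketch.

```python
import random

# States and transition probabilities from the child-sleep example.
states = ["lying", "awake", "drowsy", "light sleep", "deep sleep"]  # s0..s4
P = [
    [0.0, 0.4, 0.6, 0.0, 0.0],  # s0: lying
    [0.1, 0.9, 0.0, 0.0, 0.0],  # s1: awake
    [0.0, 0.0, 0.0, 0.7, 0.3],  # s2: drowsy
    [0.0, 0.0, 0.0, 0.0, 1.0],  # s3: light sleep
    [0.0, 0.0, 0.0, 0.0, 1.0],  # s4: deep sleep (terminal)
]

def sample_trajectory(start=0, terminal=4):
    """Sample one trajectory of the Markov process until the terminal state."""
    s, path = start, [start]
    while s != terminal:
        s = random.choices(range(len(states)), weights=P[s])[0]  # next state ~ P[s]
        path.append(s)
    return [states[i] for i in path]

print(sample_trajectory())  # e.g. ['lying', 'drowsy', 'light sleep', 'deep sleep']
```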
Markov Decision Process
The MDP extends a Markov process by incorporating decision-making. While a Markov process models a system in which the future state is determined only by the current state, the MDP introduces actions taken by an agent, which can influence the next system state. This enables the MDP to better model real-world scenarios.
In the MDP, the agent chooses an action based on its current state and the reward associated with the action at each time step. The subsequent state is then established depending on the transition probability. The agent aims to learn a policy that maps states to actions maximizing the expected cumulative reward over time. The agent can efficiently navigate the system and achieve its objectives by making decisions based on this policy. Figure 5 provides an example of the MDP.
The MDP comprises the following components:
- State Space (\(S\)): This represents all the possible states that a system can take. In our example, the states are \(s_0\) (lying), \(s_1\) (awake), \(s_2\) (drowsy), \(s_3\) (light sleep), and \(s_4\) (deep sleep), represented as \(S = \{s_0, s_1, s_2, s_3, s_4\}\).
- Action Space (\(A\)): This denotes all the possible actions that can be taken in a state \(s\); in our example, singing a lullaby and playing with the child, represented as \(A = \{a_0, a_1\}\).
- Transition Probability (\(P_{ss'}^a\)): This is the probability of transitioning from state \(s\) to another state \(s'\) after taking action \(a\) at step \(t\). This is represented by \(P_{ss'}^a = P(S_{t+1} = s' \mid S_{t} = s, A_{t}=a)\).
- Reward function (\(R\)): This evaluates the desirability of different actions by providing a scalar value, the reward (\(R_t\)), at a given step \(t\).
- Discount Factor (\(\gamma\)): This is a value between 0 and 1 that determines the importance of immediate rewards versus long-term ones.
The MDP differs from a Markov process in several respects, most notably in that a reward defines a goal. In the context of helping a child sleep, the reward function can be set to encourage good sleeping patterns. Alternatively, if the objective is to play with the child, the MDP may reward the state of being awake. Given the MDP, the task is to determine the optimal action to take in each state to maximize the total reward.
For instance, \(a_1\) (playing with the child) in \(s_0\) (lying) immediately transitions to state \(s_1\) (awake) and receives a positive reward because the child wants to play. In the long term, however, repeatedly taking \(a_0\) (singing a lullaby) to put the child to sleep yields the highest total sum of rewards. Interestingly, taking \(a_1\) (playing with the child) while the child is in \(s_2\) (drowsy) may lead to a transition to the awake or lying state, depending on the given probabilities. This transition probability helps model problems in a realistic manner.
Maximizing the sum of rewards $$R_t$$ in the MDP requires evaluating the value of each state and determining the action to take in each state $$s$$. The state value is the expected sum of the discounted future rewards obtained from that state onward. The discount factor $$\gamma$$ ensures that immediate rewards are valued more highly than delayed ones. The sum of discounted future rewards is called the return $$G_t$$.
However, because the return varies from episode to episode due to probabilistic factors such as state transitions, the expected return must be used. The function that outputs the expected return for each state is called the value function (\(V_\pi\)). The value function is a critical concept in RL because it enables an agent to evaluate different states and determine the optimal policy for maximizing rewards. \[V(s) = \mathbb{E}\left[G_t \mid S_t=s\right]=\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t=s\right], \text { for all } s \in \mathcal{S}\]
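For concreteness, the short sketch below computes a discounted return $$G_t$$ from a recorded reward sequence; the reward values and discount factor are arbitrary illustrative numbers.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for reward in reversed(rewards):   # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = reward + gamma * g
    return g

# Example: three steps of -1 followed by a goal reward of +1 (illustrative values).
print(discounted_return([-1, -1, -1, 1], gamma=0.9))  # -1 - 0.9 - 0.81 + 0.729 = -1.981
```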
The transition probability and the reward function are unknown in a model-free environment; therefore, they must be estimated from experience. The two primary methods for the value function estimation in a model-free environment are the MC and TD methods.
SharkGame Environment
Viewing SharkGame as an MDP, the states represent the shark character’s position, and the actions are moving up, down, left, and right. State transitions are deterministic: an action always results in a specific next state. If the agent is at the top-left corner and chooses to move right, it always transitions to the state on the right, with no chance of moving to the state above, below, or to the left. The reward function provides negative feedback for each step and for each obstacle encountered (e.g., bombs and nets). This encourages the agent to find the shortest path to the goal while avoiding the obstacles.
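The live SharkGame environment is interactive; the class below is only a hypothetical, simplified stand-in that mirrors the MDP description above (deterministic moves, a per-step penalty, extra penalties on obstacle cells, and a terminal goal). The grid size, obstacle positions, and reward values are assumptions.

```python
class GridWorld:
    """A simplified SharkGame-like grid MDP (hypothetical layout and reward values)."""

    def __init__(self, size=5, obstacles=((1, 3), (3, 1)),
                 step_reward=-1.0, obstacle_reward=-5.0, goal_reward=10.0):
        self.size = size
        self.obstacles = set(obstacles)
        self.goal = (size - 1, size - 1)          # lower-right cell
        self.step_reward = step_reward
        self.obstacle_reward = obstacle_reward
        self.goal_reward = goal_reward

    def reset(self):
        self.state = (0, 0)                       # upper-left cell
        return self.state

    def step(self, action):
        """Deterministic transition: the chosen move always succeeds (walls keep the agent in place)."""
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        dr, dc = moves[action]
        r = min(max(self.state[0] + dr, 0), self.size - 1)
        c = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (r, c)
        if self.state == self.goal:
            return self.state, self.goal_reward, True
        reward = self.step_reward
        if self.state in self.obstacles:
            reward += self.obstacle_reward        # extra penalty for bombs/nets
        return self.state, reward, False
```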
The value function of an environment with a small, discrete state space (e.g., SharkGame) can be represented with the table method, a technique for storing and updating the value of each state or state–action pair in tabular form. The learned policy selects the action leading to the state with the highest value among the options available in the table. In all subsequent examples, the random agent shown below is used to compute the true values and to approximate them.
Assuming complete knowledge of the MDP, the true values that the MC and TD methods approximate can be calculated exactly. The value of each state is given by $$ \begin{aligned} V(s)& =\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots\right] \\[5pt] & =\mathbb{E}\left[R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\cdots\right)\right] \\[5pt] & =\mathbb{E}\left[R_{t+1}+\gamma V\left(s_{t+1}\right)\right] \\[1pt] & =\sum_{a \in A} \pi(a \mid s)\left(R_{s}^{a}+\gamma \sum_{s^{\prime} \in S} P_{s s^{\prime}}^a V\left(s^{\prime}\right)\right) \end{aligned} $$ where $$R_{s}^{a}$$ is the expected reward for taking action $$a$$ in state $$s$$. The value for each state in the table is obtained by repeatedly iterating this equation.
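Under full knowledge of such an MDP, this equation can be iterated until the table stops changing (iterative policy evaluation). The sketch below does this for a small deterministic grid under the uniform random policy; the grid size, step reward of -1, and convergence threshold are illustrative assumptions.

```python
import itertools

SIZE, GAMMA = 4, 0.95                      # illustrative grid size and discount factor
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (SIZE - 1, SIZE - 1)

def next_state(s, a):
    """Deterministic transition: move, clipped to the grid boundary."""
    r = min(max(s[0] + ACTIONS[a][0], 0), SIZE - 1)
    c = min(max(s[1] + ACTIONS[a][1], 0), SIZE - 1)
    return (r, c)

# Tabular value function, initialized to zero for every state.
V = {s: 0.0 for s in itertools.product(range(SIZE), repeat=2)}

# Iterate the Bellman expectation equation for the uniform random policy.
for _ in range(10_000):
    delta = 0.0
    for s in V:
        if s == GOAL:                      # terminal state keeps value 0
            continue
        new_v = 0.0
        for a in ACTIONS:                  # pi(a|s) = 1/4 for every action
            s2 = next_state(s, a)
            new_v += 0.25 * (-1.0 + GAMMA * V[s2])
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v                       # in-place update of the value table
    if delta < 1e-6:                       # stop once the table stops changing
        break

print(V[(0, 0)])                           # value of the start state under the random policy
```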
In the SharkGame environment, where the agent receives negative rewards at each step, the gain obtained by reaching the destination must be prioritized. This requires a high discount factor (close to 1) so that the agent emphasizes long-term rewards. For example, in a 6 x 6 grid, where reaching the destination requires many steps, a discount factor of 0.999 or higher is necessary to guide the agent toward the destination; in a 5 x 5 grid, a discount factor of 0.995 or higher is sufficient. If the gain from reaching the destination fails to propagate back to the starting position, the agent stays near its initial location. The random policy does not avoid traps, which creates a risk of entrapment. This risk is factored into the above formula: it affects the true values and spreads the traps’ negative rewards across nearby cells. Consequently, agents might accidentally encounter less dangerous traps while avoiding more harmful ones.
III. Monte Carlo approach
The MC method estimates the values by simply averaging the sample returns. This technique relies on simulating multiple episodes of an agent’s interaction with the environment. In each episode, the agent takes actions and receives rewards, and the sequence of states, actions, and rewards is recorded. The final outcome of each episode is used to update the state values. The MC approach estimates a state’s value as the average of the returns received in all episodes in which that state has been visited. The return is the sum of the rewards received after visiting that state until the end of the episode. The estimated state value becomes more accurate as the number of episodes increases.
After each episode, the sample return $$G_t$$ is computed for each step $$t$$.
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_{T}$$
For each state visited in the episode, the visit counter $$N(s_t)$$, which records the number of visits to state $$s_t$$, is incremented by 1. The sample return $$G_t$$ is added to the total return $$S(s_t)$$. Finally, dividing $$S(s_t)$$ by $$N(s_t)$$ gives the state value $$V(s_t)$$:
\[N(s_t) = N(s_t) + 1\] \[S(s_t) = S(s_t) + G_t\] \[V(s_t) = S(s_t) / N(s_t)\]
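The counting scheme above translates directly into code. The sketch below performs an every-visit MC update from a single recorded episode; the episode is a hypothetical list of (state, reward) pairs.

```python
from collections import defaultdict

# Running statistics for every-visit Monte Carlo estimation.
N = defaultdict(int)      # N(s): number of visits
S = defaultdict(float)    # S(s): total return accumulated at s
V = defaultdict(float)    # V(s): current value estimate

def mc_update(episode, gamma=0.9):
    """Update V from one episode given as (state, reward) pairs,
    where reward is the reward received after leaving that state."""
    g = 0.0
    # Walk backwards so that g always equals the return G_t from that step onward.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        N[state] += 1
        S[state] += g
        V[state] = S[state] / N[state]

# Hypothetical episode: three steps with -1 rewards, then +1 on reaching the goal.
mc_update([("s0", -1.0), ("s1", -1.0), ("s2", -1.0), ("s3", 1.0)])
print(dict(V))
```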
Set values of reward sequence, incremental counter, total return, and discount factor in the interactive formula below, and observe how the new value is calculated.
However, the MC update rule can also be written in an incremental form that avoids storing and reprocessing all returns each time a state is visited. The idea is to keep a running average and update it whenever a new return becomes available, instead of summing all the returns and dividing by the total number of visits each time.
Given the previous estimate $$V(s_t)$$, the new sample return $$G_{N(s_t)}$$, and the number of visits $$N(s_t)$$ to the state (including the current visit), the estimate is updated as follows:
$$ \begin{aligned} V_{new}(s_t) &= S(s_t) / N(s_t) \\ &= \frac{1}{N(s_t)} \sum_{i=1}^{N(s_t)} G_{i} \\ &= \frac{1}{N(s_t)} \left(G_{N(s_t)} + \sum_{i=1}^{N(s_t)-1} G_{i}\right) \\ &= \frac{1}{N(s_t)} \left(G_{N(s_t)} + (N(s_t) - 1) \frac{1}{N(s_t) - 1} \sum_{i=1}^{N(s_t)-1} G_{i}\right) \\ &= \frac{1}{N(s_t)} \left(G_{N(s_t)} + (N(s_t) - 1) V(s_t)\right) \\ &= \frac{1}{N(s_t)} \left(G_{N(s_t)} + N(s_t)V(s_t) - V(s_t)\right) \\ &= V(s_t) + \frac{1}{N(s_t)}\left(G_{N(s_t)} - V(s_t)\right) \\ \end{aligned} $$
The term $$\frac{1}{N(s_t)}(G_{N(s_t)} - V(s_t))$$ represents the difference between the new return $$G_{N(s_t)}$$ and the current estimate $$V(s_t)$$, scaled by the inverse of the number of visits. This difference, also known as the error, quantifies how far the new return deviates from the current estimate. The estimate is then adjusted in the error’s direction: an excessively high estimate is decreased, and an excessively low one is increased. The more frequently a state is visited, the less its estimate is influenced by new returns.
A practical and useful modification that can be made to the MC update formula is to replace the visit count inverse with a step-size parameter $$\alpha$$. This change offers additional control over the learning process.
\[V_{new}(s_t) = V(s_t) + \alpha (G_{t} - V(s_t))\]
The step-size parameter $$\alpha$$ is crucial because it determines the extent to which our estimates are updated based on new information. At an $$\alpha$$ close to 0, updates are small, and learning is slow. Conversely, at an $$\alpha$$ close to 1, updates are considerable, and learning is fast; however, it may lead to the overshooting of the true values. Typically, $$\alpha$$ is chosen as a small positive number for stability and gradual learning.
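With the step-size form, the update becomes a single line. The sketch below applies it to a sequence of noisy sample returns (illustrative values) to show how small and large values of $$\alpha$$ behave.

```python
def mc_constant_alpha_update(v, g, alpha):
    """Constant step-size MC update: move V(s_t) toward the observed return G_t."""
    return v + alpha * (g - v)

# Track the same state over several noisy sample returns (illustrative values).
returns = [-1.8, -2.2, -1.9, -2.1, -2.0]
for alpha in (0.1, 0.9):
    v = 0.0
    for g in returns:
        v = mc_constant_alpha_update(v, g, alpha)
    print(f"alpha={alpha}: V={v:.3f}")   # small alpha moves slowly; large alpha tracks the latest returns
```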
The MC approach has a few disadvantages. One is that it requires complete episodes of an interaction with the environment before it can update the state values. This can be time-consuming and computationally expensive, especially for problems with long or infinite episodes. The MC method can also suffer from a high variance in the value estimates, particularly when the episodes are short or few. This variance can make it difficult for the agent to learn the true state values and lead to a slower convergence. The TD approach addresses the abovementioned issues, making value learning more stable and efficient.
IV. Temporal Difference learning
The TD approach is another method for estimating the value function in RL. Unlike the MC method that requires waiting until the end of an episode to update the value function, the TD method updates the value function at every transition. The MC method requires the complete return of an episode to update the value function; however, this return is not known until the end of that episode. Using the TD method avoids the waiting time because it updates the value function before the end of an episode.
The TD method achieves this through bootstrapping, that is, using its own value estimates to update the value function. Specifically, the TD approach updates the current estimate $$V(s_t)$$ toward the target $$R_{t+1} + \gamma V(s_{t+1})$$, which incorporates the reward actually sampled by the agent. This allows the value function estimate to be updated incrementally, in real time, without waiting for the end of an episode.
\[V\left(s_t\right)=\mathbb{E}\left[R_{t+1}+\gamma V\left(s_{t+1}\right)\right]\]
Similar to the MC method, the TD method also uses a step-size parameter ($$\alpha$$) to guide the learning process.
The TD update equation is presented as
\[V_{new}(s_t) = V(s_t) + \alpha (R_{t+1} + \gamma V(s_{t+1}) - V(s_t))\]
This equation highlights the essence of TD learning: adjusting the estimated state value based on the observed reward and the estimated value of the next state, both of which are immediately available after a transition.
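Expressed in code, TD(0) is a one-step update applied after every transition. The sketch below applies it along a short hypothetical trajectory; the states, rewards, and parameters are illustrative.

```python
from collections import defaultdict

V = defaultdict(float)   # tabular value estimates, initialized to zero

def td0_update(s, r, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) update: move V(s) toward the target r + gamma * V(s_next)."""
    target = r if done else r + gamma * V[s_next]   # no bootstrap from a terminal state
    V[s] += alpha * (target - V[s])

# Hypothetical transitions (state, reward, next_state, done), applied as they occur.
for s, r, s_next, done in [("s0", -1.0, "s1", False),
                           ("s1", -1.0, "s2", False),
                           ("s2", +1.0, "goal", True)]:
    td0_update(s, r, s_next, done)
print(dict(V))
```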
Set current state value, next state value, reward, and step-size parameter in the interactive formula below, and observe how the new value is calculated.
Compared to the MC method, the TD method is more computationally efficient because of its frequent updates. However, it is biased and may not converge to the true value function because it updates toward an estimate $$V(s)$$ rather than the true values. Despite these limitations, the TD method is practically effective and has been successfully applied to a wide range of RL problems. The TD method generally converges faster than the MC method, although this was not the case for the examples presented in this article due to the random training policy.
V. Difference between MC and TD
Both the MC and TD approaches are used in RL to estimate the value function, differing only in how they update this function. The MC method estimates values by averaging the sample returns obtained from multiple episodes, while the TD method uses the Bellman equation to compute target values from the estimated values. In other words, the MC method can update the values only at the end of each episode, while the TD method can update them after each time step. The TD approach can swiftly pinpoint meaningful interactions because of its incremental update scheme, which allows the changes resulting from previous interactions to directly influence the ongoing episode. This mechanism contributes to the faster convergence and overall efficiency of the TD method compared to the MC approach. In summary, the MC method is more computationally expensive and time consuming, while the TD method is more efficient and converges faster.
While the MC and TD methods estimate values differently, both asymptotically converge to the true values. The optimal agent’s behavior is depicted in the figure below. The true values were obtained by considering all the possible future states and actions using an environment model. The optimal agent selects the best action based on these values, thereby resulting in the highest possible reward. While the optimal agent considers all the possibilities, the MC and TD agents only consider a subset of possible future states and actions.
The bias and variance of the two algorithms can be considered from a machine learning perspective. MC learning has low bias but high variance: the value function is updated only with actual sample returns, so the estimate is unbiased, but returns computed over entire episodes can vary widely. By contrast, TD learning has low variance but some bias: each update is based on a single time step, which lowers the variance, but the target itself is an estimated value, which introduces bias.
In practice, the choice between the MC and TD methods depends on a specific problem and available resources. Understanding the differences between both approaches enables researchers and practitioners to choose the most appropriate approach for a particular application.
Table I summarizes the differences between the MC and TD methods.
Aspect | MC | TD |
---|---|---|
Update Timing | Episodic | Online |
Learning from Partial Data | No (Requires complete episodes) | Yes (Can learn from incomplete episodes) |
Bootstrapping | No | Yes |
Bias | Unbiased | Biased (due to bootstrapping) |
Variance | High | Lower (due to bootstrapping) |
Convergence | Slower | Faster |
Computation | More computationally expensive | Less computationally expensive |
VI. n-step TD
The previous sections introduced the MC and TD methods, two principal approaches to estimating the value function. Each has distinct advantages and disadvantages: the TD method excels in computational efficiency but introduces bias into its estimates, whereas the MC method provides unbiased estimates at the cost of high variance.
This difference raises a thought-provoking question: Can the strengths of these methods be combined while their weaknesses are mitigated? Can we find a compromise, a “best-of-both-worlds” approach that effectively balances the bias–variance trade-off? This idea motivates a hybrid strategy that integrates the TD method’s efficiency with the MC method’s unbiased nature. To see how, consider how the two methods perform their computations. The MC method computes the sample return from the full trajectory of an episode:
\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_{T}\]
where $$T$$ represents the episode length.
By contrast, the TD method leverages only a single information step for the value updates:
\[G_t = R_{t+1} + \gamma V(S_{t+1})\]
Therefore, a combined method could use multiple steps for value updates without requiring the entire trajectory. For example, the three-step return could be used as the target for the TD update.
\[G_{t:t+3} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3})\]
In this case, $$\gamma^3 V(S_{t+3})$$ is used as a proxy for $$\gamma^{3} R_{t+4} + \dots + \gamma^{T-t-1} R_{T}$$. This is the concept of n-step TD learning [9][10], which offers a promising balance between bias and variance and ultimately improves learning efficiency in RL scenarios. Generalizing this yields the following n-step return for arbitrary $$n$$:
\[G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]
The n-step TD update equation is as follows:
\[V_{new}(s_t) = V(s_t) + \alpha (G_{t:t+n} - V(s_t))\]
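The n-step target can be computed from a window of n rewards plus the bootstrapped value at the end of the window. The sketch below implements the n-step return and update with hypothetical values.

```python
def n_step_return(rewards, v_bootstrap, gamma=0.9):
    """G_{t:t+n} = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n})."""
    g = 0.0
    for k, r in enumerate(rewards):          # rewards = [R_{t+1}, ..., R_{t+n}]
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * v_bootstrap

def n_step_td_update(v_s, g_n, alpha=0.1):
    """V_new(s_t) = V(s_t) + alpha * (G_{t:t+n} - V(s_t))."""
    return v_s + alpha * (g_n - v_s)

# Example: 3-step return with rewards -1, -1, -1 and V(S_{t+3}) = -2 (illustrative values).
g3 = n_step_return([-1.0, -1.0, -1.0], v_bootstrap=-2.0)
print(g3)                                    # -1 - 0.9 - 0.81 + 0.729*(-2) = -4.168
print(n_step_td_update(v_s=-3.0, g_n=g3))    # nudges V toward the 3-step target
```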
The RMSE of the MC, TD, and n-step TD agents can be examined in the figure below. The observed data show that the performance of the n-step TD method sits between that of the MC and TD methods. In other words, the n-step TD method bridges the gap between the MC and TD methods, achieving a more effective balance between variance and bias. It combines the advantages of both methodologies, serving as a compelling compromise that mitigates the disadvantages of each.
Optimal behavior can vary depending on the MDP. In the figure below, you can create your own MDP and check how the optimal, MC, TD, and n-step TD agents behave. The value table for the optimal agent is calculated based on the optimal policy.
VII. Conclusion
This article presented an overview of two common approaches in RL: the MC and TD methods. While these methods have their respective advantages and disadvantages, RL encompasses many other approaches beyond these two. For example, model-based [6] and policy gradient [7] methods are also commonly used and may better suit certain problems or offer particular advantages in different contexts.
The recent advancements in RL have demonstrated the potential of combining RL with deep learning to achieve even better results [8]. From game playing and robotics to natural language processing and recommendation systems, these hybrid approaches have shown great promise in various applications. More innovations and breakthroughs in RL will likely emerge with its continuous evolution. Witnessing how these new approaches and techniques can be applied to real-world challenges will be exciting.
References
- Reinforcement learning: An introduction Sutton, R. S., & Barto, A. G. 2018. MIT press.
- Mastering the game of go without human knowledge Silver, David, et al. 2017. Nature 550.7676: 354-359. DOI:10.1038/nature24270
- Grandmaster level in StarCraft II using multi-agent reinforcement learning Vinyals, Oriol, et al. 2019. Nature 575.7782: 350-354. DOI:10.1038/s41586-019-1724-z
- Highly accurate protein structure prediction with AlphaFold Jumper, John, et al. 2021. Nature 596.7873: 583-589. DOI:10.1038/s41586-021-03819-2
- RL from Basics S. Rho, Badak Buteo Baeunenun Ganghwa Hakseup [RL from Basics] (in Korean). Seoul, South Korea: Youngjin.com, pp. 32-36, 2020.
- Model-based reinforcement learning: A survey Moerland, T. M., Broekens, J., Plaat, A., & Jonker, C. M. 2023. Foundations and Trends in Machine Learning, 16(1), 1-118. DOI:10.1561/2200000086
- A natural policy gradient Kakade, S. M. 2001. Advances in neural information processing systems, 14.
- Human-level control through deep reinforcement learning Mnih, Volodymyr, et al. 2015. Nature 518.7540: 529-533. DOI:10.1038/nature14236
- TD models: Modeling the world at a mixture of time scales Sutton, Richard S. 1995. Machine Learning Proceedings 531-539. DOI:10.1016/B978-1-55860-377-6.50072-4
- TD(λ) networks: temporal-difference networks with eligibility traces Tanner, Brian, and Richard S. Sutton. 2005. Proceedings of the 22nd international conference on Machine learning. DOI:10.1145/1102351.1102463
- Markov decision processes Puterman, Martin L. 1990. Handbooks in operations research and management science 2: 331-434. DOI:10.1016/S0927-0507(05)80172-0