Monte Carlo (MC) methods do not require a complete model of the environment in order to find optimal behavior. The term “Monte Carlo” is used broadly for any estimation method that involves a significant random component; in our case, MC methods rely only on experience: sampled sequences of states, actions, and rewards from interaction with the environment. We divide this experience into episodes so that the sequences have well-defined beginnings and ends. We can use the same idea as in the last blog to evaluate a policy, but the difference is key: whereas last time we computed value functions from knowledge of the MDP, this time we learn value functions from sample returns obtained by interacting with the MDP.
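
To make this concrete, here is a minimal sketch of first-visit Monte Carlo policy evaluation in Python. It is not a full implementation: it assumes a hypothetical `generate_episode(policy)` helper that runs one episode in the environment and returns a list of `(state, reward)` pairs. The value of each state is then estimated as the average of the returns observed after its first visit in each episode.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, gamma=1.0, num_episodes=10_000):
    """Estimate the state-value function of `policy` by averaging sample returns.

    `generate_episode(policy)` is an assumed helper that runs one episode and
    returns a list of (state, reward) pairs, one per time step.
    """
    returns = defaultdict(list)   # state -> returns observed after its first visits
    V = defaultdict(float)        # state -> current value estimate

    for _ in range(num_episodes):
        episode = generate_episode(policy)

        # Index of the first visit to each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk backwards through the episode, accumulating the discounted return G
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:   # first-visit MC: count each state once per episode
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```

Note that nothing in this sketch touches transition probabilities or expected rewards; all it needs is the ability to sample episodes, which is exactly what distinguishes it from the model-based evaluation in the last blog.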