Allow setting the Markov state (augmented_state and other variables as needed) of the env - use this in render() to allow imaginary rollouts from a custom starting state there.
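A minimal usage sketch of what this enables (a sketch only: it assumes the default constructor config of RLToyEnv is usable as-is and that rendering is enabled; only get_augmented_state(), set_augmented_state() and render() are the pieces this PR concerns):

```python
from mdp_playground.envs.rl_toy_env import RLToyEnv

env = RLToyEnv()  # assumption: default config is sufficient here; real experiments pass their own config
env.reset()

snapshot = env.get_augmented_state()   # Markov state of the MDP as a dict

# ... take some real steps in the env, or hand the snapshot to a planner ...

env.set_augmented_state(snapshot)      # jump (back) to a custom / saved Markov state
env.render()                           # with rendering enabled, imaginary rollouts can now start from that state
```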
mdp_playground/envs/rl_toy_env.py
Lines changed: 78 additions & 9 deletions
@@ -202,7 +202,9 @@ class RLToyEnv(gym.Env):
     R(state, action)
         defined as a lambda function in the call to init_reward_function() and is equivalent to calling reward_function()
     get_augmented_state()
-        gets underlying Markovian state of the MDP
+        gets underlying Markovian state of the MDP as a dictionary
+    set_augmented_state(augmented_state_dict)
+        sets underlying Markovian state of the MDP, by default, using a dictionary in the same format as returned by get_augmented_state()
     reset()
         Resets environment state
     seed()
@@ -1575,8 +1577,14 @@ def get_rews(rng, r_dict):
     def transition_function(self, state, action):
         """The transition function, P.
 
-        Performs a transition according to the initialised P for discrete environments (with dynamics independent for relevant vs irrelevant dimension sub-spaces). For continuous environments, we have a fixed available option for the dynamics (which is the same for relevant or irrelevant dimensions):
-        The order of the system decides the dynamics. For an nth order system, the nth order derivative of the state is set to the action value / inertia for time_unit seconds. And then the dynamics are integrated over the time_unit to obtain the next state.
+        Performs a transition according to the initialised P for discrete environments (the independent dynamics for the irrelevant
+        dimension sub-spaces are handled in step() for discrete envs while they are handled here for grid envs).
+        For continuous environments, we have a fixed available option for the dynamics (which is the same for relevant
+        or irrelevant dimensions):
+        The order of the system decides the dynamics. For an nth order system, the nth order derivative of the state is set to the
+        action value / inertia for time_unit seconds. And then the dynamics are integrated over the time_unit to obtain the next state.
+
+        ###TODO Make this function use Markov state also for continuous envs.
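As a rough illustration of the integration scheme the new docstring text describes (a standalone sketch using simple Euler-style updates, not the code inside transition_function(); the names inertia, time_unit and the list of state derivatives mirror the docstring):

```python
import numpy as np

def integrate_nth_order(state_derivatives, action, inertia, time_unit):
    """Sketch: set the highest (nth) derivative to action / inertia, then
    integrate each lower-order derivative over time_unit (Euler-style)."""
    derivs = [np.asarray(d, dtype=float).copy() for d in state_derivatives]
    derivs[-1] = np.asarray(action, dtype=float) / inertia  # nth order derivative held fixed for time_unit
    for i in range(len(derivs) - 2, -1, -1):                # integrate down to the 0th order state
        derivs[i] = derivs[i] + derivs[i + 1] * time_unit
    return derivs

# e.g., a 2nd-order system: the action acts as an acceleration,
# so velocity and then position are updated over time_unit.
state, velocity, acceleration = [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]
new_derivs = integrate_nth_order([state, velocity, acceleration],
                                 action=[0.5, -0.5], inertia=1.0, time_unit=0.1)
```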
@@ ... @@
             The action that the environment will use to perform a transition.
         imaginary_rollout: boolean
-            Option for the user to perform "imaginary" transitions, e.g., for model-based RL. If set to true, underlying augmented state of the MDP is not changed and user is responsible to maintain and provide a list of states to this function to be able to perform a rollout.
+            Unsupported at the moment. Option for the user to perform "imaginary" transitions, e.g., for model-based RL. If set to true, underlying augmented state of the MDP is not changed and user is responsible to maintain and provide a list of states to this function to be able to perform a rollout.
"""Intended to return the full augmented state which would be Markovian. (However, it's not Markovian wrt the noise in P and R because we're not returning the underlying RNG.) Currently, returns the augmented state which is the sequence of length "delay + sequence_length + 1" of past states for both discrete and continuous environments. Additonally, the current state derivatives are also returned for continuous environments.
2128
+
"""Intended to return the full augmented state which would be Markovian. (However, it's not Markovian wrt the noise in P and R
2129
+
because we're not returning the underlying RNG.)
2121
2130
2122
-
Returns
2131
+
Returns a dictionary with the following keys:
2123
2132
-------
2124
-
dict
2125
-
Contains at the end of the current transition
2133
+
2134
+
augmented_state contains the sequence / list of past states of length "delay + sequence_length + 1". Each element in this list contains
2135
+
only the relevant parts for discrete envs, continuous (only 0th order info, i.e., position) and grid envs iirc.
2136
+
state_derivatives contains the list of state derivatives - only present for continuous envs.
2137
+
curr_state contains the relevant and irrelevant parts (if any) for discrete, continuous and grid envs.
2138
+
curr_obs contains the same unless image_representations is True, in which case it contains the image representation of curr_state.
2139
+
2140
+
Remark: relevant_indices for cont. envs can be figured out using curr_state and augmented_state. So, all the info needed to make the
2141
+
state Markov is present in the returned dict (except for the RNG state if P and R are noisy). This could be improved but this is the
2142
+
current implementation.
2126
2143
2127
2144
"""
2128
2145
# #TODO For noisy processes, this would need the noise distribution and random seed too. Also add the irrelevant state parts, etc.? We don't need the irrelevant parts for the state to be Markovian.
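To make the documented return format concrete, a hypothetical round-trip using the keys named in the docstring above (the key names come from that text; the default-construction assumption is the same as in the earlier sketch):

```python
from mdp_playground.envs.rl_toy_env import RLToyEnv

env = RLToyEnv()                     # assumption: default config works as-is
env.reset()
aug = env.get_augmented_state()

print(aug["curr_state"])             # current state: relevant (+ irrelevant) parts
print(aug["curr_obs"])               # same, or the image representation if image_representations is True
print(aug["augmented_state"])        # sequence of the last delay + sequence_length + 1 states
print(aug.get("state_derivatives"))  # only present for continuous envs, hence .get()

env.set_augmented_state(aug)         # set_augmented_state() accepts this same dict format
```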
"""Resets the environment for the beginning of an episode and samples a start state from rho_0. For discrete environments uses the defined rho_0 directly. For continuous environments, samples a state and resamples until a non-terminal state is sampled.