develop in conda, since ray with venv is broken (why?) and there is no openspiel
AUR package
We actually have 3 components: Sense, Estimate, Action,
where Sense and Action are the acting agents
use the same estimation as the trout bot but apply vanilla RL on it
use the same estimation as the trout bot but apply IRL on it
Links:
papers on drl and imperfect information
https://www.reddit.com/r/reinforcementlearning/comments/hdzmv3/help_with_reinforcement_learning_in_reconchess/
https://arxiv.org/abs/2006.10410
https://arxiv.org/abs/1603.01121
Multi-Task MCEIRL
http://export.arxiv.org/pdf/1805.08882 (sense/move are two tasks?)
Generalized MCEIRL (with the reward function not being a linear mapping, I think)
How to break it up?
Have default estimator in trout: trout_sensor
Have default planner in trout: stockfish
1. Use RL to replace stockfish but keep estimator
Use MCEIRL to replace stockfish
2. Use RL to replace estimator but keep stockfish planner
Have a good default planner but try to improve the information stockfish
receives
3. Use RL for both estimator and planner
Do this after we can get at least a better estimator or planner
Control flow with inputs:
perceived board state:
For each piece, a probability attached to each square plus one for not being on the board
(8x8 + 1 probabilities which sum to 1)
the perceived board state is updated with 1) the estimator, 2) your own moves,
3) the opponent's move if it captures one of your pieces
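A minimal sketch of the perceived board state described above, assuming one probability row per tracked piece over 64 squares plus an off-board slot (the piece count, uniform initialization, and helper name are illustrative assumptions, not the actual implementation):
```python
import numpy as np

NUM_PIECES = 16          # opponent pieces being tracked (assumed)
NUM_SLOTS = 8 * 8 + 1    # 64 squares + 1 "captured / off-board" slot

# belief[p, s] = probability that piece p occupies slot s; each row sums to 1
belief = np.full((NUM_PIECES, NUM_SLOTS), 1.0 / NUM_SLOTS)

def update_from_sense(belief, piece, ruled_out_squares, seen_square=None):
    """Fold one sense result into a single piece's distribution."""
    if seen_square is not None:
        # the piece was observed: all probability mass goes to that square
        belief[piece, :] = 0.0
        belief[piece, seen_square] = 1.0
    else:
        # the piece was not in the sensed window: rule those squares out
        belief[piece, list(ruled_out_squares)] = 0.0
        belief[piece] /= belief[piece].sum()
    return belief
```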
Estimator:
input: previous perceived board state ?
output: square to sense [space: 6x6]
(to decrease the space we can limit it to the inner grid
(6x6), since we sense a 3x3 window the edges/corners will still be covered even with the sense
coordinate one square inside; a tiny index-to-square mapping is sketched after this block)
if you want the estimator to be biased toward looking where the planner wants to
move, then the planner should query a move first and feed that as input
to the estimator. The estimator then picks a square to sense, and that information is
then fed into the planner to actually make the move
loss function input: ?
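A tiny sketch of the reduced sense encoding mentioned above, assuming a flat index over the inner 6x6 grid of sense centers (purely illustrative; the actual env may encode sense actions differently):
```python
def sense_index_to_square(idx):
    """Flat index in [0, 36) -> (file, rank) of the 3x3 sense center,
    both in [1, 6], so the window never leaves the 8x8 board."""
    assert 0 <= idx < 36
    return 1 + idx % 6, 1 + idx // 6  # files b..g, ranks 2..7
```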
Planner:
input: ?
output: piece movement [space variable per turn]; how to encode this as a NN
output? // Figure out (one common option sketched below)
loss function input: ?
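One common option for the encoding question above (an assumption, not a settled design choice here) is a flat from-square x to-square head, with illegal moves masked out each turn (see the parametric action space note further down):
```python
def move_to_index(from_sq, to_sq):
    """from_sq, to_sq in [0, 64) -> flat action index in [0, 4096).
    Promotions are ignored in this sketch."""
    return from_sq * 64 + to_sq

def index_to_move(idx):
    """Inverse mapping: flat index -> (from_sq, to_sq)."""
    return divmod(idx, 64)
```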
remember we can just run MCEIRL on top of both Trout's sensor and planner
// Report Due: 3/24/22
IEEE conf format
updated slide
Title of the project,
motivation,
problem formulation,
prior literature review,
// Second Report Due: 4/12/22
Add results for MCEIRL
Run rllib env (okay it runs!)
make a child env from the open_spiel env to have our own render function (done!)
Implement Random, Attacker, and Trout via rllib env (done!)
start off with loading in env and creating a simple agent for it
Then create an MCEIRL Policy/Trainer
// Do we need to consider parametric action spaces for the rbc env?
// Along with custom models
https://docs.ray.io/en/latest/rllib/rllib-models.html#variable-length-parametric-action-spaces
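A hedged sketch of the custom-model side of that, loosely following the linked RLlib docs: the env would have to expose a Dict observation with an "action_mask" entry; that key name, the model class name, and the obs layout are assumptions, not the current env:
```python
import torch
import torch.nn as nn
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class RbcActionMaskModel(TorchModelV2, nn.Module):
    """Compute logits from the unmasked observation, then mask illegal actions."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        # obs is assumed to be Dict({"observations": ..., "action_mask": ...})
        self.internal_model = FullyConnectedNetwork(
            obs_space.original_space["observations"], action_space,
            num_outputs, model_config, name + "_internal")

    def forward(self, input_dict, state, seq_lens):
        logits, _ = self.internal_model(
            {"obs": input_dict["obs"]["observations"]})
        # push logits of currently illegal actions to (numerically) -inf
        mask = input_dict["obs"]["action_mask"]
        inf_mask = torch.clamp(torch.log(mask), min=torch.finfo(logits.dtype).min)
        return logits + inf_mask, state

    def value_function(self):
        return self.internal_model.value_function()
```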
MCEIRL implementation progress:
- gather methods that need to be implemented from Policy/TorchPolicy
Policy not implemented:
compute_actions
compute_log_likelihoods
loss
load_batch_into_buffer
get_num_samples_loaded_into_buffer
learn_on_loaded_batch
compute_gradients
apply_gradients
get_weights
set_weights
export_checkpoint
export_model
import_model_from_h5
from these TorchPolicy implements:
compute_actions
compute_log_likelihoods
load_batch_into_buffer
get_num_samples_loaded_into_buffer
learn_on_loaded_batch
compute_gradients
apply_gradients
get_weights
set_weights
export_model
import_model_from_h5
and TorchPolicy does not implement:
loss (algorithm)
export_checkpoint (engineering)
TorchPolicy additionally requires:
nothing else
- understand high level algorithm
This is godsend: https://apps.dtic.mil/sti/pdfs/AD1090741.pdf
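Given that audit, a minimal skeleton for the subclass might look like this; the BC-style negative log-likelihood below is only a placeholder where the real MCEIRL objective would go, and the class name is an assumption:
```python
import torch
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.policy.torch_policy import TorchPolicy


class MCEIRLTorchPolicy(TorchPolicy):
    def loss(self, model, dist_class, train_batch):
        # Placeholder objective: negative log-likelihood of the demonstrated
        # actions (behavioral cloning). The actual MCEIRL loss would instead
        # update a learned reward so that the induced soft-optimal policy
        # matches the demonstrations' feature expectations.
        logits, _ = model(train_batch)
        action_dist = dist_class(logits, model)
        log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])
        return -torch.mean(log_probs)

    def export_checkpoint(self, export_dir):
        # engineering-only piece flagged above; stubbed out for now
        raise NotImplementedError
```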
# TODO
Code & Implement MCEIRL alg
Experiments/Graphs to show:
(win rate over 100 games at every X checkpoints) at least 10 or so checkpoints
train with attacker behaviors
-compare MCEIRL with random
-compare MCEIRL with attacker
train with trout behaviors
-compare MCEIRL with random
-compare MCEIRL with attacker
-compare MCEIRL with trout
MCEIRL alg:
Steps: (Pseudocode:)
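The actual steps still need to be written down; as a stand-in, here is a rough tabular sketch of the max-causal-entropy IRL loop in the spirit of the report linked above. Everything in it (the function name, the P/features/expert_svf inputs, the uniform start-state distribution) is an assumption for illustration, and the real RBC problem would need function approximation rather than this tabular form:
```python
import numpy as np

def mceirl(P, features, expert_svf, horizon, lr=0.1, iters=100):
    """P: (S, A, S) transition probs, features: (S, D) state features,
    expert_svf: (S,) expert state visitation frequencies."""
    n_states, n_actions, _ = P.shape
    theta = np.zeros(features.shape[1])
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(iters):
        reward = features @ theta                      # current reward estimate
        # 1) backward pass: soft ("max causal entropy") value iteration
        v = np.zeros(n_states)
        for _ in range(horizon):
            q = reward[:, None] + P @ v                # (S, A)
            q_max = q.max(axis=1, keepdims=True)
            v = (q_max + np.log(np.exp(q - q_max).sum(axis=1, keepdims=True))).ravel()
            policy = np.exp(q - v[:, None])            # stochastic policy (S, A)
        # 2) forward pass: expected state visitations under the current policy
        d = np.full(n_states, 1.0 / n_states)          # assumed start distribution
        svf = d.copy()
        for _ in range(horizon - 1):
            d = np.einsum("s,sa,sat->t", d, policy, P)
            svf += d
        # 3) gradient step: match expert vs. learner feature expectations
        theta += lr * features.T @ (expert_svf - svf)
    return theta, policy
```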
Tensorboard viewing:
tensorboard --logdir <logdir>
tensorboard --logdir ~/vcs/git/github/acxz/blinding-ray/logs
Misc. issues:
Figure out why the dataset API is not working
or figure out how to read multiagent experiences from JSON
Upstream issue: https://github.com/ray-project/ray/issues/24283
Use BC for now as it does not require postprocess_input to be true
For BC which agent is the behavior being cloned for in a multiagent setting?
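One hedged way to pin that down in RLlib's multiagent config is to give each agent its own policy and only train the one whose behavior is being cloned; the agent ids, policy ids, and placeholder spaces below are assumptions, not the actual env's names:
```python
from gym import spaces

# placeholder spaces standing in for the RBC env's real ones
obs_space = spaces.Box(low=0.0, high=1.0, shape=(8 * 8 * 13,))
act_space = spaces.Discrete(4096)

config = {
    "multiagent": {
        "policies": {
            # (policy_cls, obs_space, act_space, config); None -> trainer default
            "cloned_player": (None, obs_space, act_space, {}),
            "opponent": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: (
            "cloned_player" if agent_id == "player_0" else "opponent"
        ),
        # only the cloned player's policy receives gradient updates from the
        # demonstration batches
        "policies_to_train": ["cloned_player"],
    },
}
```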
When loading experiences during training for trout get the following error:
Looks like we kill it, but don't restart it
workaround: during training comment out TroutCallbacks
```terminal output
Training
Train iter: 1/100
*** SIGTERM received at time=1651109112 on cpu 2 ***
PC: @ 0x7f8bdc97e79b (unknown) kill
@ 0x7f8bdc97e560 (unknown) (unknown)
[2022-04-27 21:25:12,887 E 141269 141269] logging.cc:325: *** SIGTERM received at time=1651109112 on cpu 2 ***
[2022-04-27 21:25:12,887 E 141269 141269] logging.cc:325: PC: @ 0x7f8bdc97e79b (unknown) kill
[2022-04-27 21:25:12,888 E 141269 141269] logging.cc:325: @ 0x7f8bdc97e560 (unknown) (unknown)
prob already dead
Train iter: 2/100
prob already dead
Train iter: 3/100
prob already dead
Train iter: 4/100
prob already dead
[1] 141269 IOT instruction (core dumped) python scripts/train_mceirl_rbc.py
```
Imp upstream: (other issues and bugs listed in code comments)
To get the info state we need the following patch in (upstream)
/rllib/evaluation/collectors/simple_list_collector.py
```python
# Create the batch of data from the different buffers.
data_col = view_req.data_col or view_col
delta = (
    -1
    if data_col
    in [
        SampleBatch.OBS,
        SampleBatch.ENV_ID,
        SampleBatch.EPS_ID,
        SampleBatch.AGENT_INDEX,
        SampleBatch.T,
        # (acxz): add INFOS
        SampleBatch.INFOS,
    ]
    else 0
)
```
Note: This doesn't solve the fact that we don't get an info for the first obs
open_spiel rbc doesn't always have good checks or return a pyspiel.SpielError
Thus we need to do a manual assertion in the ray/rllib/env/wrappers/open_spiel
file:
```python
try:
    # TODO: (acxz) sometimes this is an illegal action yet no error is thrown
    # This is probably the RBC code's fault for not throwing a pyspiel.SpielError, I'm guessing
    # So we do an assertion and then catch it
    assert action[curr_player] in self.state.legal_actions()
    self.state.apply_action(action[curr_player])
# TODO: (sven) resolve this hack by publishing legal actions
# with each step.
except (AssertionError, pyspiel.SpielError):
    self.state.apply_action(
        np.random.choice(self.state.legal_actions()))
    penalties[curr_player] = -0.1
```
Add human play to evaluation as a demonstration
TODO: how to get the match data and play it back with reconchess's rc-replay
script (at the very least I just want to view the game in a GUI)
okay, state is being saved in output.json via infos
now need to parse output.json and write the state sequence to rc-replay compatible
json (do I also need sense actions?)
Whoa, need a lot more than I thought
Make a feature request for rbc openspiel to add output in APL format and then from
the rllib env wrapper, I can just do `infos = {self.gamehistory}`
I guess till then I can't really get a nice gui...
since if openspiel adds that functionality to rbc then
would they need it for everything? (maybe it would just be an rbc utility function?)
idk, maybe it's still possible through just the rllib env wrapper;
I just need the function mapping as the game is being played,
but it would be a good bit of work that we can put off for now, I think
one last thing, see if openspiel does have some kind of game history output for
rbc (maybe with state->ActionToString)
right now work with the ascii chess.board representation (does rbc have its own
ascii repr? needed for sensing)
TODO: Add a wrapper on the RBC bot API to create an RL policy in rllib for easier
implementation of bots in both the rllib open_spiel env and APL's rbc servers
openspiel spits out an error message for an incorrect action no matter what (can't
suppress it); for sense actions we suppress it by modifying the sense to stay in
the corner 6x6 grid
Should IRL algorithms be a mixin or a plugin-based algorithm (like exploration-based
curiosity)?
This is because the reward learning is technically a different problem from
policy learning. It should be plug and play, imo.
Typically IRL algs do the reward learning and use a separate alg to learn the
policy anyway.
Maybe ask the HumanCompatibleAI/imitation folks about this idea?