develop in conda, since ray with venv is broken (why?) and there is no openspiel
AUR package
We actually have 3 components: Sense, Estimate, Action,
where Sense and Action are the acting agents
use the same estimation as the trout bot but apply vanilla RL on it
use the same estimation as the trout bot but apply IRL on it
Links:
papers on drl and imperfect information
https://www.reddit.com/r/reinforcementlearning/comments/hdzmv3/help_with_reinforcement_learning_in_reconchess/
https://arxiv.org/abs/2006.10410
https://arxiv.org/abs/1603.01121
Multi-Task MCEIRL
http://export.arxiv.org/pdf/1805.08882 (sense/move are two tasks?)
Generalized MCEIRL (with the reward function not being a linear mapping, I think)
How to break it up?
Have default estimator in trout: trout_sensor
Have default planner in trout: stockfish
1. Use RL to replace stockfish but keep estimator
Use MCEIRL to replace stockfish
2. Use RL to replace estimator but keep stockfish planner
Have a good default planner but try to improve the information stockfish
receives
3. Use RL for both estimator and planner
Do this after we can get at least a better estimator or planner
Control flow with inputs:
perceived board state:
For each piece, a probability attached to each square plus one for not being on the board
(8x8 + 1 probabilities which sum to 1)
the perceived board state is updated with 1) the estimator, 2) your own moves,
3) the opponent's move if it captures one of your pieces
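A minimal sketch of the perceived board state described above, assuming one probability row per tracked piece over 64 squares plus an off-board slot (the piece count, uniform initialization, and helper name are illustrative assumptions, not the actual implementation):
```python
import numpy as np

NUM_PIECES = 16          # opponent pieces being tracked (assumed)
NUM_SLOTS = 8 * 8 + 1    # 64 squares + 1 "captured / off-board" slot

# belief[p, s] = probability that piece p occupies slot s; each row sums to 1
belief = np.full((NUM_PIECES, NUM_SLOTS), 1.0 / NUM_SLOTS)

def update_from_sense(belief, piece, ruled_out_squares, seen_square=None):
    """Fold one sense result into a single piece's distribution."""
    if seen_square is not None:
        # the piece was observed: all probability mass goes to that square
        belief[piece, :] = 0.0
        belief[piece, seen_square] = 1.0
    else:
        # the piece was not in the sensed window: rule those squares out
        belief[piece, list(ruled_out_squares)] = 0.0
        belief[piece] /= belief[piece].sum()
    return belief
```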
Estimator:
input: previous perceived board state ?
output: square to sense [space: 6x6]
(to decrease the space we can limit it to the inner grid
(6x6), since we sense a 3x3 window the edges/corners will still be covered even with the sense
coordinate one square inside; a tiny index-to-square mapping is sketched after this block)
if you want the estimator to be biased toward looking where the planner wants to
move, then the planner should query a move first and feed that as input
to the estimator. The estimator then picks a square to sense, and that information is
then fed into the planner to actually make the move
loss function input: ?
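A tiny sketch of the reduced sense encoding mentioned above, assuming a flat index over the inner 6x6 grid of sense centers (purely illustrative; the actual env may encode sense actions differently):
```python
def sense_index_to_square(idx):
    """Flat index in [0, 36) -> (file, rank) of the 3x3 sense center,
    both in [1, 6], so the window never leaves the 8x8 board."""
    assert 0 <= idx < 36
    return 1 + idx % 6, 1 + idx // 6  # files b..g, ranks 2..7
```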
Planner:
input: ?
output: piece movement [space variable per turn]; how to encode this as a NN
output? // Figure out (one common option sketched below)
loss function input: ?
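One common option for the encoding question above (an assumption, not a settled design choice here) is a flat from-square x to-square head, with illegal moves masked out each turn (see the parametric action space note further down):
```python
def move_to_index(from_sq, to_sq):
    """from_sq, to_sq in [0, 64) -> flat action index in [0, 4096).
    Promotions are ignored in this sketch."""
    return from_sq * 64 + to_sq

def index_to_move(idx):
    """Inverse mapping: flat index -> (from_sq, to_sq)."""
    return divmod(idx, 64)
```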
remember we can just run MCEIRL on top of both Trout's sensor and planner
// Report Due: 3/24/22
IEEE conf format
updated slide
Title of the project,
motivation,
problem formulation,
prior literature review,
// Second Report Due: 4/12/22
Add results for MCEIRL
Run rllib env (okay it runs!)
make a child env from the open_spiel env to have our own render function (done!)
Implement Random, Attacker, and Trout via rllib env (done!)
start off with loading in env and creating a simple agent for it
Then create an MCEIRL Policy/Trainer
// Do we need to consider parametric action spaces for the rbc env?
// Along with custom models
https://docs.ray.io/en/latest/rllib/rllib-models.html#variable-length-parametric-action-spaces
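A hedged sketch of the custom-model side of that, loosely following the linked RLlib docs: the env would have to expose a Dict observation with an "action_mask" entry; that key name, the model class name, and the obs layout are assumptions, not the current env:
```python
import torch
import torch.nn as nn
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class RbcActionMaskModel(TorchModelV2, nn.Module):
    """Compute logits from the unmasked observation, then mask illegal actions."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        # obs is assumed to be Dict({"observations": ..., "action_mask": ...})
        self.internal_model = FullyConnectedNetwork(
            obs_space.original_space["observations"], action_space,
            num_outputs, model_config, name + "_internal")

    def forward(self, input_dict, state, seq_lens):
        logits, _ = self.internal_model(
            {"obs": input_dict["obs"]["observations"]})
        # push logits of currently illegal actions to (numerically) -inf
        mask = input_dict["obs"]["action_mask"]
        inf_mask = torch.clamp(torch.log(mask), min=torch.finfo(logits.dtype).min)
        return logits + inf_mask, state

    def value_function(self):
        return self.internal_model.value_function()
```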
MCEIRL implementation progress:
- gather methods that need to be implemented from Policy/TorchPolicy
Policy not implemented:
compute_actions
compute_log_likelihoods
loss
load_batch_into_buffer
get_num_samples_loaded_into_buffer
learn_on_loaded_batch
compute_gradients
apply_gradients
get_weights
set_weights
export_checkpoint
export_model
import_model_from_h5
from these TorchPolicy implements:
compute_actions
compute_log_likelihoods
load_batch_into_buffer
get_num_samples_loaded_into_buffer
learn_on_loaded_batch
compute_gradients
apply_gradients
get_weights
set_weights
export_model
import_model_from_h5
and TorchPolicy does not implement:
loss (algorithm)
export_checkpoint (engineering)
TorchPolicy additionally requires:
nothing else
- understand high level algorithm
This is godsend: https://apps.dtic.mil/sti/pdfs/AD1090741.pdf
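Given that audit, a minimal skeleton for the subclass might look like this; the BC-style negative log-likelihood below is only a placeholder where the real MCEIRL objective would go, and the class name is an assumption:
```python
import torch
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.policy.torch_policy import TorchPolicy


class MCEIRLTorchPolicy(TorchPolicy):
    def loss(self, model, dist_class, train_batch):
        # Placeholder objective: negative log-likelihood of the demonstrated
        # actions (behavioral cloning). The actual MCEIRL loss would instead
        # update a learned reward so that the induced soft-optimal policy
        # matches the demonstrations' feature expectations.
        logits, _ = model(train_batch)
        action_dist = dist_class(logits, model)
        log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])
        return -torch.mean(log_probs)

    def export_checkpoint(self, export_dir):
        # engineering-only piece flagged above; stubbed out for now
        raise NotImplementedError
```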
# TODO
Code & Implement MCEIRL alg
Experiments/Graphs to show:
(win rate over 100 games at every X checkpoints) at least 10 or so checkpoints
train with attacker behaviors
-compare MCEIRL with random
-compare MCEIRL with attacker
train with trout behaviors
-compare MCEIRL with random
-compare MCEIRL with attacker
-compare MCEIRL with trout
MCEIRL alg:
Steps: (Pseudocode:)
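The actual steps still need to be written down; as a stand-in, here is a rough tabular sketch of the max-causal-entropy IRL loop in the spirit of the report linked above. Everything in it (the function name, the P/features/expert_svf inputs, the uniform start-state distribution) is an assumption for illustration, and the real RBC problem would need function approximation rather than this tabular form:
```python
import numpy as np

def mceirl(P, features, expert_svf, horizon, lr=0.1, iters=100):
    """P: (S, A, S) transition probs, features: (S, D) state features,
    expert_svf: (S,) expert state visitation frequencies."""
    n_states, n_actions, _ = P.shape
    theta = np.zeros(features.shape[1])
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(iters):
        reward = features @ theta                      # current reward estimate
        # 1) backward pass: soft ("max causal entropy") value iteration
        v = np.zeros(n_states)
        for _ in range(horizon):
            q = reward[:, None] + P @ v                # (S, A)
            q_max = q.max(axis=1, keepdims=True)
            v = (q_max + np.log(np.exp(q - q_max).sum(axis=1, keepdims=True))).ravel()
            policy = np.exp(q - v[:, None])            # stochastic policy (S, A)
        # 2) forward pass: expected state visitations under the current policy
        d = np.full(n_states, 1.0 / n_states)          # assumed start distribution
        svf = d.copy()
        for _ in range(horizon - 1):
            d = np.einsum("s,sa,sat->t", d, policy, P)
            svf += d
        # 3) gradient step: match expert vs. learner feature expectations
        theta += lr * features.T @ (expert_svf - svf)
    return theta, policy
```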
Tensorboard viewing:
tensorboard --logdir <logdir>
tensorboard --logdir ~/vcs/git/github/acxz/blinding-ray/logs
Misc. issues:
Figure out why the dataset API is not working
or figure out how to read multiagent experiences from JSON
Upstream issue: https://github.com/ray-project/ray/issues/24283
Use BC for now as it does not require postprocess_input to be true
For BC which agent is the behavior being cloned for in a multiagent setting?
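One hedged way to pin that down in RLlib's multiagent config is to give each agent its own policy and only train the one whose behavior is being cloned; the agent ids, policy ids, and placeholder spaces below are assumptions, not the actual env's names:
```python
from gym import spaces

# placeholder spaces standing in for the RBC env's real ones
obs_space = spaces.Box(low=0.0, high=1.0, shape=(8 * 8 * 13,))
act_space = spaces.Discrete(4096)

config = {
    "multiagent": {
        "policies": {
            # (policy_cls, obs_space, act_space, config); None -> trainer default
            "cloned_player": (None, obs_space, act_space, {}),
            "opponent": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: (
            "cloned_player" if agent_id == "player_0" else "opponent"
        ),
        # only the cloned player's policy receives gradient updates from the
        # demonstration batches
        "policies_to_train": ["cloned_player"],
    },
}
```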
When loading experiences during training for trout get the following error:
Looks like we kill it, but don't restart it
workaround: during training comment out TroutCallbacks
```terminal output
Training
Train iter: 1/100
*** SIGTERM received at time=1651109112 on cpu 2 ***
PC: @ 0x7f8bdc97e79b (unknown) kill
@ 0x7f8bdc97e560 (unknown) (unknown)
[2022-04-27 21:25:12,887 E 141269 141269] logging.cc:325: *** SIGTERM received at time=1651109112 on cpu 2 ***
[2022-04-27 21:25:12,887 E 141269 141269] logging.cc:325: PC: @ 0x7f8bdc97e79b (unknown) kill
[2022-04-27 21:25:12,888 E 141269 141269] logging.cc:325: @ 0x7f8bdc97e560 (unknown) (unknown)
prob already dead
Train iter: 2/100
prob already dead
Train iter: 3/100
prob already dead
Train iter: 4/100
prob already dead
[1] 141269 IOT instruction (core dumped) python scripts/train_mceirl_rbc.py
```
Imp upstream: (other issues and bugs listed in code comments)
To get the info state we need the following patch in (upstream)
/rllib/evaluation/collectors/simple_list_collector.py
```python
# Create the batch of data from the different buffers.
data_col = view_req.data_col or view_col
delta = (
    -1
    if data_col
    in [
        SampleBatch.OBS,
        SampleBatch.ENV_ID,
        SampleBatch.EPS_ID,
        SampleBatch.AGENT_INDEX,
        SampleBatch.T,
        # (acxz): add INFOS
        SampleBatch.INFOS,
    ]
    else 0
)
```
Note: This doesn't solve the fact that we don't get an info for the first obs
open_spiel rbc doesn't always have good checks or return a pyspiel.SpielError
Thus we need to do a manual assertion in the ray/rllib/env/wrappers/open_spiel
file:
```python
try:
    # TODO: (acxz) sometimes this is an illegal action yet no error is thrown
    # This is probably the RBC code's fault for not throwing a pyspiel.SpielError, I'm guessing
    # So we do an assertion and then catch it
    assert action[curr_player] in self.state.legal_actions()
    self.state.apply_action(action[curr_player])
# TODO: (sven) resolve this hack by publishing legal actions
# with each step.
except (AssertionError, pyspiel.SpielError):
    self.state.apply_action(
        np.random.choice(self.state.legal_actions()))
    penalties[curr_player] = -0.1
```
Add human play to evaluation as a demonstration
TODO: how to get the match data and play it back with reconchess's rc-replay
script (at the very least I just want to view the game in a GUI)
okay, state is being saved in output.json via infos
now need to parse output.json and write the state sequence to rc-replay compatible
json (do I also need sense actions?)
Whoa, need a lot more than I thought
Make a feature request for rbc openspiel to add output in APL format and then from
the rllib env wrapper, I can just do `infos = {self.gamehistory}`
I guess till then I can't really get a nice gui...
since if openspiel adds that functionality to rbc then
would they need it for everything? (maybe it would just be an rbc utility function?)
idk, maybe it's still possible through just the rllib env wrapper;
I just need the function mapping as the game is being played,
but it would be a good bit of work that we can put off for now, I think
one last thing, see if openspiel does have some kind of game history output for
rbc (maybe with state->ActionToString)
right now work with the ascii chess.board representation (does rbc have its own
ascii repr? needed for sensing)
TODO: Add a wrapper on the RBC bot API to create an RL policy in rllib for easier
implementation of bots in both the rllib open_spiel env and APL's rbc servers
openspiel spits out an error message for an incorrect action no matter what (can't
suppress it); for sense actions we suppress it by modifying the sense to stay in
the corner 6x6 grid
Should IRL algorithms be a mixin or a plugin-based algorithm (like exploration-based
curiosity)?
This is because the reward learning is technically a different problem from
policy learning. It should be plug and play, imo.
Typically IRL algs do the reward learning and use a separate alg to learn the
policy anyway.
Maybe ask the HumanCompatibleAI/imitation folks about this idea?