4/9 made good progress. Used ChatGPT to implement the basic game with a GUI. Still need to make adjustments.
4/19 decided to build a DQN as the first model. Implemented SeaBattleEnv for the DQN. Found an issue with the ship placement function: it places ships adjacent to other ships, which is forbidden in the GamePigeon version of sea battle.
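roughly the rule the placement function should enforce, as a sketch (not the actual SeaBattleEnv code; the grid encoding is my assumption):

import numpy as np

def placement_ok(grid, cells):
    # grid: 2D numpy array, nonzero where a ship already sits (assumed encoding)
    # cells: list of (row, col) positions the new ship would occupy
    rows, cols = grid.shape
    for r, c in cells:
        if grid[r, c]:
            return False  # cell already occupied
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc]:
                    return False  # touches an existing ship (including diagonally), not allowed
    return True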
4/20 trained the first DQN model. It doesn't seem to work well (always chooses invalid actions and gets stuck). Debugging.
4/21 the problem is likely that the reward for ending the game was not set to a high value; instead, it was treated the same as choosing a wrong action.
this version is named 0.1.
the new environment with a new batch size of 512 doesn't seem to converge well; there are random spikes of high loss. training ongoing.
the new version 0.11 shows no improvement over 0.1.
training a new model with larger hidden layers still doesn't seem to converge; steps to complete is ~200 (the maximum is 49*10).
next improvement: one-hot encoding, plus using the last action chosen as part of the input.
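what I mean, as a sketch (the board size and cell codes below are my assumptions, not necessarily what the env uses):

import numpy as np

N_CELLS = 49    # assuming a 7x7 board
N_STATES = 3    # assumed cell codes: 0 = unknown, 1 = miss, 2 = hit

def encode_obs(board, last_action):
    # board: (7, 7) int array of cell codes; last_action: flat cell index or None
    one_hot_cells = np.eye(N_STATES, dtype=np.float32)[board.ravel()]   # (49, 3)
    last = np.zeros(N_CELLS, dtype=np.float32)
    if last_action is not None:
        last[last_action] = 1.0
    return np.concatenate([one_hot_cells.ravel(), last])                # 49*3 + 49 = 196 inputs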
4/22 improved repo structure
today's goal: organize sea_battle_dqn.py
finally works... I shouldn't have set the rewards to large numbers with huge differences between them.
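for reference, the kind of scale that works (numbers below are purely illustrative, not the actual config values):

# Purely illustrative reward scale; the real values live in the training config.
REWARD_HIT = 1.0
REWARD_MISS = -0.1
REWARD_INVALID = -1.0   # firing at an already-revealed cell
REWARD_WIN = 10.0       # clearly larger than a single hit, but not by orders of magnitude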
0.2.1: one-hot + last action; 3500 episodes:
Episode: 3260, Loss: 6.83033576933667e-05, Epsilon: 0.00998645168764533, Steps to complete: 24
Episode: 3280, Loss: 8.667529618833214e-05, Epsilon: 0.00998645168764533, Steps to complete: 24
Episode: 3300, Loss: 6.79210206726566e-05, Epsilon: 0.00998645168764533, Steps to complete: 27
Episode: 3320, Loss: 5.839686491526663e-05, Epsilon: 0.00998645168764533, Steps to complete: 31
Episode: 3340, Loss: 7.395146531052887e-05, Epsilon: 0.00998645168764533, Steps to complete: 38
Episode: 3360, Loss: 7.044956873869523e-05, Epsilon: 0.00998645168764533, Steps to complete: 26
Episode: 3380, Loss: 0.00010497208859305829, Epsilon: 0.00998645168764533, Steps to complete: 21
Episode: 3400, Loss: 9.763846173882484e-05, Epsilon: 0.00998645168764533, Steps to complete: 30
Episode: 3420, Loss: 7.538014324381948e-05, Epsilon: 0.00998645168764533, Steps to complete: 30
Episode: 3440, Loss: 5.732289355364628e-05, Epsilon: 0.00998645168764533, Steps to complete: 25
Episode: 3460, Loss: 8.03233269834891e-05, Epsilon: 0.00998645168764533, Steps to complete: 26
Episode: 3480, Loss: 5.7267687225248665e-05, Epsilon: 0.00998645168764533, Steps to complete: 32
0.2.2: one-hot + last action; 6000 episodes;
Episode: 5580, Loss: 0.0006477560382336378, Epsilon: 0.00998645168764533, Steps to complete: 23
Episode: 5600, Loss: 0.0010723448358476162, Epsilon: 0.00998645168764533, Steps to complete: 18
Episode: 5620, Loss: 0.0008745631203055382, Epsilon: 0.00998645168764533, Steps to complete: 33
Episode: 5640, Loss: 0.0012952240649610758, Epsilon: 0.00998645168764533, Steps to complete: 22
Episode: 5660, Loss: 0.0005613144603557885, Epsilon: 0.00998645168764533, Steps to complete: 25
Episode: 5680, Loss: 0.0007040578057058156, Epsilon: 0.00998645168764533, Steps to complete: 20
Episode: 5700, Loss: 0.0017398456111550331, Epsilon: 0.00998645168764533, Steps to complete: 23
Episode: 5720, Loss: 0.0007664738805033267, Epsilon: 0.00998645168764533, Steps to complete: 22
Episode: 5740, Loss: 0.00034564395900815725, Epsilon: 0.00998645168764533, Steps to complete: 24
Episode: 5760, Loss: 0.0013082540826871991, Epsilon: 0.00998645168764533, Steps to complete: 29
Episode: 5780, Loss: 0.0003618979826569557, Epsilon: 0.00998645168764533, Steps to complete: 23
Episode: 5800, Loss: 0.0002951952046714723, Epsilon: 0.00998645168764533, Steps to complete: 34
Episode: 5820, Loss: 0.0008323941146954894, Epsilon: 0.00998645168764533, Steps to complete: 25
Episode: 5840, Loss: 0.0004208995960652828, Epsilon: 0.00998645168764533, Steps to complete: 21
Episode: 5860, Loss: 0.0021627547685056925, Epsilon: 0.00998645168764533, Steps to complete: 28
Episode: 5880, Loss: 0.0004226358432788402, Epsilon: 0.00998645168764533, Steps to complete: 28
Episode: 5900, Loss: 0.0004482130752876401, Epsilon: 0.00998645168764533, Steps to complete: 27
Episode: 5920, Loss: 0.0003895494737662375, Epsilon: 0.00998645168764533, Steps to complete: 33
Episode: 5940, Loss: 0.000297995051369071, Epsilon: 0.00998645168764533, Steps to complete: 30
Episode: 5960, Loss: 0.002398234326392412, Epsilon: 0.00998645168764533, Steps to complete: 28
Episode: 5980, Loss: 0.00029560859547927976, Epsilon: 0.00998645168764533, Steps to complete: 25
0.3.0: one-hot + last action + dynamic rewards (for missed shots); 10000 episodes:
4/30: huge bug found. My _get_observation() returns all the information, including the locations of unexplored ships! How does it still take more than 10 steps??
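what the fix amounts to, roughly (a sketch; the grid encodings are assumptions, not the real SeaBattleEnv internals):

import numpy as np

UNKNOWN, MISS, HIT = 0, 1, 2   # assumed cell codes

def get_observation(ships, shots):
    # ships: bool grid of true ship locations; shots: bool grid of cells already fired at
    obs = np.full(ships.shape, UNKNOWN, dtype=np.int64)
    obs[shots & ships] = HIT
    obs[shots & ~ships] = MISS
    return obs   # unexplored ship cells stay UNKNOWN, so the agent can't cheat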
implemented DDQN; it performs better than the DQN.
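the core change versus plain DQN, as a PyTorch sketch (function and tensor names are mine): the online net picks the greedy next action, the target net scores it.

import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # action selection: online net
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # action evaluation: target net
        return rewards + gamma * next_q * (1.0 - dones.float())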
0.1.1 achieved lower losses with more neurons; it seems there is still room to lower the loss with more training episodes.
ddqn_v0.1.5 has a significantly lower loss, but more or less the same average steps compared to v0.1.4?
the performance (average steps to complete the game) worsens after 18,000 episodes, and all the models seem to converge to a rather random policy.
As I tested before, a policy that selects a random valid action completes games in 29 steps on average.
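how I'd measure that baseline, as a sketch (assumes the old gym step API and that 0 in the observation marks an unexplored cell; both are assumptions):

import numpy as np

def random_valid_baseline(make_env, episodes=1000):
    steps = []
    for _ in range(episodes):
        env = make_env()
        obs = env.reset()
        done, n = False, 0
        while not done:
            valid = np.flatnonzero(np.asarray(obs).ravel() == 0)  # unexplored cells (assumed code 0)
            action = int(np.random.choice(valid))
            obs, _, done, _ = env.step(action)
            n += 1
        steps.append(n)
    return float(np.mean(steps))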
tensorboard --logdir=runs --bind_all --reload_interval=5
python bin/pg_trainer.py --config config/pg_v0.1.0.toml --save_path models/
5/10 refactored sea_battle.py as sea_battle2.py. There is only one gym.Env class, SeaBattleEnv, that implements all the functionality needed for training. Improved efficiency and readability.
tensorboard: tensorboard --logdir runs --bind_all
ip addr
5/11: policy gradient still won't converge. Will implement a DDQN with the new gym.Env (sea2).
python bin/trainer.py --config config/dqn_v0.2.0.toml --save_path models/
python .\bin\evaluator.py --model_path models/sea2_dqn_v0.2.0.pth --config config/dqn_v0.2.0.toml
5/22: finally figured out policy gradient/A2C. They weren't lying when they said RL is extremely sensitive to hyperparameters.
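for the record, the A2C loss in outline (a PyTorch sketch, not the exact code in pg_trainer.py; the coefficient values are illustrative):

import torch
import torch.nn.functional as F

def a2c_loss(logits, values, actions, returns, value_coef=0.5, entropy_coef=0.01):
    # logits: (B, n_actions), values: (B,), actions: (B,), returns: (B,) discounted returns
    dist = torch.distributions.Categorical(logits=logits)
    advantages = returns - values.detach()                       # baseline-subtracted advantage
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    entropy_bonus = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus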
6/26: python bin\trainer.py --config config/ppo_v0.1.0.toml