Improve gamemode weighting #34
base: master
Conversation
# change weights from percentage of experience desired to percentage of gamemodes necessary (approx)
for k in self.gamemode_weights.keys():
    b, o = k.split("v")
    self.gamemode_weights[k] /= int(b)
weights_sum = sum(self.gamemode_weights.values())
self.gamemode_weights = {k: self.gamemode_weights[k] / weights_sum for k in self.gamemode_weights.keys()}
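For reference, the conversion above can be read as a standalone helper (a sketch; the function name and the example weights are mine, not from the PR): a mode like "3v3" produces roughly three times the experience per episode of a "1v1", so its sampling weight is divided by the team size before renormalizing.

def experience_to_gamemode_weights(weights):
    # Convert desired experience shares (keyed like "3v3") into per-gamemode
    # sampling probabilities by dividing out the team size, then renormalizing.
    adjusted = {k: w / int(k.split("v")[0]) for k, w in weights.items()}
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}

# Equal experience from each mode means sampling 1v1 three times as often as 3v3.
print(experience_to_gamemode_weights({"1v1": 1 / 3, "2v2": 1 / 3, "3v3": 1 / 3}))
# {'1v1': 0.545..., '2v2': 0.272..., '3v3': 0.181...}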
I don't think it's a good idea to reuse a public attribute named after a constructor argument to store values different from the ones provided. In other words, I think self.gamemode_weights should stay read-only for the sake of clarity and consistency.
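A minimal sketch of the separation being suggested, with a hypothetical class name and attribute name (the real constructor takes more arguments than this):

class WeightedGamemodeSetter:
    def __init__(self, gamemode_weights):
        # Keep the public attribute exactly as the caller provided it...
        self.gamemode_weights = dict(gamemode_weights)
        # ...and keep the derived sampling probabilities in a separate,
        # internal attribute instead of overwriting the original.
        adjusted = {k: w / int(k.split("v")[0]) for k, w in gamemode_weights.items()}
        total = sum(adjusted.values())
        self._gamemode_probs = {k: v / total for k, v in adjusted.items()}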
self.gamemode_weights = {k: max(self.gamemode_weights[k] + diff[k], 0) for k in self.gamemode_weights.keys()}
new_sum = sum(self.gamemode_weights.values())
self.gamemode_weights = {k: self.gamemode_weights[k] / new_sum for k in self.gamemode_weights.keys()}
mode = np.random.choice(list(self.gamemode_weights.keys()), p=list(self.gamemode_weights.values()))
Although this method improves on the previous one in several ways (it uses random sampling and reduces worker correlation), both still suffer from the same downfall: the correction is proportional to the error between the target and the current distribution.
This means that as the current distribution gets closer to the target, the error goes to zero and so does the correction term, which lets the error drift back up.
Nevertheless, I can see how this new algorithm is more robust than the previous one when multiple workers are sampling in parallel, but the empirical distribution will probably never settle on the target one.
You can use this algorithm to get more stable sampling probabilities over time:
For step 1 you can use an EMA initialized based on agent count, or anything other than 0, e.g.
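Since the referenced steps aren't reproduced in this thread, the sketch below is only one possible reading of the EMA idea; the class name, the correction rule, and the alpha value are assumptions rather than the exact algorithm being referenced. It keeps a per-worker EMA of the experience actually produced per mode and scales the base probabilities by the ratio of target share to current share, so the correction settles at 1 instead of decaying to 0.

import numpy as np


class EMAGamemodeSampler:
    # Sketch only: per-worker EMA of generated experience per mode, sampled so
    # the experience share drifts toward the target share.
    def __init__(self, target, alpha=0.02):
        self.target = dict(target)   # desired share of experience per mode
        self.alpha = alpha           # EMA smoothing factor (assumed value)
        self.agents = {k: int(k.split("v")[0]) + int(k.split("v")[1]) for k in target}
        self.ema = dict(target)      # initialize away from 0 so ratios are defined

    def sample(self):
        ema_total = sum(self.ema.values())
        share = {k: self.ema[k] / ema_total for k in self.target}
        # Base rate converts experience share to gamemode share; the ratio
        # target/share boosts under-represented modes. As share approaches the
        # target the correction tends to 1, not 0, so it never vanishes.
        raw = {k: (self.target[k] / max(self.agents[k], 1))
                  * (self.target[k] / max(share[k], 1e-6))
               for k in self.target}
        total = sum(raw.values())
        mode = np.random.choice(list(self.target), p=[raw[k] / total for k in self.target])
        # Update the EMA with the experience produced by the chosen mode,
        # taken as roughly proportional to the number of agents in it.
        for k in self.target:
            contrib = float(self.agents[k]) if k == mode else 0.0
            self.ema[k] = (1 - self.alpha) * self.ema[k] + self.alpha * contrib
        return mode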
What's the conclusion here?
Conclusion is I got busy and didn't finish it, but it's still on my list.
I'm going to take the suggestions, just haven't finished yet.
Ok, this is ready and tested. It uses the EMA for the weights, per worker. Generated experience is counted per actual agent, which means that if you're using pretrained agents or past models, those percentages naturally come out of the generated experience, which I think is ideal.
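As a rough usage illustration of the per-worker idea (reusing the hypothetical sampler sketched earlier, not the code in this PR): each worker owns its own sampler, so picking the next gamemode needs no shared state or Redis round-trip.

# Each rollout worker keeps its own sampler instance (hypothetical usage).
sampler = EMAGamemodeSampler({"1v1": 1 / 3, "2v2": 1 / 3, "3v3": 1 / 3})
counts = {}
for _ in range(10_000):
    mode = str(sampler.sample())
    counts[mode] = counts.get(mode, 0) + 1
# Experience share per mode (episodes * agents per episode) should end up
# close to one third each.
experience = {k: counts[k] * (int(k.split("v")[0]) + int(k.split("v")[1])) for k in counts}
total = sum(experience.values())
print({k: round(v / total, 3) for k, v in experience.items()})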
Added one commit for the 1v0 fixes that is related to this.
This changes the gamemode weighting to be faster (no Redis calls) and more stable, so it doesn't swing around as much. Each training step should have a very similar mix of gamemodes, very close to the desired weights, even if the training steps are very short (I've tested with 100k steps).