Improve gamemode weighting #34

Open
wants to merge 10 commits into master
Conversation

Kaiyotech (Contributor)

This changes the gamemode weighting to be faster (no Redis calls) and more stable, so it doesn't swing around as much. Each training step should get a near-identical mix of gamemodes, very close to the desired weights, even when the training steps are very short (I've tested with 100k steps).

Comment on lines 88 to 93
# change weights from percentage of experience desired to percentage of gamemodes necessary (approx)
for k in self.gamemode_weights.keys():
b, o = k.split("v")
self.gamemode_weights[k] /= int(b)
weights_sum = sum(self.gamemode_weights.values())
self.gamemode_weights = {k: self.gamemode_weights[k] / weights_sum for k in self.gamemode_weights.keys()}
Collaborator

I don't think it's a good idea to reuse a public attribute named after a constructor arg to store values different from the args provided. In other words, I think self.gamemode_weights should be read-only, for the sake of clarity and consistency.

Comment on lines 231 to 234
self.gamemode_weights = {k: max(self.gamemode_weights[k] + diff[k], 0) for k in self.gamemode_weights.keys()}
new_sum = sum(self.gamemode_weights.values())
self.gamemode_weights = {k: self.gamemode_weights[k] / new_sum for k in self.gamemode_weights.keys()}
mode = np.random.choice(list(self.gamemode_weights.keys()), p=list(self.gamemode_weights.values()))
Collaborator

Although this method improves on the previous one in several ways (it uses random sampling and reduces worker correlation), both still suffer from the same flaw: the algorithm applies a proportional correction driven by the error between the target and the current distribution.
This means that as the current distribution gets closer to the target, the error goes to zero and so does the correction term, which lets the error climb back up.
Nevertheless, I can see how this new algorithm is more robust than the previous one when multiple workers are sampling in parallel, but the empirical distribution will probably never settle on the target one.
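
A toy numerical illustration of this point (the numbers and names below are invented for the example, not taken from the PR): once the empirical distribution matches the target, the correction vanishes and the sampling weights are left wherever they happen to be, even if those weights won't keep the distribution on target.

```python
# Toy illustration of proportional correction on sampling weights (made-up values).
target    = {'1v1': 0.5, '2v2': 0.3, '3v3': 0.2}    # desired experience distribution
empirical = {'1v1': 0.5, '2v2': 0.3, '3v3': 0.2}    # suppose we have just hit the target
weights   = {'1v1': 0.6, '2v2': 0.25, '3v3': 0.15}  # current sampling weights

diff = {k: target[k] - empirical[k] for k in target}          # all zero here
weights = {k: max(weights[k] + diff[k], 0) for k in weights}  # so no correction is applied
total = sum(weights.values())
weights = {k: v / total for k, v in weights.items()}
# The sampling weights stay at 0.6 / 0.25 / 0.15, which need not be the weights that
# maintain the target (modes generate different amounts of experience per episode),
# so the empirical distribution drifts until the error grows large enough to correct it.
```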

@lucas-emery (Collaborator)

You can use this algorithm to get more stable sampling probabilities over time:

  1. Keep an estimate of the mean experience generated in each gamemode
  2. Calculate empirical distribution weights: Wemp = mean_exp / sum(mean_exp)
  3. Calculate corrected weights based on these estimates: Wcor = Wtarget / Wemp
  4. Calculate corrected sampling probs: P = Wcor / sum(Wcor)

For step 1 you can use an EMA initialized based on agent count or anything other than 0, e.g. mean_exp = {'1v1': 1000, '2v2': 2000, '3v3': 3000}
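
A minimal sketch of this scheme, assuming the EMA decay, function names, and update points are illustrative choices rather than anything specified in the PR:

```python
import numpy as np

# Desired share of total experience per gamemode (Wtarget).
target = {'1v1': 0.5, '2v2': 0.3, '3v3': 0.2}

# Step 1: EMA of experience generated per gamemode, initialized from agent count
# (initial values taken from the comment above); alpha is an illustrative decay rate.
mean_exp = {'1v1': 1000.0, '2v2': 2000.0, '3v3': 3000.0}
alpha = 0.02

def update_mean_exp(mode, steps_generated):
    # Update the running estimate after an episode of `mode` produced `steps_generated` steps.
    mean_exp[mode] = (1 - alpha) * mean_exp[mode] + alpha * steps_generated

def sampling_probs():
    # Step 2: empirical distribution weights Wemp = mean_exp / sum(mean_exp)
    total = sum(mean_exp.values())
    w_emp = {k: v / total for k, v in mean_exp.items()}
    # Step 3: corrected weights Wcor = Wtarget / Wemp
    w_cor = {k: target[k] / w_emp[k] for k in target}
    # Step 4: corrected sampling probabilities P = Wcor / sum(Wcor)
    norm = sum(w_cor.values())
    return {k: v / norm for k, v in w_cor.items()}

# Pick the next gamemode for this worker.
probs = sampling_probs()
mode = np.random.choice(list(probs.keys()), p=list(probs.values()))
```

With this scheme the correction does not vanish as the empirical distribution approaches the target: the sampling probabilities converge toward a fixed distribution that maintains it, with modes that produce more experience per episode simply sampled less often.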

@Rolv-Arild (Owner)

What's the conclusion here?

@Kaiyotech (Contributor, Author)

Kaiyotech commented Dec 6, 2022 via email

@Kaiyotech (Contributor, Author)

OK, this is ready and tested. It uses the EMA for the weights, per worker. Generated experience is counted per what was actually produced, which means that if you're using pretrained agents or past models, those percentages naturally fall out of the generated experience, which I think is ideal.
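
One way to read this, as a rough sketch continuing the example above (counting only learner timesteps is my assumption about what "actually produced" means, not something confirmed by the PR):

```python
def on_episode_end(mode, learner_timesteps):
    # Per-worker EMA update, fed only with the experience this worker actually
    # generated for the learner; episodes that include pretrained or past-model
    # agents therefore contribute less counted experience, and the corrected
    # sampling probabilities compensate for that automatically.
    mean_exp[mode] = (1 - alpha) * mean_exp[mode] + alpha * learner_timesteps
```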

@Kaiyotech (Contributor, Author)

Added one commit for the 1v0 fixes related to this.
