Suggestion: augment pre-training with lichess open database #5

Open
linux-leo opened this issue Dec 27, 2023 · 1 comment

Comments

linux-leo commented Dec 27, 2023

See: https://database.lichess.org/#standard_games

Maybe use every nth game from 2013, before Lichess grew in size, so the dataset covers a more or less equal number of games per month while still spanning a long period, and to reduce the number of games that need to be processed.
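
A rough sketch of what that sampling could look like, in case it helps. It assumes the monthly dump has already been decompressed into a plain PGN text stream and that every game starts with an `[Event ...]` tag line. The names `iter_games`, `every_nth` and `stride_for_month` are made up for illustration and are not part of this project.

```python
from typing import Iterable, Iterator


def iter_games(lines: Iterable[str]) -> Iterator[str]:
    """Yield one PGN game at a time.

    Assumption: each game begins with a line starting with '[Event '.
    """
    buf: list[str] = []
    for line in lines:
        if line.startswith("[Event ") and buf:
            game = "\n".join(buf).strip()
            if game:
                yield game
            buf = []
        buf.append(line.rstrip("\n"))
    game = "\n".join(buf).strip()
    if game:
        yield game


def every_nth(games: Iterable[str], n: int) -> Iterator[str]:
    """Keep games 0, n, 2n, ... and drop the rest."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for index, game in enumerate(games):
        if index % n == 0:
            yield game


def stride_for_month(total_games: int, target_games: int) -> int:
    """Pick n so that a month with total_games yields roughly target_games."""
    if target_games < 1:
        raise ValueError("target_games must be >= 1")
    return max(1, total_games // target_games)
```

Choosing `n` per month with `stride_for_month` is one way to get roughly the same number of games from a small early month and a much larger later one; a single fixed `n` would instead keep the later months dominant.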

PS: I'm happy to provide some compute for this project with my Google Colab Pro+ subscription :)

Owner

Thytu commented Jan 14, 2024

Hey @linux-leo,

Awesome to see your interest in the project! Just got back from a trip, so I'm catching up.
I appreciate your suggestions on improving the training dataset; you've got some great points there.

To add a few possible improvements:

  • Mixing up chess and language-oriented datasets sounds like a solid plan as the model tends to overproduce chess move tokens when trained only on strategic_game_chess and ChessInstruct.
  • Creating a chess-focused language dataset by scraping books and websites sounds cool, but time can be a bit of a buzzkill. If you spot any juicy data, feel free to toss it my way.
  • Lichess puzzles as an instructive task sounds like a good idea, as they require the model to produce the best possible move each time.
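
For the first point, mixing the two kinds of data could be sketched roughly as below. This is only an illustration under assumptions: examples are plain strings, `mix_datasets` and `chess_fraction` are hypothetical names, and the real training pipeline may mix data in a different way.

```python
import random
from typing import Iterable

_DONE = object()  # sentinel marking an exhausted source


def mix_datasets(
    chess: Iterable[str],
    language: Iterable[str],
    chess_fraction: float = 0.5,
    seed: int = 0,
) -> list[str]:
    """Interleave two example streams.

    Each step draws from the chess stream with probability chess_fraction,
    otherwise from the language stream. Mixing stops as soon as the chosen
    stream runs out, so the ratio stays close to the requested one.
    """
    if not 0.0 <= chess_fraction <= 1.0:
        raise ValueError("chess_fraction must be between 0 and 1")
    rng = random.Random(seed)
    chess_it, language_it = iter(chess), iter(language)
    mixed: list[str] = []
    while True:
        source = chess_it if rng.random() < chess_fraction else language_it
        item = next(source, _DONE)
        if item is _DONE:
            break
        mixed.append(item)
    return mixed
```

Lowering `chess_fraction` would be one knob to try if the model keeps overproducing chess move tokens.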

And your offer for compute power? Legendary!
I managed to get access to an H100, so we should be golden for now. Still, thanks a bunch for having my back :)

Feel free to drop more thoughts whenever they pop into your head. 🚀
