Suggestion: augment pre-training with lichess open database #5

linux-leo · 2023-12-27T16:32:35Z

See: https://database.lichess.org/#standard_games

Maybe use every nth game from the year 2013, before lichess grew in size, so that the dataset covers a more or less equal number of games per month while still spanning a long period, and so that fewer games need to be processed.
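A minimal sketch of the every-nth-game idea, assuming the monthly dump has already been decompressed to plain PGN text and that each game begins with an `[Event "..."]` tag line, as in the Lichess exports. The helper names `iter_games` and `every_nth_game` are hypothetical and not part of any existing tool:

```python
from typing import Iterable, Iterator, List


def iter_games(lines: Iterable[str]) -> Iterator[str]:
    """Yield one PGN game at a time, splitting on the '[Event ' tag line."""
    current: List[str] = []
    for line in lines:
        # Assumption: a line starting with '[Event ' opens the next game.
        if line.startswith("[Event ") and current:
            game = "".join(current).strip()
            if game:
                yield game
            current = []
        current.append(line)
    # Flush the last game in the stream, if any.
    game = "".join(current).strip()
    if game:
        yield game


def every_nth_game(lines: Iterable[str], n: int) -> Iterator[str]:
    """Keep games 0, n, 2n, ... from a stream of PGN lines."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for index, game in enumerate(iter_games(lines)):
        if index % n == 0:
            yield game
```

Because both functions are generators, a month of games can be streamed from an open text file without loading it into memory. To get a roughly equal number of games per month, `n` would probably have to be chosen separately for each monthly file, growing as the monthly game count grows.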

PS: I'm happy to provide some compute for this project with my Google Colab Pro+ subscription :)

Thytu · 2024-01-14T17:10:13Z

Hey @linux-leo,

Awesome to see your interest in the project! Just got back from a trip, so I'm catching up.
Appreciate your suggestions on improving the training dataset; you've got some great points there.

To add a few possible improvements:

Mixing up chess and language-oriented datasets sounds like a solid plan as the model tends to overproduce chess move tokens when trained only on strategic_game_chess and ChessInstruct.
Creating a chess-focused language dataset by scraping books and websites sounds cool, but time can be a bit of a buzzkill. If you spot any juicy data, feel free to toss it my way.
Lichess puzzles as an instructive task sounds like a good idea, as they require the model to produce the best possible move each time.
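The dataset-mixing point above could be sketched roughly as follows. This is only an illustration under stated assumptions: `mix_examples` and `language_fraction` are hypothetical names, the right ratio is unknown, and nothing here reflects how the project actually builds its training set.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def mix_examples(
    chess: Iterable[T],
    language: Iterable[T],
    language_fraction: float,
    seed: int = 0,
) -> Iterator[T]:
    """Interleave two example streams, drawing from `language` with the given probability.

    Stops as soon as the stream chosen for the next draw is exhausted.
    """
    if not 0.0 <= language_fraction <= 1.0:
        raise ValueError("language_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    chess_it, language_it = iter(chess), iter(language)
    while True:
        source = language_it if rng.random() < language_fraction else chess_it
        try:
            yield next(source)
        except StopIteration:
            return
```

Each source keeps its own internal order; only the interleaving is randomized. If the data is loaded with Hugging Face `datasets`, its `interleave_datasets` function with the `probabilities` argument should cover the same need.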

And your offer for compute power? Legendary!
I managed to get access to an H100, so we should be golden for now. Still, thanks a bunch for having my back :)

Feel free to drop more thoughts whenever they pop into your head. 🚀
