See puzzle_journey_data_collection_processing.ipynb for details.
The Lichess puzzle database was downloaded from https://database.lichess.org/#puzzles on March 22, 2023. The data came as a .csv file compressed with zstd. This file was decompressed from the command line and loaded into a pandas dataframe using .read_csv().
The data initially contained no headers, so we added headers in line with the database documentation.
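A minimal sketch of this loading step is shown here. The column names are assumptions modeled on the Lichess documentation and on the underscore naming used in this write-up (Opening_Tags, Puzzle_Length); the two sample rows are made up for illustration and are not real puzzles. The helper name load_puzzles is hypothetical.

```python
import io

import pandas as pd

# Assumed header names; adjust them to match the database documentation.
PUZZLE_COLUMNS = [
    "Puzzle_ID", "FEN", "Moves", "Rating", "Rating_Deviation",
    "Popularity", "Nb_Plays", "Themes", "Game_URL", "Opening_Tags",
]


def load_puzzles(source):
    """Read a headerless puzzle CSV and attach the column names above.

    `source` can be a path to the decompressed .csv or any file-like object.
    """
    return pd.read_csv(source, header=None, names=PUZZLE_COLUMNS)


# Made-up rows standing in for the decompressed file.
sample_csv = io.StringIO(
    "a0001,r6k/pp2r2p/4Rp1Q/3p4/8/1N1P2R1/PqP2bPP/7K b - - 0 24,"
    "f2g3 e6e7 b2b1 b3c1 b1c1 h6c1,1853,76,94,6405,"
    "crushing long middlegame,https://lichess.org/example1,\n"
    "a0002,rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1,"
    "e2e4 e7e5,1200,80,90,100,"
    "opening short,https://lichess.org/example2,Italian_Game\n"
)

puzzles = load_puzzles(sample_csv)
print(puzzles[["Puzzle_ID", "Moves", "Opening_Tags"]])
```

With the real data, the same function would be called on the path of the decompressed file instead of the in-memory sample.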
Checking for missing or null values, we found quite a few in the Opening_Tags column. This was to be expected, though: opening tags are only set for puzzles occurring before move 20, since a tactic occurring within the first 20 moves of a game likely has features strongly influenced by the opening played, whereas puzzles occurring later may not be as strongly influenced by the opening.
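The null check can be sketched as follows, assuming a dataframe named puzzles with the columns described above. The small frame built here is illustrative only.

```python
import pandas as pd

# Illustrative stand-in for the full puzzle dataframe.
puzzles = pd.DataFrame(
    {
        "Moves": ["e2e4 e7e5", "f2g3 e6e7 b2b1 b3c1"],
        "Opening_Tags": ["Italian_Game", None],
    }
)

# Count missing values per column; only Opening_Tags has any here.
missing_per_column = puzzles.isna().sum()
print(missing_per_column)
```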
Finally, we added a column for Puzzle_Length that counts the number of moves in the puzzle from its starting position. The first move in the Moves column sets up the position to present to the player, so Puzzle_Length is 1 less than the number of moves in Moves.
Note that the number of moves the player must make is actually half of the number of moves in the Moves column, or
$$\dfrac{\text{Puzzle Length} + 1}{2}.$$
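As a sketch of the two quantities above, assuming Moves holds space-separated moves: Puzzle_Length is the move count minus 1, and the number of moves the player makes is (Puzzle_Length + 1) / 2. The column name Player_Moves is hypothetical and used only for this illustration.

```python
import pandas as pd

# Illustrative move strings; the first move in each sets up the position.
puzzles = pd.DataFrame(
    {"Moves": ["f2g3 e6e7 b2b1 b3c1 b1c1 h6c1", "e2e4 e7e5"]}
)

# Number of moves in the Moves column, minus the setup move.
puzzles["Puzzle_Length"] = puzzles["Moves"].str.split().str.len() - 1

# Moves the player must find: half of all moves in the Moves column.
puzzles["Player_Moves"] = (puzzles["Puzzle_Length"] + 1) // 2

print(puzzles)
```

For the first row, 6 moves give a Puzzle_Length of 5 and 3 moves for the player.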
I used a personal token generated from https://lichess.org/account/oauth/token to access my puzzle activity from the Lichess API at https://lichess.org/api/puzzle/activity.
The response is an .ndjson file, i.e. a newline-delimited JSON file. I had trouble getting pandas to read this directly, so I first split the response text into a list of lines, to which I applied json.loads() to parse each element as a JSON object. Afterward, .json_normalize() was able to convert this list of JSON objects into a dataframe.
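The parsing steps can be sketched like this. The response_text string stands in for the body of the API response, and its field names (date, win, puzzle.id, puzzle.rating) are assumptions chosen for illustration; the real response may carry more fields.

```python
import json

import pandas as pd

# Stand-in for the API response body: one JSON object per line.
response_text = (
    '{"date": 1679443200000, "win": true, '
    '"puzzle": {"id": "a0001", "rating": 1853}}\n'
    '{"date": 1679446800000, "win": false, '
    '"puzzle": {"id": "a0002", "rating": 1200}}\n'
)

# Split into lines, skip blanks, and parse each line on its own.
records = [
    json.loads(line) for line in response_text.splitlines() if line.strip()
]

# Flatten nested objects into dotted column names such as "puzzle.id".
activity = pd.json_normalize(records)
print(activity.columns.tolist())
```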
Note that the date column is in 13-digit format, i.e. a Unix timestamp in milliseconds. I converted these to datetime.
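A short sketch of the conversion, assuming the 13-digit values are Unix timestamps in milliseconds. The sample values are illustrative.

```python
import pandas as pd

# Illustrative 13-digit timestamps (milliseconds since the Unix epoch).
activity = pd.DataFrame({"date": [1679443200000, 1679446800000]})

# unit="ms" tells pandas the integers count milliseconds, not nanoseconds.
activity["date"] = pd.to_datetime(activity["date"], unit="ms")
print(activity)
```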
My rating history was downloaded directly from https://lichess.org/api/user/tclark/rating-history as a .json file. My puzzle rating history was encoded as a list of points at index 13 of this file. This list was read into a dataframe as below.
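One way this step might look is sketched here; it is not necessarily the exact code from the notebook. It assumes the parsed file is a list of entries that each carry a "points" list of [year, month, day, rating] rows, and that the puzzle entry sits at index 13 as described. The placeholder entries and rating values are made up.

```python
import pandas as pd

# Stand-in for the parsed .json file: 13 placeholder entries, then puzzles.
rating_history = [
    {"name": f"placeholder_{i}", "points": []} for i in range(13)
]
rating_history.append(
    {"name": "Puzzles", "points": [[2023, 0, 15, 1500], [2023, 2, 22, 1650]]}
)

PUZZLE_INDEX = 13  # position of the puzzle entry, as stated in the text

# Assumed row layout: [year, month, day, rating].
puzzle_ratings = pd.DataFrame(
    rating_history[PUZZLE_INDEX]["points"],
    columns=["year", "month", "day", "rating"],
)
print(puzzle_ratings)
```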
The month column was zero-indexed, with January corresponding to 0, which I found strange, so I added 1 to each entry in month.
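The month shift can be sketched as follows. Building a single date column afterward is an optional extra step added here for illustration, not something described above; the sample rows are made up.

```python
import pandas as pd

# Illustrative rating points with zero-indexed months (0 = January).
puzzle_ratings = pd.DataFrame(
    {
        "year": [2023, 2023],
        "month": [0, 2],
        "day": [15, 22],
        "rating": [1500, 1650],
    }
)

# Shift months so that January is 1 and December is 12.
puzzle_ratings["month"] = puzzle_ratings["month"] + 1

# Optional: combine year, month, and day into one datetime column.
puzzle_ratings["date"] = pd.to_datetime(
    puzzle_ratings[["year", "month", "day"]]
)
print(puzzle_ratings)
```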