Data Collection and Processing Overview

See puzzle_journey_data_collection_processing.ipynb for details.

Lichess puzzle database

The Lichess puzzle database was downloaded from https://database.lichess.org/#puzzles on March 22, 2023. The download is a .csv file compressed with zstd. This file was decompressed on the command line and read into a pandas dataframe using .read_csv().

The data initially contained no headers, so we added headers in line with the database documentation.

Checking for missing or null values, we found quite a few in the Opening_Tags column.

This was to be expected, though: opening tags are only set for puzzles occurring before move 20. A tactic occurring within the first 20 moves of a game likely has features strongly influenced by the opening played, whereas puzzles occurring later may not be as strongly influenced by the opening.
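The header and missing-value steps above can be sketched roughly as follows. The column names are an assumption: they follow the column order in the Lichess documentation, restyled with underscores to match Opening_Tags, and may differ from the notebook. The two sample rows are made up and stand in for the decompressed, headerless file.

```python
import io

import pandas as pd

# Assumed column names (order per the Lichess documentation, underscore
# style chosen to match the Opening_Tags column mentioned in the text).
COLUMNS = [
    "Puzzle_Id", "FEN", "Moves", "Rating", "Rating_Deviation",
    "Popularity", "Nb_Plays", "Themes", "Game_Url", "Opening_Tags",
]

# Two made-up rows; the second has an empty Opening_Tags field.
sample = io.StringIO(
    "00008,FEN_A,e2e4 e7e5 g1f3 b8c6,1500,75,90,1000,fork short,https://lichess.org/abc,Italian_Game\n"
    "0000D,FEN_B,d2d4 d7d5,1200,80,85,500,mateIn1 oneMove,https://lichess.org/def,\n"
)

# header=None marks the first line as data; names supplies the headers.
puzzles = pd.read_csv(sample, header=None, names=COLUMNS)

# Count missing values per column.
missing = puzzles.isna().sum()
```

With the real file, the StringIO object would be replaced by the path to the decompressed .csv.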

Finally, we added a column for Puzzle_Length that counts the number of moves in the puzzle from its starting position. The first move in the Moves column sets up the position to present to the player, so Puzzle_Length is 1 less than the number of moves in Moves.

Note that the player makes only every other move in the Moves column, with the opponent making the rest, so the number of moves the player must make is half of the number of moves in the Moves column, or

$$\dfrac{\text{Puzzle Length} + 1}{2}.$$
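As a small illustration of the two counts, here is a sketch that assumes Moves holds space-separated moves. The sample values and the Player_Moves column name are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical Moves values; the first move in each string is the setup move.
puzzles = pd.DataFrame({"Moves": ["e2e4 e7e5 g1f3 b8c6", "d2d4 d7d5"]})

# Number of moves listed in each Moves string.
move_counts = puzzles["Moves"].str.split().str.len()

# Puzzle_Length excludes the setup move.
puzzles["Puzzle_Length"] = move_counts - 1

# Moves the player must make: (Puzzle_Length + 1) / 2.
puzzles["Player_Moves"] = (puzzles["Puzzle_Length"] + 1) // 2
```

For the first sample row, four listed moves give a Puzzle_Length of 3 and two moves for the player.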

My puzzle activity

I used a personal token generated from https://lichess.org/account/oauth/token to access my puzzle activity from the Lichess API at https://lichess.org/api/puzzle/activity.
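One way to make the authenticated request is sketched below using only the standard library. This is not necessarily how the notebook does it, and the function names are hypothetical. The token is sent as a Bearer token in the Authorization header.

```python
import urllib.request

ACTIVITY_URL = "https://lichess.org/api/puzzle/activity"


def build_activity_request(token: str) -> urllib.request.Request:
    """Build a request for the puzzle activity endpoint with the token attached."""
    return urllib.request.Request(
        ACTIVITY_URL,
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_activity_text(token: str) -> str:
    """Download the raw response body (needs network access and a valid token)."""
    with urllib.request.urlopen(build_activity_request(token)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_activity_text with a personal token would return the raw response text for further parsing.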

The response is an .ndjson file, i.e. newline-delimited JSON with one JSON object per line. I had trouble getting pandas to read this directly, so I first split the response text into a list of lines and applied json.loads() to parse each one. Afterward, pandas' json_normalize() was able to convert the resulting list of parsed objects into a dataframe.
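The parsing steps just described might look like the sketch below. The response text is made up, and its field names are illustrative rather than a statement of the API's schema.

```python
import json

import pandas as pd

# Made-up response text in newline-delimited JSON form.
response_text = (
    '{"date": 1679443200000, "win": true, "puzzle": {"id": "abc12", "rating": 1500}}\n'
    '{"date": 1679446800000, "win": false, "puzzle": {"id": "def34", "rating": 1620}}\n'
)

# Parse one JSON object per non-empty line.
records = [json.loads(line) for line in response_text.splitlines() if line.strip()]

# json_normalize flattens nested objects into dotted column names.
activity = pd.json_normalize(records)
```

Nested fields end up as columns such as puzzle.id and puzzle.rating.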

Note that the date column holds 13-digit Unix timestamps, i.e. milliseconds since the epoch. I converted these to datetime.
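A minimal sketch of the conversion, assuming the 13-digit values are milliseconds since the Unix epoch. The sample values are made up.

```python
import pandas as pd

# Made-up 13-digit timestamps (milliseconds since the Unix epoch).
activity = pd.DataFrame({"date": [1679443200000, 1679446800000]})

# unit="ms" tells pandas the integers are milliseconds rather than nanoseconds.
activity["date"] = pd.to_datetime(activity["date"], unit="ms")
```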

My puzzle rating history

My rating history was downloaded directly from https://lichess.org/api/user/tclark/rating-history as a .json file. My puzzle rating history was encoded as a list of points at index 13 of this file. This list was read into a dataframe.

The month column had January corresponding to 0, which I found strange, so I added 1 to each entry in month.
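A sketch of reading the points into a dataframe and shifting the month, using a made-up two-entry stand-in for the downloaded file. The "points" key, the [year, month, day, rating] row layout, and the column names are assumptions. PUZZLE_INDEX is 1 only because the stand-in is short; the real file used index 13.

```python
import pandas as pd

# Made-up stand-in for the downloaded JSON: one entry per rating category.
rating_history = [
    {"name": "Bullet", "points": []},
    {"name": "Puzzles", "points": [[2023, 0, 15, 1500], [2023, 2, 22, 1580]]},
]

# Position of the puzzle entry in this stand-in (13 in the real file).
PUZZLE_INDEX = 1

points = rating_history[PUZZLE_INDEX]["points"]
ratings = pd.DataFrame(points, columns=["year", "month", "day", "rating"])

# Months arrive zero-based (January = 0), so shift them to 1-12.
ratings["month"] += 1

# Optional extra, not described in the text: combine the parts into one date.
ratings["date"] = pd.to_datetime(ratings[["year", "month", "day"]])
```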
