Exploratory analysis on the Lichess puzzle database

See the Exploratory Analysis folder for more details.


Interactive version available here

Descriptive statistics

See lichess_db_puzzle_eda_descriptive.ipynb for more details.

We used pandas to read the lichess_db_puzzle_clean.csv file from puzzle_journey_data_collection_processing.ipynb as a dataframe called puzzles_df. We found that there are over 3 million puzzles in the database.

We then examined the descriptive statistics using puzzles_df.describe().
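As a minimal sketch of this step, the snippet below loads a tiny in-memory stand-in for the CSV and calls describe() on it. The three rows are invented for illustration, and the Nb_Plays column name is an assumption; only Rating, Rating_Deviation, Popularity, and Puzzle_Length are column names taken from this overview.

```python
import io

import pandas as pd

# Tiny stand-in for lichess_db_puzzle_clean.csv. The rows are made up,
# and the Nb_Plays column name is an assumption for illustration.
sample_csv = io.StringIO(
    "Rating,Rating_Deviation,Popularity,Nb_Plays,Puzzle_Length\n"
    "1514,76,93,1200,3\n"
    "545,90,60,0,1\n"
    "3212,110,85,45,5\n"
)

# With the real file this would be pd.read_csv("lichess_db_puzzle_clean.csv").
puzzles_df = pd.read_csv(sample_csv)

print(len(puzzles_df))        # number of puzzles (rows)
print(puzzles_df.describe())  # count, mean, std, min, quartiles, max per column
```

The min, 50%, and max rows of the describe() output are where figures such as the minimum, median, and maximum rating come from.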

Some interesting characteristics include the following.

The minimum puzzle rating is 545, while the maximum puzzle rating is 3,212.
The median puzzle rating is 1,514.
The median puzzle length is 3 moves (i.e. 2 moves made by the player).
The maximum puzzle length is 29 moves (i.e. 15 moves made by the player)!
There are puzzles in the database that have yet to be played.
Meanwhile, the maximum number of plays is over 1,000,000!
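The conversion from puzzle length to player moves used above can be sketched as follows. The helper name player_moves is hypothetical, and the formula assumes the player makes both the first and the last move of a puzzle, which is inferred from the figures quoted here (3 moves give 2 player moves, 29 give 15).

```python
def player_moves(puzzle_length: int) -> int:
    """Moves made by the player, assuming the player moves first and last."""
    # Ceiling division: an odd-length puzzle has one more player move
    # than opponent moves.
    return (puzzle_length + 1) // 2


for length in (1, 3, 29):
    print(length, player_moves(length))
```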

We also used the chess library to view some puzzles. For example, below is the puzzle with the highest rating.

Here is the puzzle with the highest number of plays.

Finally, we examined the feature correlations.

Perhaps unsurprisingly, Puzzle_Length and Rating are moderately positively correlated (i.e. longer puzzles tend to be more difficult).
There is a weak negative correlation between Popularity and Rating_Deviation. This may indicate that popular puzzles tend to have lower rating deviation.

The second point seems less straightforward than the first. However, think of Rating_Deviation as measuring the predictability of a puzzle's difficulty: a puzzle with low rating deviation is performing at a relatively stable rating (i.e. difficulty), while a puzzle with high rating deviation has a relatively unstable rating. Meanwhile, Popularity essentially measures whether a puzzle is meeting the users' expectations. In this interpretation, it makes sense that a more popular puzzle is performing as expected and hence would have a more predictable level of difficulty.
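For readers who want to reproduce this kind of check, here is a small sketch of a correlation matrix computed with pandas. The data frame is a toy example with invented values, so the correlations it prints are far stronger than the moderate and weak ones reported for the real database; only the signs are meant to match.

```python
import pandas as pd

# Invented values chosen so that the signs of the two correlations
# discussed above are visible; this is not data from the database.
toy_df = pd.DataFrame(
    {
        "Rating": [900, 1200, 1500, 1800, 2400],
        "Puzzle_Length": [1, 3, 3, 5, 7],
        "Popularity": [95, 90, 80, 70, 60],
        "Rating_Deviation": [74, 75, 80, 85, 100],
    }
)

corr = toy_df.corr()  # pairwise Pearson correlation by default

print(corr.loc["Puzzle_Length", "Rating"])        # positive in this toy data
print(corr.loc["Popularity", "Rating_Deviation"]) # negative in this toy data
```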

Distributions

See lichess_db_puzzle_eda_distributions.ipynb for more details.

We used ggplot to visualize the distributions of puzzle rating, rating deviation, popularity, number of plays, puzzle length, themes, and opening tags.

Puzzle rating

The distribution of ratings is unimodal and fairly symmetric. Below is a boxplot.

Rating deviation

The distribution of rating deviation is unimodal and right-skewed. Most puzzles have a rating deviation of less than 100. Below is a boxplot—note the potential outliers.

Popularity

The distribution of popularity is unimodal and left-skewed. Most puzzles are rather popular, suggesting there are few puzzles that users find to be inaccurate, poorly designed, or otherwise unfair. Below is a boxplot—note the potential outliers.

Number of plays

The distribution of number of plays is unimodal and right-skewed. There are comparatively few puzzles with more than 2,000 plays. Below is a boxplot—note the potential outliers and small interquartile range.
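The "potential outliers" in these boxplots can be flagged numerically. The sketch below assumes the common convention that a point is a potential outlier when it lies more than 1.5 times the interquartile range beyond the quartiles; whether the notebook's plots use exactly this rule is an assumption. The helper name iqr_outliers and the play counts are invented for illustration.

```python
import statistics


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Made-up play counts: most are small, one is very large.
plays = [10, 12, 15, 18, 20, 22, 25, 30, 5000]
print(iqr_outliers(plays))
```

With a right-skewed variable and a small interquartile range, as described for number of plays, the upper fence sits low and many points can end up flagged.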

Puzzle length

Puzzle length ranges from 1 to 29 (i.e. from 1 to 15 player moves), though the distribution is right-skewed. The most common puzzle length is 3, which corresponds to 2 moves made by the player. Below is a boxplot; note the potential outliers occurring around lengths 9 to 11.

Themes

There are 60 distinct themes (not including the healthyMix and playerGames themes). The 5 most frequently occurring themes are as follows.

short
middlegame
crushing
endgame
advantage

It is interesting to recall here that these puzzles are generated by user games on lichess.org. So, be on the lookout for these sorts of tactics in your games!
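A theme tally like the one above might be computed along these lines. The sketch assumes each puzzle stores its themes as a single space-separated string, which may not match the cleaned CSV exactly, and the four sample strings are invented.

```python
from collections import Counter

# Invented examples; the space-separated layout is an assumption
# about how themes are stored per puzzle.
theme_strings = [
    "crushing middlegame short",
    "advantage endgame short",
    "mate mateIn1 middlegame oneMove",
    "crushing endgame long",
]

# Split each string into individual themes and count every occurrence.
theme_counts = Counter(
    theme for themes in theme_strings for theme in themes.split()
)

print(len(theme_counts))            # number of distinct themes
print(theme_counts.most_common(3))  # most frequent themes with their counts
```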

Opening tags

There are over 100 opening tags, not including variations! The most common opening in the puzzle database is the Sicilian_Defense—watch out for early tactics in your Sicilian games!

Puzzle rating

See lichess_db_puzzle_eda_rating.ipynb for more details.

The strongest feature correlation to puzzle rating was with puzzle length, visualized below.

The median puzzle rating increases as puzzle length increases.
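The trend can also be checked in tabular form by grouping on puzzle length and taking the median rating of each group. This is a sketch on invented data, not the notebook's code.

```python
import pandas as pd

# Invented ratings for three puzzle lengths.
toy_df = pd.DataFrame(
    {
        "Puzzle_Length": [1, 1, 3, 3, 3, 5, 5],
        "Rating": [800, 1000, 1400, 1500, 1700, 1900, 2300],
    }
)

# One median rating per distinct puzzle length, indexed by length.
median_by_length = toy_df.groupby("Puzzle_Length")["Rating"].median()
print(median_by_length)
```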

We investigated the relationship between puzzle theme and puzzle ratings, as well.

The themes with median puzzle rating greater than 2,000 are as follows.

castling
quietMove
veryLong
underPromotion
mateIn5
enPassant
defensiveMove
zugzwang

These are all themes that involve either relatively long puzzles or moves that are subtle, rare, or otherwise non-routine.
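One way to obtain such a list is to give each (puzzle, theme) pair its own row and then take the median rating per theme. The sketch below assumes a Themes column holding space-separated theme names, which is an assumption about the cleaned data, and uses invented ratings.

```python
import pandas as pd

# Invented puzzles; the Themes column layout is an assumption.
toy_df = pd.DataFrame(
    {
        "Rating": [2300, 2100, 1200, 900, 2500],
        "Themes": [
            "quietMove veryLong",
            "quietMove",
            "backRankMate short",
            "backRankMate short",
            "veryLong",
        ],
    }
)

# Split the theme string into a list, then explode to one row per theme.
per_theme = toy_df.assign(Theme=toy_df["Themes"].str.split()).explode("Theme")

median_by_theme = (
    per_theme.groupby("Theme")["Rating"].median().sort_values(ascending=False)
)
hard_themes = median_by_theme[median_by_theme > 2000]
print(hard_themes)
```

Because a puzzle can carry several themes, its rating contributes to the median of every theme it is tagged with.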

Meanwhile, backRankMate has the lowest median rating; these puzzles are generally considered to be quite easy.

Finally, we looked at opening tag as it relates to puzzle rating.

The median rating is pretty consistent across openings. The Zukertort_Defense and Amar_Gambit are the only openings with a median rating over 2,000, while the Borg_Opening is the only opening with a median rating under 1,000.

Rating deviation

See lichess_db_puzzle_eda_deviation.ipynb for more details.

Rating deviation is very consistent across most features in the database. Two things that stood out from our investigations here are the following.

The theme equality has a much higher median rating deviation than the other themes.

The opening with the highest median rating deviation is the Norwegian_Defense.

Popularity

See lichess_db_puzzle_eda_popularity.ipynb for more details.

After the relationship between Puzzle_Length and Rating, the next strongest correlation was between Popularity and Rating_Deviation, visualized below.

Observe that as popularity increases, the median rating deviation decreases.

The other features seemed to have little impact on popularity; in particular, median popularity is fairly stable when compared across themes.

Notably, the Queens_Pawn_Mengarini_Attack is the opening with lowest median popularity.
