See the Exploratory Analysis folder for more details.
See lichess_db_puzzle_eda_descriptive.ipynb for more details.
We used pandas to read the lichess_db_puzzle_clean.csv file from puzzle_journey_data_collection_processing.ipynb as a dataframe called puzzles_df. We found that there are over 3 million puzzles in the database.
We then examined the descriptive statistics using puzzles_df.describe().
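The loading and summary step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: a tiny in-memory CSV with made-up rows stands in for lichess_db_puzzle_clean.csv, and the column name Nb_Plays is an assumption (Rating and Puzzle_Length are the names used in this README).

```python
import io

import pandas as pd

# Stand-in for lichess_db_puzzle_clean.csv: three made-up rows.
# Nb_Plays is an assumed column name; the real file may differ.
sample_csv = io.StringIO(
    "Rating,Puzzle_Length,Nb_Plays\n"
    "1200,3,150\n"
    "1500,5,0\n"
    "2100,7,4000\n"
)

puzzles_df = pd.read_csv(sample_csv)

# Number of puzzles (rows) in the dataframe.
n_puzzles = len(puzzles_df)

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = puzzles_df.describe()

print(n_puzzles)
print(summary.loc[["min", "50%", "max"]])
```

The "50%" row of the summary is the median, which is where figures such as the median rating and median puzzle length would come from.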
Some interesting characteristics include the following.
- The minimum puzzle rating is 545, while the maximum puzzle rating is 3,212.
- The median puzzle rating is 1,514.
- The median puzzle length is 3 moves (i.e. 2 moves made by the player).
- The maximum puzzle length is 29 moves (i.e. 15 moves made by the player)!
- There are puzzles in the database that have yet to be played.
- Meanwhile, the maximum number of plays is over 1,000,000!
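The correspondence between puzzle length and player moves quoted above (3 moves means 2 player moves, 29 moves means 15) is consistent with the player making both the first and the last move of an alternating sequence. Here is a minimal sketch under that assumption; the helper name player_moves is hypothetical and does not come from the notebooks.

```python
def player_moves(puzzle_length: int) -> int:
    """Number of moves the player makes in a puzzle of the given total length.

    Assumes the player makes moves 1, 3, 5, ... of the sequence,
    i.e. both the first and the last move.
    """
    if puzzle_length < 1:
        raise ValueError("puzzle_length must be at least 1")
    # Half the length, rounded up.
    return (puzzle_length + 1) // 2


for length in (1, 3, 29):
    print(length, player_moves(length))
```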
We also used chess to view some puzzles. For example, below is the puzzle with the highest rating.
Here is the puzzle with the highest number of plays.
Finally, we examined the feature correlations.
- Perhaps unsurprisingly, Puzzle_Length and Rating are moderately positively correlated (i.e. longer puzzles tend to be more difficult).
- There is a weak negative correlation between Popularity and Rating_Deviation. This may indicate that popular puzzles tend to have lower rating deviation.
The second point seems less straightforward than the first. However, think of Rating_Deviation as measuring the predictability of a puzzle's difficulty: a puzzle with low rating deviation is performing at a relatively stable rating (i.e. difficulty), while a puzzle with high rating deviation has a relatively unstable rating. Meanwhile, Popularity essentially measures whether a puzzle is meeting the users' expectations. In this interpretation, it makes sense that a more popular puzzle is performing as expected, and hence would have a more predictable level of difficulty.
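One way to obtain feature correlations like these is pandas' DataFrame.corr, which computes pairwise Pearson correlations by default. The sketch below uses made-up rows with the column names from this README; whether the notebook used the default method is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the puzzle database.
puzzles_df = pd.DataFrame(
    {
        "Rating": [900, 1200, 1500, 1800, 2400],
        "Rating_Deviation": [75, 80, 76, 90, 110],
        "Popularity": [95, 90, 92, 80, 60],
        "Puzzle_Length": [1, 3, 3, 5, 9],
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = puzzles_df.corr()

print(corr.round(2))
```

Reading corr.loc["Puzzle_Length", "Rating"] and corr.loc["Popularity", "Rating_Deviation"] gives the two coefficients discussed above (for the toy data, not the real values).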
See lichess_db_puzzle_eda_distributions.ipynb for more details.
We used ggplot to visualize the distributions of puzzle rating, rating deviation, popularity, number of plays, puzzle length, themes, and opening tags.
The distribution of ratings is unimodal and fairly symmetric. Below is a boxplot.
The distribution of rating deviation is unimodal and right-skewed. Most puzzles have a rating deviation of less than 100. Below is a boxplot; note the potential outliers.
The distribution of popularity is unimodal and left-skewed. Most puzzles are rather popular, suggesting there are few puzzles that users find to be inaccurate, poorly designed, or otherwise unfair. Below is a boxplot—note the potential outliers.
The distribution of number of plays is unimodal and right-skewed. There are comparatively few puzzles with more than 2,000 plays. Below is a boxplot; note the potential outliers and small interquartile range.
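The "potential outliers" in a boxplot are usually the points beyond 1.5 times the interquartile range from the quartiles. The following sketch applies that common convention to made-up play counts; the series name Nb_Plays is an assumed column name, and the exact rule used by the plotting library in the notebook is not confirmed here.

```python
import pandas as pd

# Made-up play counts with one value far above the rest.
plays = pd.Series([10, 20, 30, 40, 50, 60, 70, 5000], name="Nb_Plays")

q1 = plays.quantile(0.25)
q3 = plays.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Fences of the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as potential outliers.
outliers = plays[(plays < lower_fence) | (plays > upper_fence)]

# Positive sample skewness indicates a right-skewed distribution.
skewness = plays.skew()

print(iqr, outliers.tolist(), skewness > 0)
```

A small interquartile range combined with a long right tail, as described for the number of plays, yields many points above the upper fence.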
Puzzle length ranges from 1 to 29 (i.e. from 1 to 15 player moves), though the distribution is right-skewed. The most common puzzle length is 3, which corresponds to 2 moves made by the player. Below is a boxplot; note the potential outliers occurring around lengths 9 to 11.
There are 60 distinct themes (not including the healthyMix and playerGames themes). The 5 most frequently occurring themes are as follows.
- short
- middlegame
- crushing
- endgame
- advantage
It is interesting to recall here that these puzzles are generated by user games on lichess.org. So, be on the lookout for these sorts of tactics in your games!
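Counting theme frequencies can be sketched as below. The sketch assumes a Themes column holding space-separated theme names for each puzzle; that layout and the rows themselves are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows; a space-separated Themes column is an assumption.
puzzles_df = pd.DataFrame(
    {
        "Themes": [
            "short middlegame crushing",
            "endgame short advantage",
            "middlegame short mateIn2",
        ]
    }
)

# Split each string into a list, give every theme its own row, then count.
theme_counts = (
    puzzles_df["Themes"]
    .str.split()
    .explode()
    .value_counts()
)

# value_counts sorts in descending order, so head(5) gives the top five.
top_themes = theme_counts.head(5)

print(top_themes)
```

The same pattern would apply to opening tags if they are stored in a similar space-separated column.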
There are over 100 opening tags, not including variations! The most common opening in the puzzle database is the Sicilian_Defense, so watch out for early tactics in your Sicilian games!
See lichess_db_puzzle_eda_rating.ipynb for more details.
The strongest feature correlation to puzzle rating was with puzzle length, visualized below.
The median puzzle rating increases as puzzle length increases.
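A trend like this can be checked numerically by grouping on puzzle length and taking the median rating in each group. This is a minimal sketch on made-up rows, using the column names from this README.

```python
import pandas as pd

# Made-up rows for illustration only.
puzzles_df = pd.DataFrame(
    {
        "Puzzle_Length": [1, 1, 3, 3, 3, 5, 5],
        "Rating": [800, 1000, 1300, 1500, 1700, 1900, 2100],
    }
)

# Median rating for each distinct puzzle length, indexed by length.
median_by_length = puzzles_df.groupby("Puzzle_Length")["Rating"].median()

print(median_by_length)
print(median_by_length.is_monotonic_increasing)
```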
We also investigated the relationship between puzzle theme and puzzle rating.
The themes with a median puzzle rating greater than 2,000 are as follows.
- castling
- quietMove
- veryLong
- underPromotion
- mateIn5
- enPassant
- defensiveMove
- zugzwang
These are all themes that involve either relatively long puzzles or moves that are subtle, rare, or otherwise non-routine.
Meanwhile, backRankMate has the lowest median rating; these puzzles are generally considered to be quite easy.
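Because a puzzle can carry several themes, a per-theme median needs one row per (puzzle, theme) pair before grouping. The sketch below assumes a space-separated Themes column and uses made-up ratings, so the themes it singles out are illustrative only.

```python
import pandas as pd

# Made-up rows; the space-separated Themes column is an assumption.
puzzles_df = pd.DataFrame(
    {
        "Rating": [2300, 2100, 700, 900, 1500],
        "Themes": [
            "quietMove veryLong",
            "quietMove endgame",
            "mateIn1 short",
            "mateIn1 endgame",
            "short middlegame",
        ],
    }
)

# One row per (puzzle, theme) pair.
per_theme = puzzles_df.assign(
    Theme=puzzles_df["Themes"].str.split()
).explode("Theme")

# Median rating of the puzzles tagged with each theme.
median_by_theme = per_theme.groupby("Theme")["Rating"].median()

# Themes above the 2,000 threshold, and the theme with the lowest median.
hard_themes = median_by_theme[median_by_theme > 2000]
easiest_theme = median_by_theme.idxmin()

print(hard_themes)
print(easiest_theme)
```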
Finally, we looked at opening tag as it relates to puzzle rating.
The median rating is fairly consistent across openings. The Zukertort_Defense and Amar_Gambit are the only openings with a median rating over 2,000, while the Borg_Opening is the only opening with a median rating under 1,000.
See lichess_db_puzzle_eda_deviation.ipynb for more details.
Rating deviation is very consistent across most features in the database. Two things that stood out from our investigations here are the following.
- The theme equality has a much higher median rating deviation than the other themes.
- The opening with the highest median rating deviation is the Norwegian_Defense.
See lichess_db_puzzle_eda_popularity.ipynb for more details.
Aside from the relationship between Puzzle_Length and Rating, the next strongest correlation was between Popularity and Rating_Deviation, visualized below.
Observe that as popularity increases, the median rating deviation decreases.
The other features seemed to have little impact on popularity—notably, median popularity is fairly stable when compared across themes.
The Queens_Pawn_Mengarini_Attack is the opening with the lowest median popularity.