analysis chess dask data-mining databases decision-tree decision-tree-classifier jupyter-notebook k-means-clustering lichess linear-regression machine-learning multiprocessing pandas patterns random-forest random-forest-classifier seaborn sklearn stockfish

Machine Learning to Study Patterns in Chess Games

A final year project for the University of Exeter, using data mining and machine learning to understand patterns in chess games over the scale of millions of games.

Ranked 1st in the cohort for undergraduate projects, with a final grade of 85%.

Supervisor: Chico Camargo

Project Description

Demonstration Video

Chess, one of the oldest and most popular board games, has recently seen large-scale exploration driven by online platforms like Lichess and Chess.com. This study uses data mining and machine learning techniques to analyse millions of games from the Lichess Open Database to uncover patterns and insights into how people play chess. We created a data pipeline for efficient processing, explored data features and distribution, and performed feature engineering. We implemented classification models to predict game outcomes, a regression model to investigate opening popularity and game outcomes, and k-means clustering to group openings by game outcomes and mean differences in their variations. Our models and clusters were analysed for their usefulness and evaluated for their ability to provide insights into chess game patterns. Through this evaluation, we assessed their success, limitations, and potential improvements. This study demonstrates the potential of data mining and machine learning techniques in uncovering patterns and insights in chess, contributing to growing research. By understanding how people play chess, we can develop better tools, strategies, and educational resources, enhancing fairness and enjoyment for players worldwide.

Data Pipeline

Installation

Python Versions

This project has been developed and tested on Python 3.11, but it should work on Python 3.8 onwards. We recommend running it on Python 3.11, which is on average about 25% faster than Python 3.10 on the standard CPython benchmark suite.
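To confirm which interpreter is active before installing the dependencies, a small check like the one below can help. It is only a sketch: the 3.8 floor and the 3.11 recommendation come from the paragraph above, while the function name and the three labels are illustrative and not part of the project.

```python
import sys

MINIMUM = (3, 8)       # oldest version the project is expected to work on
RECOMMENDED = (3, 11)  # version the project was developed and tested on


def describe_python(version=None):
    """Return 'unsupported', 'supported', or 'recommended' for a version."""
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MINIMUM:
        return "unsupported"
    if major_minor < RECOMMENDED:
        return "supported"
    return "recommended"


print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {describe_python()}")
```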

Installing Dependencies

Ensure that you're in the root directory: machine-learning-in-chess
Install the Python dependencies: pip install -r requirements.txt

External Libraries

pgn-extract

pgn-extract is a command-line tool written in C that is used to manipulate chess databases (formatted as PGN files) with millions of games.

Compiling pgn-extract

Before we can use pgn-extract, we must compile it from the source code:

Download the source code from the pgn-extract website.
Open the terminal and navigate to the source code directory for pgn-extract.
Compile the program by running: make
pgn-extract should now be compiled and ready to use via the pgn-extract file in the source code directory.

Splitting a PGN File with pgn-extract

We will use pgn-extract to split a PGN file into smaller PGN files, each holding up to a fixed number of games, and keep the first six. This enables us to take a sample of games from the larger PGN file, and then process the smaller PGN files in parallel.

Open the terminal and navigate to the source code directory for pgn-extract.
Run the following command, where <NUM_GAMES> is the maximum number of games per PGN file output and <PGN_FILE> is the absolute file path to the PGN file to split:
```
./pgn-extract -#<NUM_GAMES> "<PGN_FILE>"
```
For example:
```
./pgn-extract -#1000000 "/Users/isaac/Downloads/lichess_db_standard_rated_2022-01.pgn"
```
The split PGN files will be saved as 1.pgn, 2.pgn, 3.pgn, and so on in the same directory as pgn-extract. We will only use the first six files that are output (1.pgn to 6.pgn).
Rename the PGN files to the original PGN file name, but with a suffix _<NUM>.pgn indicating the number of the PGN file, so that they work with convert_pgn_to_parquet.py later (e.g. lichess_db_standard_rated_2022-01_1.pgn, lichess_db_standard_rated_2022-01_2.pgn, lichess_db_standard_rated_2022-01_3.pgn, ..., lichess_db_standard_rated_2022-01_6.pgn).
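The renaming step can be scripted rather than done by hand. The sketch below assumes the split files are named 1.pgn to 6.pgn and follows the example file names above (original name, underscore, file number). The function name and its parameters are hypothetical and not part of the project.

```python
from pathlib import Path


def rename_split_files(split_dir, original_pgn_name, num_files=6):
    """Rename 1.pgn..N.pgn to <original stem>_<n>.pgn and return the new paths."""
    split_dir = Path(split_dir)
    stem = Path(original_pgn_name).stem  # e.g. lichess_db_standard_rated_2022-01
    renamed = []
    for n in range(1, num_files + 1):
        source = split_dir / f"{n}.pgn"
        if not source.exists():
            # Fewer split files than expected: stop rather than guess.
            break
        target = split_dir / f"{stem}_{n}.pgn"
        source.rename(target)
        renamed.append(target)
    return renamed
```

For example, `rename_split_files(".", "lichess_db_standard_rated_2022-01.pgn")` run from the pgn-extract directory would rename the first six split files and leave any further ones (7.pgn onwards) untouched.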

Usage and Reproducing the Results

Analysis of Provided Data Set

For convenience, we have pre-processed a data set of 40,121,728 standard rated games on Lichess from January 2022 to December 2022 (inclusive) and provided it in the resources/lichess_db_standard_rated_2022 directory. This is also the default data set used in the Jupyter Notebook where we perform the analysis.

To run the analysis on the provided data set, run the code in analyse_chess_data.ipynb.

Analysis of Different Data Sets

If you would like to use a different data set from the Lichess Open Database, you can follow these steps to reproduce the results:

Download a data set from the Lichess Open Database.
Decompress the data into a PGN file (.pgn) – instructions are provided on the Lichess Open Database page under the 'Decompress .zst' heading.
Split the PGN file into six smaller PGN files, each containing up to a specified number of games (e.g. 1,000,000) with pgn-extract.
- See the section above for detailed instructions: Splitting a PGN File with pgn-extract
Extract the game metadata from PGN files to a CSV file and a folder of Parquet files by running convert_pgn_to_parquet.py and providing the name of the original PGN file (before it was split, e.g. lichess_db_standard_rated_2022-01.pgn).
- This will output a folder of Parquet files, as well as a CSV file that contains all the data (e.g. lichess_db_standard_rated_2022-01.pgn for the folder of Parquet files and lichess_db_standard_rated_2022-01.csv).
(Optional) If you want to use data sets from multiple months (like in our study), merge the CSV files from the previous step into a single CSV file and a folder of Parquet files by running merge_csv_files.py and providing the paths of the CSV files to merge.
In analyse_csv_data.ipynb, change the DATA_PATH variable to the path of the directory containing the Parquet files, and then run the notebook.
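To illustrate what extracting game metadata from a PGN file involves, here is a minimal standard-library sketch that reads the tag pairs (such as `[Result "1-0"]`) of each game and writes chosen columns to CSV. It is not the implementation in convert_pgn_to_parquet.py: the function names and the column selection are assumptions, and it does not cover Parquet output or parallel processing.

```python
import csv
import io
import re

# A PGN tag pair looks like: [WhiteElo "1500"]
TAG_PATTERN = re.compile(r'^\[(\w+) "(.*)"\]$')


def iter_game_headers(lines):
    """Yield one dict of PGN tag pairs per game, ignoring the movetext."""
    headers, in_movetext = {}, False
    for raw in lines:
        line = raw.strip()
        match = TAG_PATTERN.match(line)
        if match:
            if in_movetext and headers:
                # A tag line after movetext marks the start of the next game.
                yield headers
                headers = {}
            in_movetext = False
            headers[match.group(1)] = match.group(2)
        elif line:
            in_movetext = True
    if headers:
        yield headers


def headers_to_csv(lines, columns):
    """Return a CSV string holding the chosen header columns for every game."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for game in iter_game_headers(lines):
        writer.writerow(game)  # tags missing from a game are left empty
    return buffer.getvalue()


# Made-up two-game sample in the PGN layout (headers, blank line, movetext).
sample = """[Event "Rated Blitz game"]
[Result "1-0"]
[WhiteElo "1500"]

1. e4 e5 2. Nf3 1-0

[Event "Rated Bullet game"]
[Result "0-1"]
[WhiteElo "2000"]

1. d4 d5 0-1
"""
print(headers_to_csv(sample.splitlines(), ["Event", "Result", "WhiteElo"]))
```

Because the function takes any iterable of lines, an open file handle can be passed directly, so a large PGN file is streamed rather than loaded into memory. The sketch relies on movetext separating games, so a game with headers but no moves would be merged into the next one.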

Manual Data Exploration

If you want to manually explore the data, we have provided a program to convert the CSV output to an SQLite3 file (.db). This makes it easy to perform queries and sorting, as the CSV file outputs may be too large to view directly.

Run convert_csv_to_sqlite3.py.
Enter the path to the CSV file to convert (e.g. lichess_db_standard_rated_2022-01.csv).
View the output SQLite3 file (e.g. lichess_db_standard_rated_2022-01.db) in the database browser of your choice (e.g. DB Browser for SQLite).
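The resulting database can also be queried from Python with the standard sqlite3 module. The sketch below loads a tiny made-up CSV into an in-memory database and counts games per opening. The table name `games`, the column names, and the loader function are assumptions for illustration; the schema produced by convert_csv_to_sqlite3.py may differ.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(csv_file, connection, table="games"):
    """Create `table` from a CSV file object, one untyped column per header."""
    reader = csv.reader(csv_file)
    columns = next(reader)  # first row holds the column names
    quoted = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f'CREATE TABLE "{table}" ({quoted})')
    connection.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    connection.commit()


# Tiny in-memory demonstration with made-up rows.
demo_csv = io.StringIO(
    "Opening,Result,WhiteElo\n"
    "Sicilian Defense,1-0,1500\n"
    "Sicilian Defense,0-1,1620\n"
    "Italian Game,1-0,1410\n"
)
connection = sqlite3.connect(":memory:")
load_csv_into_sqlite(demo_csv, connection)

# Count games per opening, most common first.
rows = connection.execute(
    'SELECT "Opening", COUNT(*) AS games FROM games '
    'GROUP BY "Opening" ORDER BY games DESC'
).fetchall()
print(rows)
```

To query a file produced by the conversion script instead, pass its path to `sqlite3.connect` in place of `":memory:"` and skip the loading step.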

Future Work

Scoutfish

Scoutfish is a tool written in C++ that is used to query chess databases (formatted as PGN files) with very high speed.

Compiling Scoutfish

Download the source code from the Scoutfish GitHub repository.
Open the terminal and navigate to the src directory in the source code for Scoutfish: cd src
Compile the program by running: make build ARCH=x86-64
Scoutfish should now be compiled and ready to use via the scoutfish file in the src directory.

Creating a Scoutfish Index

Before Scoutfish can be used to query a chess database, we must first create a Scoutfish index for that database:

Open the terminal and navigate to the src directory in the source code for Scoutfish: cd src
Run the following command, where <PGN_FILE> is the absolute file path to the PGN:
```
./scoutfish make "<PGN_FILE>"
```
The Scoutfish index will be created in the same directory as the PGN file as a .scout file (e.g. the index for lichess_db_standard_rated_2022-01.pgn will be saved as lichess_db_standard_rated_2022-01.scout).

We can use the Scoutfish index to perform various queries. Further information and examples can be found on the Scoutfish GitHub repository.
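As a rough sketch of how such a query could be driven from Python, the helper below builds the argument list for a Scoutfish query. The `scout` subcommand and the `sub-fen` query key are taken from our reading of the Scoutfish README and should be checked against the repository. The helper name is hypothetical, and the code only builds the command without running it.

```python
import json


def build_scout_command(index_path, query, executable="./scoutfish"):
    """Return the argument list for a Scoutfish `scout` query (assumed syntax)."""
    # The query is serialised to JSON and passed after the index path.
    return [executable, "scout", str(index_path), json.dumps(query)]


command = build_scout_command(
    "lichess_db_standard_rated_2022-01.scout",
    {"sub-fen": "8/8/p7/8/8/1B3N2/8/8"},  # example position pattern
)
print(command)
```

The list can then be passed to `subprocess.run(command, capture_output=True, text=True)` from the src directory. Passing the arguments as a list avoids shell quoting problems with the JSON query.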

About

A final year project for the University of Exeter, using machine learning to study patterns in millions of chess games (~350 GB). Ranked 1st in the cohort for undergraduate projects (85%).

Languages

Jupyter Notebook 62.4%, C++ 17.6%, C 14.6%, HTML 2.9%, Makefile 1.3%, Python 0.6%, Shell 0.4%, CSS 0.1%, Roff 0.0%