franciscoBSalgueiro / en-croissant

The Ultimate Chess Toolkit

Home Page: https://encroissant.org

Request option to detect doubles when games are included in other games.

Jonathan003 opened this issue

Describe the feature

[Image: Scid duplicate detection example]
This is an example where Scid finds a duplicated game because one game is included in another game.
At the moment en-croissant doesn't detect these doubles.
Scid only detects this kind of double when the player names are identical.
I also want an option to detect these doubles when the player names are different or when other information, like the event, is different. Then I want en-croissant to keep the better game (longer game, more recent, higher Elo, longer time control, etc.).
Maybe they are technically not exact doubles. But I don't see how it can be useful to keep games with exactly the same moves and exactly the same results together in a database.
Doubles where one game is included in another occur quite often in human tournaments that use DGT boards: sometimes the feed picks up an extra move, or a database feed like TWIC publishes a game one week and a corrected version the next.
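The "keep the better game" preference described above could be expressed as an ordered comparison rather than a single number. The following is only a sketch under assumptions: `GameInfo`, `preference_key` and `better` are hypothetical names, and the priority order (length, then date, then Elo, then time control) simply follows the order in which the criteria are listed above.

```python
from datetime import date
from typing import NamedTuple, Tuple


class GameInfo(NamedTuple):
    """Hypothetical summary of one game, already extracted from its PGN headers."""
    plies: int          # number of half-moves in the main line
    played: date        # date the game was played
    average_elo: float  # mean of the two player ratings, 0 if unknown
    time_class: int     # assumed encoding: 1 = bullet, 2 = blitz, 3 = slower


def preference_key(info: GameInfo) -> Tuple[int, date, float, int]:
    # Tuples compare element by element, so a later field is only
    # consulted when all earlier fields are equal.
    return (info.plies, info.played, info.average_elo, info.time_class)


def better(a: GameInfo, b: GameInfo) -> GameInfo:
    """Return the preferred game; ties go to the first argument."""
    return a if preference_key(a) >= preference_key(b) else b
```

With this ordering a game that has one extra move always wins over its shorter copy, and the remaining criteria only break ties between games of equal length.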

I have made a Python script for myself to delete these duplicates, with the help of Microsoft Copilot Pro.
I'm not a programmer, so there may be mistakes in the script (please let me know if you find any).

Here is the readme.txt of the script:
Remove_duplicates_and_prefixes.py

This script removes duplicate games from a given input PGN file and writes the unique games to an output PGN file named duplicates_removed.pgn.

Steps to use:

  1. Save the script in a directory on your computer.
  2. Open a terminal or command prompt.
  3. Navigate to the directory where you saved the script using the cd command.
  4. Run the script by typing python remove_duplicates_and_prefixes.py input.pgn and pressing Enter, where input.pgn is the PGN file to process.

Here is the script:

import sys
from datetime import datetime

import chess.pgn


def parse_elo(value):
    # Elo tags are often missing, empty or "?"; treat those as 0.
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def parse_date(value):
    # PGN dates may contain "??" placeholders; fall back to a very old date.
    try:
        return datetime.strptime(value, '%Y.%m.%d')
    except (TypeError, ValueError):
        return datetime(1800, 1, 1)


def get_quality_score(game):
    # Higher is better: higher rating, more recent, slower time control, more moves.
    rating_weight = 0.4
    recency_weight = 0.2
    time_format_weight = 0.2
    length_weight = 0.2
    white_rating = parse_elo(game.headers.get('WhiteElo', '0'))
    black_rating = parse_elo(game.headers.get('BlackElo', '0'))
    average_rating = (white_rating + black_rating) / 2
    date = parse_date(game.headers.get('Date', '1800.01.01'))
    recency = (datetime.now() - date).days  # age of the game in days
    event = game.headers.get('Event', '').lower()
    # The time control is guessed from the event name.
    time_format_score = 1 if 'bullet' in event else 2 if 'blitz' in event else 3
    length = len(list(game.mainline_moves()))  # number of half-moves
    quality_score = (
        (rating_weight * average_rating)
        - (recency_weight * recency)
        + (time_format_weight * time_format_score)
        + (length_weight * length)
    )
    return quality_score


def remove_duplicates_and_prefixes(input_pgn_path, output_pgn_path):
    # Games are keyed by the text of their main line, so only games with
    # exactly the same moves are treated as duplicates of each other.
    games = {}
    with open(input_pgn_path) as pgn:
        while True:
            game = chess.pgn.read_game(pgn)
            if game is None:
                break
            moves = str(game.mainline_moves())
            if moves not in games:
                games[moves] = game
            elif get_quality_score(game) > get_quality_score(games[moves]):
                # Same moves seen before: keep whichever copy scores higher.
                games[moves] = game

    with open(output_pgn_path, "w") as output_pgn:
        for game in games.values():
            output_pgn.write(str(game))
            output_pgn.write("\n\n")


def main():
    # Use the path given on the command line, or input.pgn in the current directory.
    input_pgn_path = sys.argv[1] if len(sys.argv) > 1 else "input.pgn"
    output_pgn_path = "duplicates_removed.pgn"
    try:
        remove_duplicates_and_prefixes(input_pgn_path, output_pgn_path)
    except FileNotFoundError as e:
        print(f"File not found: {e.filename}")
    except Exception as e:
        print(f"An error occurred: {e}")


if __name__ == "__main__":
    main()
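
Because the script uses the full move text as the dictionary key, a game that is only the first part of a longer game gets a different key and is kept. One possible way to drop such included games is sketched below. It works on plain tuples of move strings so that it runs without any chess library; with python-chess these tuples could probably be built as tuple(move.uci() for move in game.mainline_moves()). The names `is_prefix` and `maximal_games` are hypothetical, and the sketch ignores headers and results.

```python
from typing import Iterable, List, Sequence, Tuple

Moves = Tuple[str, ...]


def is_prefix(short: Sequence[str], long: Sequence[str]) -> bool:
    """Return True if `short` equals the first len(short) moves of `long`."""
    return len(short) <= len(long) and tuple(long[:len(short)]) == tuple(short)


def maximal_games(move_lists: Iterable[Sequence[str]]) -> List[Moves]:
    """Drop exact duplicates and every game that is the start of a longer game."""
    # Converting to a set removes exact duplicates. After sorting, all
    # extensions of a sequence come directly after it, so comparing each
    # sequence with its immediate successor is enough.
    unique = sorted({tuple(moves) for moves in move_lists})
    kept: List[Moves] = []
    for index, moves in enumerate(unique):
        is_last = index + 1 == len(unique)
        if not is_last and is_prefix(moves, unique[index + 1]):
            continue  # included in the next (longer) game
        kept.append(moves)
    return kept


if __name__ == "__main__":
    games = [
        ["e4", "e5", "Nf3"],
        ["e4", "e5"],          # included in the first game
        ["d4", "d5"],
        ["e4", "e5", "Nf3"],   # exact duplicate
    ]
    for moves in maximal_games(games):
        print(" ".join(moves))
```

Sorting keeps the check to one comparison per game instead of comparing every pair, which matters for large databases.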

It would be nice if similar functionality to detect these duplicates could be added to en-croissant.