Request option to detect doubles when games are included in other games.
Jonathan003 opened this issue · comments
Describe the feature
This is an example where Scid finds a duplicated game because one game is included in another game.
At the moment en-croissant doesn't detect these doubles.
Scid only detects this kind of double when the player names are identical.
I would also like an option to detect these doubles when the player names differ, or when other information such as the event differs. Then I want en-croissant to keep the better game (longer game, more recent, higher Elo, longer time control, etc.).
Technically these may not be exact doubles, but I don't see how it can be useful to keep games with exactly the same moves and exactly the same result together in a database.
Doubles where one game is included in another happen quite often in human tournaments that use DGT boards: sometimes the feed gets an extra move, or a database feed like TWIC publishes a game one week and provides a correction the next week.
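To make the "keep the better game" idea concrete, here is a minimal sketch of one possible rule. It assumes each game has already been summarised into a small record; the `GameInfo` class, its fields, and the priority order are hypothetical and are not taken from en-croissant or Scid.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class GameInfo:
    # Hypothetical summary of one game (not an en-croissant or Scid type).
    move_count: int
    played_on: date
    average_elo: float
    base_seconds: int  # base time of the time control, 0 if unknown


def preference_key(info: GameInfo) -> tuple:
    # Tuples compare left to right, so earlier fields take priority:
    # longer game, then more recent, then higher Elo, then longer time control.
    return (info.move_count, info.played_on, info.average_elo, info.base_seconds)


def pick_better(a: GameInfo, b: GameInfo) -> GameInfo:
    # Returns the preferred game; keeps `a` when both keys are equal.
    return a if preference_key(a) >= preference_key(b) else b
```

A strict priority order like this avoids having to balance values on very different scales (Elo points, days, move counts), but the order of the fields is only an assumption and could be made configurable.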
I have made a Python script for myself to delete these duplicates, with the help of Microsoft Copilot Pro.
I'm not a programmer, so there could be mistakes in the script (please let me know if you find any).
Here is the readme.txt of the script:
remove_duplicates_and_prefixes.py
This script removes duplicate games from a given input PGN file and writes the unique games to an output PGN file named duplicates_removed.pgn.
Steps to use:
- Save the script in a directory on your computer.
- Open a terminal or command prompt.
- Navigate to the directory where you saved the script using the cd command.
- Run the script by typing python remove_duplicates_and_prefixes.py input.pgn and pressing Enter, where input.pgn is the PGN file to process. The result is written to duplicates_removed.pgn in the current directory.
Here is the script:
import sys
from datetime import datetime

import chess.pgn


def parse_elo(value):
    # Elo headers can be missing, empty, or "?"; treat those as 0.
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def get_quality_score(game):
    # Weighted score used to pick which copy of a duplicated game to keep.
    rating_weight = 0.4
    recency_weight = 0.2
    time_format_weight = 0.2
    length_weight = 0.2

    white_rating = parse_elo(game.headers.get('WhiteElo'))
    black_rating = parse_elo(game.headers.get('BlackElo'))
    average_rating = (white_rating + black_rating) / 2

    # PGN dates are often partial or unknown ("2024.??.??"); fall back to 1800.01.01.
    try:
        date = datetime.strptime(game.headers.get('Date', ''), '%Y.%m.%d')
    except ValueError:
        date = datetime(1800, 1, 1)
    recency = (datetime.now() - date).days  # age in days; older games score lower

    # The time control is guessed from the event name only.
    event = game.headers.get('Event', '').lower()
    time_format_score = 1 if 'bullet' in event else 2 if 'blitz' in event else 3

    length = len(list(game.mainline_moves()))

    return (rating_weight * average_rating
            - recency_weight * recency
            + time_format_weight * time_format_score
            + length_weight * length)


def remove_duplicates_and_prefixes(input_pgn_path, output_pgn_path):
    # Note: games are keyed on their full mainline, so only games with identical
    # moves are merged; a game that is a prefix of a longer game is still kept.
    games = {}
    with open(input_pgn_path) as pgn:
        while True:
            game = chess.pgn.read_game(pgn)
            if game is None:
                break
            moves = str(game.mainline_moves())
            # Keep the first copy unless a later one has a higher quality score.
            if moves not in games or get_quality_score(game) > get_quality_score(games[moves]):
                games[moves] = game

    with open(output_pgn_path, "w") as output_pgn:
        for game in games.values():
            output_pgn.write(str(game))
            output_pgn.write("\n\n")


def main():
    try:
        # Use the first command-line argument if given, otherwise input.pgn.
        input_pgn_path = sys.argv[1] if len(sys.argv) > 1 else "input.pgn"
        output_pgn_path = "duplicates_removed.pgn"
        remove_duplicates_and_prefixes(input_pgn_path, output_pgn_path)
    except FileNotFoundError as e:
        print(f"File not found: {e.filename}")
    except Exception as e:
        print(f"An error occurred: {e}")


if __name__ == "__main__":
    main()
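The script above keys each game on its complete move text, so it only merges games whose moves are identical. A game that is merely the beginning of a longer game is not caught. Below is a minimal sketch of how that containment check could work, assuming each game has already been reduced to a list of move strings (with python-chess, for example, `[m.uci() for m in game.mainline_moves()]`). The function name `contained_game_indices` is hypothetical, and the sketch ignores results, player names, and custom starting positions.

```python
from typing import Sequence, Set


def contained_game_indices(games: Sequence[Sequence[str]]) -> Set[int]:
    """Return indices of games whose moves are a prefix of another game's moves.

    Exact duplicates count as contained as well; the earliest copy is kept
    unless a longer game extends it. A game with no moves is a prefix of
    every other game and is therefore flagged too.
    """
    # Sort indices by move list. A prefix sorts directly before the games
    # that extend it, so comparing each game with its neighbour is enough.
    # The -i tie-break puts the earliest copy of identical games last.
    order = sorted(range(len(games)), key=lambda i: (tuple(games[i]), -i))
    contained = set()
    for pos in range(len(order) - 1):
        current = tuple(games[order[pos]])
        following = tuple(games[order[pos + 1]])
        if following[:len(current)] == current:
            contained.add(order[pos])
    return contained


games = [
    ["e4", "e5", "Nf3"],  # 0: kept
    ["e4", "e5"],         # 1: prefix of game 0
    ["d4", "d5"],         # 2: kept
    ["e4", "e5", "Nf3"],  # 3: exact duplicate of game 0
    ["e4", "c5"],         # 4: kept
]
print(sorted(contained_game_indices(games)))  # [1, 3]
```

This version always discards the shorter game. Combining it with a quality score, or requiring the same result before treating two games as doubles, would be a separate step.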
It would be nice if similar functionality to detect these duplicates could be added to en-croissant.