This is the implementation of "Effective and Efficient Sports Play Retrieval with Deep Representation Learning" (KDD 2019).
- Linux Ubuntu OS (16.04 is tested)
- Python (3.6 is tested)
- Tensorflow-GPU (1.8.0 is tested)
Refer to the source code for the required Python packages, such as matplotlib. You can install a package with conda from a shell, for example:
conda install matplotlib
The dataset is real-world soccer player tracking data collected by STATS. Download the dataset by requesting it from STATS Artificial Intelligence, and put the compressed *.kpl files into ./SoccerData. Note that the intermediate results generated by the algorithm will also be saved in this folder.
The training kpl file contains around 7500 sequences. Each sequence consists of tracking data for three parts: 11 defending players, 11 attacking players, and the ball. Each of these has two fields, the horizontal and vertical coordinates, sampled at 10 Hz. If you want to test your own data, follow the same format. More details can be found in the official data instruction.
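If you are preparing your own data, the layout described above could be represented as follows. This is only a minimal sketch: it assumes each sequence is a NumPy array with one row per frame and 46 columns (23 tracked objects times two coordinates). The column order and the helper name `split_sequence` are hypothetical and should be checked against the official data instruction.

```python
import numpy as np

N_DEFENSE = 11    # defending players
N_ATTACK = 11     # attacking players
SAMPLING_HZ = 10  # frames per second, as stated above

def split_sequence(seq):
    """Split a (T, 46) array into defense, attack and ball coordinates.

    Assumed column order (hypothetical): defense x/y pairs,
    then attack x/y pairs, then the ball x/y pair.
    """
    seq = np.asarray(seq, dtype=float)
    if seq.ndim != 2 or seq.shape[1] != 2 * (N_DEFENSE + N_ATTACK + 1):
        raise ValueError("expected shape (T, 46), got %r" % (seq.shape,))
    frames = seq.shape[0]
    defense = seq[:, : 2 * N_DEFENSE].reshape(frames, N_DEFENSE, 2)
    attack = seq[:, 2 * N_DEFENSE : 2 * (N_DEFENSE + N_ATTACK)].reshape(frames, N_ATTACK, 2)
    ball = seq[:, -2:]
    return defense, attack, ball

# A synthetic 5-second play: 50 frames at 10 Hz.
demo = np.random.RandomState(0).rand(5 * SAMPLING_HZ, 46)
defense, attack, ball = split_sequence(demo)
print(defense.shape, attack.shape, ball.shape)
```

Keeping players and ball in separate arrays makes it easy to process or plot each group on its own.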
First, you can visualize the sports data by running preprocess.py, or play a specified segment with viz.py. Note that these two scripts rely on the matplotlib and turtle packages in Python 3.6.
Run ogm.py. It splits each play into a sequence of non-overlapping segments with a fixed duration and maps the coordinates to grids.
python3 ogm.py
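The idea behind this step can be sketched as below. This is not the code in ogm.py: the pitch size, cell size, segment length, and the function names `split_into_segments` and `to_grid` are all assumptions chosen for illustration.

```python
import numpy as np

def split_into_segments(play, frames_per_segment):
    """Cut a play into non-overlapping segments; a trailing remainder is dropped."""
    n_segments = len(play) // frames_per_segment
    return [play[i * frames_per_segment:(i + 1) * frames_per_segment]
            for i in range(n_segments)]

def to_grid(points, pitch_size, cell_size):
    """Mark every grid cell visited by at least one (x, y) point.

    points: array of shape (..., 2); coordinates outside the pitch are
    clipped to the nearest border cell.
    Returns a binary matrix of shape (rows, cols).
    """
    width, height = pitch_size
    cols = int(np.ceil(width / cell_size))
    rows = int(np.ceil(height / cell_size))
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    col_idx = np.clip((pts[:, 0] // cell_size).astype(int), 0, cols - 1)
    row_idx = np.clip((pts[:, 1] // cell_size).astype(int), 0, rows - 1)
    grid = np.zeros((rows, cols), dtype=np.uint8)
    grid[row_idx, col_idx] = 1
    return grid

# One object moving along a diagonal of a hypothetical 100 x 50 pitch, 25 frames.
t = np.linspace(0.0, 1.0, 25)
play = np.stack([100.0 * t, 50.0 * t], axis=1)
segments = split_into_segments(play, frames_per_segment=10)
grids = [to_grid(seg, pitch_size=(100.0, 50.0), cell_size=10.0) for seg in segments]
print(len(segments), grids[0].shape)
```

A coarser cell size makes segments more likely to map to the same matrix, while a finer one preserves more spatial detail.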
After this step, generate the two corrupted versions of the data, one with added noise and one with dropped points.
python3 corrupted_noise.py
python3 corrupted_drop.py
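As a rough illustration of the two kinds of corruption, the sketch below adds Gaussian noise to coordinates and randomly drops frames. The noise distribution, the drop rate, and the function names are assumptions; the scripts above may corrupt the data differently.

```python
import numpy as np

def add_noise(points, sigma, rng):
    """Return a copy of the coordinates perturbed by zero-mean Gaussian noise."""
    points = np.asarray(points, dtype=float)
    return points + rng.normal(loc=0.0, scale=sigma, size=points.shape)

def drop_points(points, drop_rate, rng):
    """Remove each frame independently with probability drop_rate.

    The first frame is always kept so the result is never empty.
    """
    points = np.asarray(points, dtype=float)
    keep = rng.random_sample(len(points)) >= drop_rate
    keep[0] = True
    return points[keep]

rng = np.random.RandomState(42)
clean = np.stack([np.linspace(0, 10, 20), np.linspace(0, 5, 20)], axis=1)
noisy = add_noise(clean, sigma=0.5, rng=rng)        # hypothetical noise level
dropped = drop_points(clean, drop_rate=0.3, rng=rng)  # hypothetical drop rate
print(noisy.shape, dropped.shape)
```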
Then, build a sports corpus by calculating the Jaccard index, which measures the similarity between two segment matrices.
python3 building.py
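For reference, the Jaccard index of two binary matrices is the number of cells set in both divided by the number of cells set in either. A minimal sketch follows; the function name `jaccard_index` and the handling of two empty matrices are choices made here, not taken from building.py.

```python
import numpy as np

def jaccard_index(a, b):
    """Jaccard index (intersection over union) of the non-zero cells of two matrices."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    if a.shape != b.shape:
        raise ValueError("matrices must have the same shape")
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # convention chosen here: two empty matrices count as identical
    return float(np.logical_and(a, b).sum()) / float(union)

seg_a = np.array([[1, 1, 0], [0, 0, 0]])
seg_b = np.array([[0, 1, 1], [0, 0, 0]])
print(jaccard_index(seg_a, seg_b))  # one shared cell out of three occupied cells
```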
Run embedding.py. It learns a distributed representation for each segment token under the Skip-Gram model. During training, we visualize the effect of the representations on the validation data and plot the high-dimensional segment embeddings using t-SNE.
python3 embedding.py
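Under Skip-Gram, each token is trained to predict its neighbours within a context window. The sketch below shows how such (target, context) pairs are typically formed, assuming a play is a list of segment tokens. The token names, the window size, and the function `skipgram_pairs` are hypothetical and not taken from embedding.py.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for each token and its neighbours within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# A play written as a sequence of hypothetical segment tokens.
play_tokens = ["s12", "s7", "s7", "s31"]
pairs = skipgram_pairs(play_tokens, window=1)
print(pairs)
```

Segments that often appear next to the same neighbours end up with similar embeddings, which is what the t-SNE plot is meant to reveal.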
The embedding matrix will be saved in ./SoccerData after you finish the above step. Finally, we use the DSED model to combine the distributed representations of the play segments into a single representation of each play.
python3 dae.py
We provide a simple Top-K task for testing in estimate.py. The vector representations of the plays are obtained as follows:
represent = get_result(source_int, embed_mat, source_letter_to_int, path1)[0]  # the vector representations of plays
python3 estimate.py
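A Top-K query over play vectors can be sketched as follows. This assumes the representations form a 2-D array with one row per play and uses Euclidean distance as the similarity measure. Both are assumptions, and `top_k` is a hypothetical helper, not a function from estimate.py.

```python
import numpy as np

def top_k(query, vectors, k):
    """Return the indices of the k rows of `vectors` closest to `query` in Euclidean distance."""
    vectors = np.asarray(vectors, dtype=float)
    dists = np.linalg.norm(vectors - np.asarray(query, dtype=float), axis=1)
    k = min(k, len(vectors))
    # A stable sort keeps tied plays in index order.
    return np.argsort(dists, kind="mergesort")[:k]

# Four toy play vectors and one query vector.
plays = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
nearest = top_k(query=[0.9, 0.1], vectors=plays, k=2)
print(nearest)
```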
Please cite our paper if you find this code useful:
@inproceedings{kdd19wz,
author = {Wang, Zheng and Long, Cheng and Cong, Gao and Ju, Ce},
title = {Effective and Efficient Sports Play Retrieval with Deep Representation Learning},
booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
year = {2019},
organization = {ACM}
}