play2vec

This is the implementation of "Effective and Efficient Sports Play Retrieval with Deep Representation Learning" (KDD 2019).

Requirements

Linux Ubuntu OS (16.04 is tested)
Python (3.6 is tested)
TensorFlow-GPU (1.8.0 is tested)

Check the import statements in the source code for the other required Python packages, such as matplotlib. You can install them with conda from a shell, for example:

conda install matplotlib

Dataset

The dataset is real-world soccer player tracking data collected by STATS. To obtain it, send a request to STATS Artificial Intelligence, then put the compressed *.kpl files into ./SoccerData. Note that the intermediate results generated by the algorithm are also saved in this folder.

Data format

The training kpl file contains around 7500 sequences. Each sequence consists of tracking data for three parts: 11 defense players, 11 attacking players, and the ball. Each of these has two fields, the horizontal and vertical coordinates, sampled at 10 Hz. If you want to test your own data, please follow the same format. More details can be found in the official data instructions.

Running Procedures

Data Visualization

First, you can visualize the sports data by running preprocess.py, or play back a specified segment with viz.py. These two scripts rely on the matplotlib and turtle packages (tested with Python 3.6).

Building Corpus

Run ogm.py. It splits each play into a sequence of non-overlapping segments with a fixed duration and maps the coordinates to grids.

python3 ogm.py
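
To illustrate the idea behind this step, here is a minimal sketch of splitting a play into fixed-length, non-overlapping segments and mapping coordinates to grid cells. It is not the code in ogm.py. The function names, the field size (105 x 68), the cell size (3.0), the origin at a field corner, and the choice to drop a trailing partial segment are all assumptions made for illustration.

```python
import math

def to_grid_cell(x, y, field_w=105.0, field_h=68.0, cell=3.0):
    """Map a coordinate to (row, col) grid indices.

    field_w, field_h and cell are hypothetical values, and the origin is
    assumed to be a field corner. Out-of-range points are clamped.
    """
    n_cols = math.ceil(field_w / cell)
    n_rows = math.ceil(field_h / cell)
    col = min(max(int(x // cell), 0), n_cols - 1)
    row = min(max(int(y // cell), 0), n_rows - 1)
    return row, col

def split_segments(frames, seg_len):
    """Split a list of frames into non-overlapping segments of seg_len frames.

    A trailing partial segment is dropped (an assumption of this sketch).
    """
    return [frames[i:i + seg_len]
            for i in range(0, len(frames) - seg_len + 1, seg_len)]

def segment_to_cells(segment):
    """Collect the set of grid cells occupied during one segment.

    Each frame is assumed to be a list of (x, y) tuples.
    """
    cells = set()
    for frame in segment:
        for x, y in frame:
            cells.add(to_grid_cell(x, y))
    return cells
```

With 10 Hz sampling, a one-second segment would correspond to seg_len=10 in this sketch.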

After this step, generate the two corrupted versions of the data: one with added noise and one with dropped points.

python3 corrupted_noise.py

python3 corrupted_drop.py
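
As a rough picture of what the two corruptions might look like, the sketch below perturbs points with random noise and randomly drops points. This is an illustration only, not the logic of corrupted_noise.py or corrupted_drop.py: the Gaussian noise model, the sigma and drop_rate parameters, and both function names are assumptions.

```python
import random

def add_noise(points, sigma=1.0, rng=None):
    """Return a copy of points with zero-mean Gaussian noise added.

    The Gaussian model and sigma are assumptions of this sketch.
    """
    rng = rng or random.Random(0)
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in points]

def drop_points(points, drop_rate=0.2, rng=None):
    """Return points with each one removed independently with drop_rate.

    At least one point is kept so the result is never empty
    (a choice made for this sketch).
    """
    rng = rng or random.Random(0)
    kept = [p for p in points if rng.random() >= drop_rate]
    return kept or points[:1]
```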

Then build a sports corpus by computing the Jaccard index, which measures the similarity between two segment matrices.

python3 building.py
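
For reference, the Jaccard index of two binary matrices is the number of cells set in both divided by the number of cells set in either. The sketch below shows that computation. It assumes the segment matrices are equal-shaped lists of 0/1 rows, which may differ from the representation used in building.py, and the function name is hypothetical.

```python
def jaccard_index(a, b):
    """Jaccard index of two equal-shaped binary matrices (lists of 0/1 rows).

    Returns 1.0 when both matrices are all zeros (a convention chosen here).
    """
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            if va and vb:
                inter += 1
            if va or vb:
                union += 1
    return inter / union if union else 1.0
```

For example, jaccard_index([[1, 0], [1, 1]], [[1, 1], [0, 1]]) gives 0.5, because two cells are occupied in both matrices and four cells are occupied in at least one.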

Training

Run embedding.py. It learns a distributed representation for each segment token under the Skip-Gram model. During training, we evaluate the representations on the validation data and plot the high-dimensional segment embeddings using t-SNE.

python3 embedding.py

The embedding matrix will be saved in ./SoccerData after you finish the above step. Finally, we use the DSED model to combine the distributed representations of the play segments.

python3 dae.py

Testing

We provide a simple Top-K task for testing in estimate.py. The line below obtains the vector representations of the plays:

represent = get_result(source_int, embed_mat, source_letter_to_int, path1)[0] #the vector representations of plays

python3 estimate.py
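
Once each play has a vector representation, a Top-K query can be answered by ranking candidate plays by their distance to the query vector. The sketch below shows one way to do this. It is not the code in estimate.py: the use of Euclidean distance and the names euclidean and top_k are assumptions.

```python
import heapq
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def top_k(query, candidates, k):
    """Return the indices of the k candidates closest to query.

    Ties are broken by the lower index.
    """
    dists = [(euclidean(query, c), i) for i, c in enumerate(candidates)]
    return [i for _, i in heapq.nsmallest(k, dists)]
```

For example, top_k([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 0.5]], 2) returns [2, 1], since the distances are 5.0, 1.0 and 0.5.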

Citing play2vec

Please cite our paper if you find this code useful:

@inproceedings{kdd19wz,
  author    = {Wang, Zheng and Long, Cheng and Cong, Gao and Ju, Ce},
  title     = {Effective and Efficient Sports Play Retrieval with Deep Representation Learning},
  booktitle = {Proceedings of the 25th ACM SIGKDD international conference on Knowledge discovery and data mining},
  year      = {2019},
  organization = {ACM}
}
