autoscout

Football (soccer) scouting via publicly available data.

Usage

Set up the repository and a virtual environment, then install the requirements:

$ git clone https://github.com/olliestanley/autoscout.git
$ cd autoscout
$ python -m venv venv
$ source venv/bin/activate
$ python -m pip install -qr requirements.txt

Getting Data

Download Premier League 2021-22 outfield player data from fbref via CLI:

$ python scripts/data/download_fbref_aggregate.py --competition eng1 --season 2022 --type outfield

Download La Liga current season team data from fbref (append --vs to get data against the team):

$ python scripts/data/download_fbref_aggregate.py --competition spa1 --season current --type team

Download Frenkie de Jong 2021-22 player match-by-match data from fbref:

$ python scripts/data/download_fbref_match.py --dataset frenkie_de_jong --season 2022

Download Manchester United 2022-23 team match-by-match data from fbref (append --vs to get data against the team):

$ python scripts/data/download_fbref_match.py --dataset manchester_united --season 2023

Edit config/fbref/matches.json to add extra players or teams to the available list. Note that building a dataset covering a large number of players and/or teams may require significant effort, as each entity has a unique identifier which you must obtain. In future it may be possible to scrape an ID-to-player/team mapping, but this is not currently supported.

Load data into a Pandas or Polars DataFrame:

from autoscout import util
# Specify format="polars" for a Polars DataFrame
df = util.load_csv("data/fbref/eng1/2022/outfield.csv", format="pandas")

Combine DataFrames to create a single dataset, such as from multiple competitions or multiple seasons of the same competition:

from autoscout import preprocess

combined = preprocess.combine_data((df_1, df_2))
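
To illustrate what a combined dataset looks like, the sketch below stacks two small frames using plain pandas. It is not the implementation of preprocess.combine_data, whose internals are not described here; the column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical per-season frames; the columns are illustrative only.
df_1 = pd.DataFrame({"player": ["A", "B"], "goals": [3, 5], "season": [2021, 2021]})
df_2 = pd.DataFrame({"player": ["A", "C"], "goals": [4, 1], "season": [2022, 2022]})

# Stack the rows of both frames and rebuild a clean 0..n-1 index.
combined_sketch = pd.concat([df_1, df_2], ignore_index=True)
```

Keeping a column such as season in each frame before combining makes it possible to tell the rows apart afterwards.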

Creating Visualisations

Plot a Midfielder radar chart, based on a loaded df:

from autoscout import util
from autoscout.vis import radar

midfield_config = util.load_json("config/radar/midfield.json")
rdr, fig, ax = radar.plot_radar_from_config(df, midfield_config, "Fred")

Radar configurations can be customised and modified by editing the .json files in config/radar. It is also possible to plot radars without a .json configuration file using radar.plot_radar(...).

Plot a rolling xG for and against chart for a team, with dashed trend lines and shading of the gap between xG For and xG Against, using a loaded team match-by-match df:

from autoscout import preprocess
from autoscout.vis import chart

df = preprocess.rolling(df, ["xg_for", "xg_against"])
df["n"] = df.index

plot = chart.lines(
    df, ["n", "n"], ["xg_for_roll_mean", "xg_against_roll_mean"],
    colors=["green", "red"], legend_labels=["xG For", "xG Against"],
    trends=True, vshade=(0, 1), title="10 game rolling average xG",
    x_axis_label="Date", y_axis_label="xG"
)
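
For intuition about the rolling columns plotted above, here is a minimal pandas sketch of a rolling mean. It assumes a 10-match window (taken from the chart title) and is not the implementation of preprocess.rolling; the xG values are invented.

```python
import pandas as pd

# Invented match-by-match xG values, for illustration only.
matches = pd.DataFrame({
    "xg_for": [1.2, 0.8, 2.1, 1.5, 0.9, 1.7, 1.1, 2.4, 0.6, 1.3, 1.9, 1.0],
    "xg_against": [0.7, 1.4, 0.9, 1.1, 1.8, 0.5, 1.2, 0.8, 1.6, 1.0, 0.4, 1.3],
})

window = 10  # assumed window size, taken from the chart title
for col in ["xg_for", "xg_against"]:
    # Each value is the mean of the current match and the previous window - 1
    # matches; earlier rows are NaN because the window is not yet full.
    matches[f"{col}_roll_mean"] = matches[col].rolling(window).mean()
```

With a full window required, the first nine rows of each rolling column are empty, so a chart built this way only starts at the tenth match.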

Searching Data

Find the 6 players in the dataset most similar to Paul Pogba across the statistics listed in columns, after applying a per 90 adjustment to normalize the data:

from autoscout import preprocess, search

columns = ["goals", "npxg", "assists", "xa"]
df = preprocess.adjust_per_90(df, columns)
similar_df = search.search_similar(df, columns, "Paul Pogba", num=6)
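
The idea behind these two steps can be sketched with pandas and NumPy. The sketch assumes a hypothetical minutes column and uses Euclidean distance on standardised columns as the similarity measure; the metric actually used by search.search_similar is not stated here, so treat both as assumptions. All values are invented.

```python
import numpy as np
import pandas as pd

# Invented season totals; "minutes" is a hypothetical column name.
players = pd.DataFrame({
    "player": ["P1", "P2", "P3", "P4"],
    "minutes": [1800, 900, 2700, 1800],
    "goals": [10, 5, 3, 9],
    "assists": [4, 2, 9, 5],
})
columns = ["goals", "assists"]

# Per 90 adjustment: scale each total to a rate per 90 minutes played.
per_90 = players[columns].mul(90 / players["minutes"], axis=0)

# Standardise each column so no single statistic dominates the distance.
z = (per_90 - per_90.mean()) / per_90.std(ddof=0)

# Euclidean distance from every player to the target (an assumed metric).
target = z[players["player"] == "P1"].iloc[0]
players["distance"] = np.sqrt(((z - target) ** 2).sum(axis=1))

# Closest players first, with the target itself excluded.
most_similar = players[players["player"] != "P1"].sort_values("distance")
```

P2 has half the minutes and half the totals of P1, so the two have identical per 90 rates. This is why the adjustment matters before comparing players with different playing time.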

Filter a team dataset to contain only teams which have scored at least 50 goals and have exactly 19 players used:

from autoscout import util, search

criteria = {
    "gte": { "goals": 50.0 },
    "eq": { "players_used": 19.0 }
}

df_teams = util.load_csv("data/fbref/eng1/2022/team_for.csv")
matching_df = search.search(df_teams, criteria)
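
As a rough sketch of how a criteria dictionary like the one above could be applied, the example below maps each key to a comparison and combines the results into one row mask. The helper filter_by_criteria and the OPERATORS mapping are hypothetical names written for this illustration, not part of autoscout, and the team data is invented.

```python
import operator

import pandas as pd

# Map criteria keys to comparison functions. "gte" and "eq" appear in the
# example above; this mapping itself is an illustrative assumption.
OPERATORS = {"gte": operator.ge, "eq": operator.eq}


def filter_by_criteria(df, criteria):
    """Keep only the rows that satisfy every condition in criteria."""
    mask = pd.Series(True, index=df.index)
    for op_name, conditions in criteria.items():
        for column, value in conditions.items():
            # Combine conditions with logical AND.
            mask &= OPERATORS[op_name](df[column], value)
    return df[mask]


teams = pd.DataFrame({
    "team": ["T1", "T2", "T3"],
    "goals": [61, 48, 55],
    "players_used": [19, 19, 24],
})
criteria = {"gte": {"goals": 50.0}, "eq": {"players_used": 19.0}}
matching = filter_by_criteria(teams, criteria)
```

Only T1 passes both conditions: T2 has too few goals and T3 has used too many players.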

Analysing Data

Create stylistic ratings for all players or teams in a dataset from a loaded df, based on pre-existing configuration:

from autoscout import analyse, util

ratings_config = util.load_json("config/rating_inputs.json")
df = analyse.estimate_style_ratings(df, ratings_config)

df["progress_rating"]

Ratings based on custom-defined sets of statistics can be computed by adding sections to rating_inputs.json.

Reduce the dimensionality of 4 columns of a dataset df into a single column. This is used by estimate_style_ratings() to derive stylistic ratings from raw statistics, but may be useful for other purposes.

from autoscout import analyse

columns = ["goals", "assists", "xg", "xa"]
df["ga_rating"] = analyse.reduce_dimensions(df, columns, reducer=1)

A custom reducer from scikit-learn can be passed to reduce_dimensions(); alternatively, pass an integer giving the number of output dimensions. This defaults to 1 if no value is specified.
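
For readers unfamiliar with dimensionality reduction, the sketch below collapses four columns into one using scikit-learn's PCA. Whether reduce_dimensions() uses PCA by default, or scales its inputs first, is not stated here, so both choices are assumptions made for illustration. The values are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Invented values for four statistics across five players.
X = np.array([
    [10, 4, 8.5, 3.9],
    [2, 9, 2.4, 7.1],
    [6, 6, 5.8, 5.5],
    [12, 2, 10.1, 2.2],
    [1, 1, 1.3, 1.0],
])

# Standardise first so columns on larger scales do not dominate.
X_scaled = StandardScaler().fit_transform(X)

# Project the four columns onto the single direction of greatest variance.
reducer = PCA(n_components=1)
rating = reducer.fit_transform(X_scaled).ravel()
```

The fitted reducer's explained_variance_ratio_ attribute indicates how much of the original variation the single output column retains.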

Cluster players or teams into groups based on statistical similarities in the specified columns:

from autoscout import analyse

columns = ["goals", "assists", "xg", "xa"]
df["cluster"] = analyse.cluster_records(df, columns, estimator="auto")

Again, a custom estimator from scikit-learn can be passed to cluster_records(); otherwise a KMeans estimator is fitted automatically. The appropriate number of clusters is also derived automatically.
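
One common way to choose the number of clusters automatically is to fit KMeans for several candidate values and keep the one with the best silhouette score. The sketch below shows that approach with scikit-learn; it is an assumption about how such a selection could work, not a description of how cluster_records() derives its value. The data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three synthetic, well-separated groups of points in two dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centres])

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Higher silhouette means tighter, better separated clusters.
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels
```

Real player or team statistics are rarely this cleanly separated, so the selected number of clusters is best treated as a starting point and checked against the data.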

Developers

Oliver Stanley

Suggestions

Adding new functionality to autoscout, such as means of obtaining data from new sources or new analytical tools, is always of interest. Feel free to open a GitHub Issue with any suggestions.

Structure

├── LICENSE
├── README.md
├── requirements.txt
├── .gitignore
├── setup.py
│
├── autoscout          <- Python source root for autoscout
│   ├── data           <- Code for acquiring data
│   └── vis            <- Code for visualising data
│
├── config             <- Configuration values for feeding to autoscout functions
│
├── scripts            <- Reusable scripts for using autoscout
│   └── data           <- Scripts for acquiring data for analysis via command line
│
├── data               <- Downloaded data, not included in source control
└── notebooks          <- Experimental notebooks, not included in source control
