CrazySherman / webspider

webspider

This project analyzes NBA lineup data and builds a regression model that predicts a team's lineup score. Since the only data available are the web stats on NBA.com, we need a dedicated crawler to scrape the NBA database and form our dataset. The sample folder contains a Scrapy project with multiple spider classes, one of which scrapes NBA player stats. As for the regression model, the problem is obviously non-linear, and we don't want to mess with non-linear SVM kernels. Somewhat surprisingly, after a few months of training CNNs, my guess is that a neural network is a good way to fit a non-linear regression model, so here I build a simple neural network from brick and cement and see how it goes with NBA lineup data.

Data scraper

The sample folder follows the standard Scrapy project structure; webspider/sample/sample/spiders/PlayerSpider.py is the spider that extracts player stats. We include the following 11 metrics for each player: "FGM, FGA, 3PM, 3PA, FTM, REB, AST, TOV, STL, BLK, +/-". The spider runs a 2-level crawl: it first goes to player.nba.com to extract each player's index number, then follows the index-number URL to the player profile page to scrape the stat table.

One difficulty we need to overcome is JavaScript rendering. Both the player index numbers and the stats (metrics) are JS-rendered, and Scrapy itself has no JS rendering library, so we start a Splash server on another host and send POST requests to it to render the given URL. Three parameters govern the performance of the spider: the number of requests passed to the crawling pipeline, the time interval between requests, and the wait argument passed to the Splash renderer. The parameter values are listed in the first block of the file. Bad values can cause constant render failures, and the Splash server can only handle a limited processing load.

It turns out that NBA.com purposely throttles spider rendering traffic. During my experiments, NBA.com apparently refused frequent rendering requests from the same server. As a result, it took me almost a day to download all the player data, almost as painful as clicking and writing it down myself. The player stats table is written to the players.jl file as JSON objects.

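
As a rough illustration of the render step described above, here is a minimal sketch that builds (without sending) a POST request to Splash's `render.html` endpoint and parses JSON-lines text such as the contents of players.jl. The Splash host, the `wait` value, and both helper names are assumptions for illustration; the real spider uses Scrapy requests, and its actual parameter values sit in the first block of PlayerSpider.py.

```python
import json
import urllib.request

# Assumed values, not the ones used in PlayerSpider.py.
SPLASH_URL = "http://localhost:8050/render.html"  # hypothetical Splash host
WAIT_SECONDS = 2.0                                # hypothetical `wait` value


def build_render_request(page_url, wait=WAIT_SECONDS, splash_url=SPLASH_URL):
    """Build (but do not send) a POST request asking Splash to render page_url."""
    payload = {"url": page_url, "wait": wait}
    return urllib.request.Request(
        splash_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_json_lines(text):
    """Parse .jl content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Passing the request to `urllib.request.urlopen` would return the rendered HTML, provided a Splash server is actually listening at the assumed address. A larger `wait` gives the page's scripts more time to finish, at the cost of slower crawling.
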
Lineup data consist of two fields: the 5 players on the court, who form the "lineup", and the +/- score. We already have 11 metrics for each player, so a lineup sample has a feature vector of length 55, and the +/- score is used as the regression label. There are myriads of lineup records in the official database; however, I could not execute the page-turning JS code on the stats page, even with Splash's js_source argument. Writing a Chrome add-on could possibly solve this, but I seriously don't have the time or the mood. So I just manually downloaded the 2015/16 playoffs lineup data and used XPath to extract it into the webspider/mystats.txt file.
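
The 55-length feature vector can be sketched as a plain concatenation of per-player metrics. The record layout (a dict keyed by player name), the helper name `lineup_features`, and all numbers below are made up for illustration; the actual layout of players.jl and mystats.txt may differ.

```python
METRICS = ["FGM", "FGA", "3PM", "3PA", "FTM", "REB", "AST", "TOV", "STL", "BLK", "+/-"]


def lineup_features(lineup, player_stats):
    """Concatenate the 11 metrics of each of the 5 players into one 55-length list."""
    if len(lineup) != 5:
        raise ValueError("a lineup needs exactly 5 players")
    features = []
    for name in lineup:
        stats = player_stats[name]
        features.extend(float(stats[m]) for m in METRICS)
    return features


# Made-up stats, only to show the shapes involved.
player_stats = {
    f"player_{i}": {m: float(i + j) for j, m in enumerate(METRICS)} for i in range(5)
}
lineup = [f"player_{i}" for i in range(5)]
x = lineup_features(lineup, player_stats)  # 5 players * 11 metrics = 55 features
y = 3.5                                    # hypothetical lineup +/- (regression label)
```

Note that the vector depends on the order in which the players are listed, even though the lineup itself is an unordered set of 5 players.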

Simple Neural Network training

The neural network here is a 2-layer fully-connected network; the code is in webspider/simpleNN.py. Since the network has to produce a +/- score, the second hidden layer is summed into a single output neuron, where the RMSE loss is calculated over the batch. ReLU is used as the activation function, though it has not been proven effective for regression models. Parameters are initialized with the Xavier method, and input data are preprocessed with zero-centering and normalization, which is very important for keeping the neurons healthy. Batch size and learning rate were tuned through experiments, and we split the dataset 9:1 into training and validation sets. We eventually reach an RMSE of 1.2 on the training set and 2 on the validation set; adding more layers can improve the loss further. However, getting the loss below 1 is still hard given the non-linearity of the problem. Consider the order of the players in the feature vector: reordering them should give the same result, but this invariance is hard for a fully-connected layer to learn.
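
To make the training setup concrete, here is a minimal NumPy sketch of such a network. It is not the code in simpleNN.py: the hidden sizes, the learning rate, the full-batch updates, and the synthetic data are all assumptions. It also assumes ReLU is applied only after the first layer, so that the summed output can go negative like a +/- score.

```python
import numpy as np

rng = np.random.default_rng(0)


def xavier(fan_in, fan_out):
    """Xavier (Glorot) uniform initialization."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def forward(X, params):
    W1, b1, W2, b2 = params
    h1 = np.maximum(0.0, X @ W1 + b1)  # first layer + ReLU
    z2 = h1 @ W2 + b2                  # second layer, kept linear (assumption)
    return h1, z2.sum(axis=1)          # second layer summed into one output


def rmse(pred, y):
    return np.sqrt(np.mean((pred - y) ** 2))


def train_step(X, y, params, lr):
    """One full-batch gradient-descent step on the RMSE loss."""
    W1, b1, W2, b2 = params
    h1, pred = forward(X, params)
    loss = rmse(pred, y)
    d_pred = (pred - y) / (len(y) * max(loss, 1e-12))       # d(RMSE)/d(pred)
    d_z2 = np.repeat(d_pred[:, None], W2.shape[1], axis=1)  # back through the sum
    d_z1 = (d_z2 @ W2.T) * (h1 > 0)                         # back through ReLU
    grads = (X.T @ d_z1, d_z1.sum(axis=0), h1.T @ d_z2, d_z2.sum(axis=0))
    return [p - lr * g for p, g in zip(params, grads)], loss


# Synthetic stand-in for the lineup data: 200 samples, 55 features.
X = rng.normal(size=(200, 55))
y = X[:, :5].sum(axis=1)

split = int(0.9 * len(X))  # 9:1 training/validation split
mean = X[:split].mean(axis=0)
std = X[:split].std(axis=0) + 1e-8
X_train = (X[:split] - mean) / std  # zero-center and normalize with training stats
X_val = (X[split:] - mean) / std
y_train, y_val = y[:split], y[split:]

params = [xavier(55, 32), np.zeros(32), xavier(32, 16), np.zeros(16)]
initial_loss = rmse(forward(X_train, params)[1], y_train)
for step in range(300):
    params, loss = train_step(X_train, y_train, params, lr=0.002)
final_loss = rmse(forward(X_train, params)[1], y_train)
val_loss = rmse(forward(X_val, params)[1], y_val)
```

The validation set is normalized with the training-set mean and standard deviation, so no information from the held-out samples leaks into preprocessing.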
