ehan03 / Tapology-Scraper

A Scrapy-based spider for historical UFC data on Tapology

Tapology-Scraper

Disclaimer

This project was created for educational purposes only. According to Tapology's Terms of Use, web scraping and using the site's data for commercial purposes are strictly prohibited. Use at your own risk.

I will not be updating this repository.

Description

Tapology is one of the most data-rich websites related to MMA, but it also presents difficulties for developers who want to obtain this information systematically. These obstacles include JavaScript-based pagination, a unique website structure, and anti-scraping measures, to name a few. As an avid UFC fan and programmer, I saw this as an interesting challenge to tackle.

I've focused my efforts on UFC data, rooted here. You can find a subset of the output generated by running the spider for the most recent event at the time of writing, UFC 296, here.

Usage

  1. Clone/fork this repository.

  2. Create a virtual environment, activate it, and install the requirements. For reference, this project was created in Python 3.9.18.

     conda create -n myenv python=3.9 pip
     conda activate myenv
     pip install -r requirements.txt

  3. Navigate into the Scrapy project directory.

     cd tapology_scraper

  4. Run the spider with the scrape_type argument set to either all or most_recent. The option all will scrape all(*) past UFC event, fight, and fighter data. The option most_recent corresponds to data associated with just the most recent event.

     scrapy crawl tapology_spider -a scrape_type=all
     scrapy crawl tapology_spider -a scrape_type=most_recent
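
If you would rather launch the spider from a Python script than type the command by hand, the step above could be wrapped roughly as follows. This is only a sketch: the helper names build_crawl_command and run_crawl are hypothetical and not part of this repository, and it assumes the scrapy executable is on your PATH and that the script runs from the repository root.

```python
import subprocess

# The two values the README documents for the scrape_type argument.
VALID_SCRAPE_TYPES = ("all", "most_recent")


def build_crawl_command(scrape_type):
    """Return the argument list for the documented `scrapy crawl` command.

    Hypothetical helper: rejects anything other than the documented values.
    """
    if scrape_type not in VALID_SCRAPE_TYPES:
        raise ValueError(
            f"scrape_type must be one of {VALID_SCRAPE_TYPES}, got {scrape_type!r}"
        )
    return ["scrapy", "crawl", "tapology_spider", "-a", f"scrape_type={scrape_type}"]


def run_crawl(scrape_type, project_dir="tapology_scraper"):
    """Run the spider from inside the Scrapy project directory.

    check=True raises CalledProcessError if scrapy exits with a non-zero status.
    """
    subprocess.run(build_crawl_command(scrape_type), cwd=project_dir, check=True)
```

Calling run_crawl("most_recent") would then be equivalent to the second command above.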

If you'd like to output the data to a JSON file, add -o data.json to the end of one of the scrapy crawl commands above. Otherwise, implement your own handling of the scraped items in pipelines.py, which has been left blank for the user.
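
As a starting point for pipelines.py, a minimal item pipeline might look like the sketch below. It is not the author's implementation: the class name JsonLinesPipeline and the output file items.jsonl are made up for illustration, and it assumes every scraped item can be converted with dict() (true for plain dicts and scrapy.Item instances, not for dataclass items).

```python
import json
from pathlib import Path


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each scraped item as one JSON line."""

    def __init__(self, path="items.jsonl"):
        # Output path is an assumption; Scrapy will use the default argument.
        self.path = Path(path)
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open (and truncate) the file.
        self.file = self.path.open("w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # Serialize the item and return it so later pipelines still receive it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline only runs once it is listed in the ITEM_PIPELINES setting in settings.py, for example ITEM_PIPELINES = {"tapology_scraper.pipelines.JsonLinesPipeline": 300}. The module path here is a guess based on the project directory name, so adjust it to match the actual package layout.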


(*) Tapology has very strong anti-scraping defenses in place (kudos to them). As a result, using the code as is with the scrape_type=all option will 100% get you IP-blocked after some time. This is by design and I will not discuss methods to circumvent these measures.

License: MIT License

