JohnShuford / FITDAT

This is the start of the most comprehensive and wide-reaching fitness database ever constructed

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contact
  5. Acknowledgements

About The Project

[Tableau dashboard screenshot]

I scraped all of the performance data from the Concept2 website, cleaned and normalized it, moved it into a SQL database, and then visualized everything in Tableau. Follow this link to view and interact with the dashboard.
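The load step isn't shown in the snippets below, but as a minimal sketch (assuming SQLite, with the database and table names purely illustrative), the cleaned CSVs could be pushed into SQL with pandas:

import sqlite3

import pandas as pd

#load one of the cleaned CSVs produced by the cleaning step (file name is illustrative)
df = pd.read_csv('../CSVs/cleaned/2k_2021.csv')

#write it into a SQLite table for Tableau to connect to
with sqlite3.connect('fitdat.db') as conn:
    df.to_sql('results_2k', conn, if_exists='replace', index=False)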

Built With

The major pieces of this project are Python (Jupyter Notebook), Beautiful Soup and Splinter for scraping, pandas for cleaning, SQL for storage, and Tableau for visualization.

Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

These are the packages you need installed in your virtual environment for the project to run (a one-line install command is shown after the list).

  • pandas
  • matplotlib
  • splinter
  • time
  • bs4
  • webdriver_manager
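
Note that time ships with Python's standard library, so it needs no separate install. Assuming pip inside an activated virtual environment, the rest can be installed in one line:

pip install pandas matplotlib splinter bs4 webdriver_manager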

Installation

  1. Clone the repo
    git clone https://github.com/JohnShuford/FITDAT.git
  2. Create and activate a virtual environment
    python -m venv venv && source venv/bin/activate
  3. Install the prerequisite packages listed above
  4. Launch Jupyter and run the notebooks
    jupyter notebook

Usage

Beautiful Soup

Here is an example of what I did when I used Beautiful Soup to scrape all of the data from the Concept2 website:

#imports used below (also loaded earlier in the notebook)
import pandas as pd
from bs4 import BeautifulSoup as bs

#The loop to scrape the C2 data; the `season` list, base `url`, splinter
#`browser`, and empty `twoK2021` dataframe are all defined earlier in the notebook
for current_season in season:
    #getting the first page of the season
    initial_url = f'{url}/{current_season}/rower/2000'
    
    #getting splinter to visit the site and create a soup element to parse
    browser.visit(initial_url)
    html = browser.html
    soup = bs(html, 'lxml')
    
    #finding the number of pages for each season
    page_numbers = int(soup.select("div > ul[class=pagination] > li")[11].text)
    
    #appending the first page to the dataframe (pd.concat replaces the
    #DataFrame.append method, which was removed in pandas 2.0)
    twoK2021 = pd.concat([twoK2021, pd.read_html(initial_url)[1]], ignore_index=True)
    
    #print record to keep track of the process
    print(f'{page_numbers} pages in {current_season}')
    print(f'starting to process {current_season}')
    
    #the loop for the later pages
    for page in range(2, page_numbers + 1):
        #creating the search url for the later pages
        search_url = f'{url}/{current_season}/rower/2000?page={page}'
        
        #appending the page to the dataframe
        twoK2021 = pd.concat([twoK2021, pd.read_html(search_url)[1]], ignore_index=True)
        
        #printing record to keep track
        print(f'processing {current_season}, {page} of {page_numbers}')

#exit out of the browser
browser.quit()
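
Note that the soup object here is only used to read the pagination count; the results tables themselves are parsed straight from each URL by pd.read_html.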

For more information about the power of Beautiful Soup, follow this link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Pandas

Here is an example of a function I wrote to clean the dataframes with pandas as I iterated through them:

#the function to clean the for-time tests
def clean_forTime(df, distance, testname):
    #adding a column with the test name
    df['Test'] = testname
    
    #drop any rows that don't have a value in Time
    df = df.dropna(subset=['Time'])
    
    #apply the mintosec function to the Time column
    df['Seconds'] = df['Time'].apply(mintosec)
    
    #find the split time (seconds per 500m) with the C2 formula
    df['Split'] = df['Seconds'].apply(lambda x: round(500 * (x / distance), 2))
    
    #find the watts with the C2 formula
    df['Watts'] = df['Split'].apply(lambda x: round(2.8 / (x / 500) ** 3, 2))
    
    #sort the dataframe by watts, highest value at the top
    df = df.sort_values(by=['Watts'], ascending=False)
    
    #drop any unnecessary columns
    df = df.drop(columns=['Unnamed: 0', 'Pos.', 'Type'])
    
    #drop all of the rows that do not have a verified distance
    df = df[df['Verified'] != 'No']
    
    #send the dataframe to CSV without an index
    df.to_csv(f'../CSVs/cleaned/{testname}.csv', index=False)
    
    #return the dataframe
    return df
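
The mintosec helper used above is defined elsewhere in the notebook. As a minimal sketch (assuming Concept2 times arrive as strings like '6:32.1'), a converter along those lines, plus a hypothetical call to the cleaning function, might look like:

#hypothetical sketch of a mm:ss.s-to-seconds converter like the one the function assumes
def mintosec(time_str):
    parts = str(time_str).split(':')
    if len(parts) == 2:
        minutes, seconds = parts
        return round(int(minutes) * 60 + float(seconds), 2)
    #times under a minute have no colon
    return float(parts[0])

#hypothetical usage: clean the scraped 2k dataframe and write the cleaned CSV
twoK2021_clean = clean_forTime(twoK2021, 2000, '2k_2021')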

For more information on the power of pandas, follow this link: https://pandas.pydata.org/docs/

Contact

John Shuford - LinkedIn - johnshuford@me.com

Project Link: https://github.com/JohnShuford/Web-Design-Challenge

Deployed: https://johnshuford.github.io/Web-Design-Challenge/

Acknowledgements

  • This project was built for realFIT, inc.
