Leo200467 / WebScrapingAndScikitTraining

WebScrapingAndScikitTraining

This repository contains training files for web scraping, Pandas and some use of scikit-learn.

Some of the libraries used are:

  • Beautiful Soup for web scraping
  • Pandas for DataFrame building, modification, transformation and data preparation
  • SQLite 3 as a local database for storing scraped data

Overview of files

This notebook uses Beautiful Soup to scrape data from Books to Scrape, a sandbox website for scraping practice. The scraped data covers the books in the website catalogue from page 1 through page 50 and can be found here. The objective was to learn scraping basics and save the data into a database; later, this evolved into an attempt to predict prices.
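The scrape-and-store flow described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the inline HTML is an assumed imitation of a Books to Scrape catalogue listing, and the function names `parse_books` and `save_books`, the `books` table and its columns are hypothetical. It parses from a string so that it runs without network access.

```python
import sqlite3

from bs4 import BeautifulSoup

# Assumed markup, modelled on a Books to Scrape catalogue listing.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="#" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""

# The star rating is encoded as a word in the element's class list.
RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_books(html):
    """Return a list of (title, price, rating) tuples found in one page."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for pod in soup.select("article.product_pod"):
        title = pod.h3.a["title"]
        price = float(pod.select_one("p.price_color").text.lstrip("£"))
        classes = pod.select_one("p.star-rating")["class"]
        rating = next(RATINGS[c] for c in classes if c in RATINGS)
        books.append((title, price, rating))
    return books


def save_books(conn, books):
    """Create the (hypothetical) books table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books (title TEXT, price REAL, rating INTEGER)"
    )
    conn.executemany("INSERT INTO books VALUES (?, ?, ?)", books)
    conn.commit()


conn = sqlite3.connect(":memory:")  # a file path would persist the data
books = parse_books(SAMPLE_HTML)
save_books(conn, books)
```

For the full catalogue, each of the 50 pages would be downloaded in turn and its HTML passed to `parse_books` before the rows are saved.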

This notebook was used to take the book price prediction attempt further with the scraped data. The objective was to try out methods such as Random Forest Regressor to predict book prices using the following features: number of pages, rating and the prices of other books. Some models were built, but the results were not very good, with Mean Absolute Error ranging from $11 to $13.
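A rough sketch of this kind of experiment is shown below: fit a Random Forest Regressor and measure Mean Absolute Error on held-out data. The data here is synthetic and generated only for illustration; it uses just two of the features mentioned (pages and rating), and the hyperparameters are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the scraped features (not the real dataset).
rng = np.random.default_rng(0)
n_books = 200
pages = rng.integers(50, 900, size=n_books)
rating = rng.integers(1, 6, size=n_books)
price = rng.uniform(10, 60, size=n_books)  # unrelated to the features by construction

X = np.column_stack([pages, rating])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, price, test_size=0.25, random_state=0
)

# Assumed hyperparameters; the notebook may have used different ones.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# MAE is the average absolute gap between predicted and actual prices.
mae = mean_absolute_error(y_valid, model.predict(X_valid))
print(f"Validation MAE: {mae:.2f}")
```

Because the synthetic prices carry no signal from the features, the error printed here only demonstrates the workflow and says nothing about the real data.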

This notebook was used in a Kaggle challenge, the Housing Prices Competition for Kaggle Learn. The goal was to build and tune a machine learning model for housing price prediction using a dataset provided by Kaggle. This specific lesson focused on encoding categorical data. The objective was to learn more about price prediction methods and feature engineering.

Both notebooks were used to learn more about working with XML data: parsing, inserting, modeling and transforming it, then pushing the data to SQLite3.
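The parse-then-store idea can be sketched with the Python standard library alone. The XML document, its element names and the `books` table below are hypothetical examples, not the data used in the notebooks.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML document used only for illustration.
SAMPLE_XML = """
<catalog>
  <book id="bk101"><title>XML Basics</title><price>44.95</price></book>
  <book id="bk102"><title>SQLite Notes</title><price>5.95</price></book>
</catalog>
"""

# Parse the XML and flatten each <book> element into a tuple.
root = ET.fromstring(SAMPLE_XML.strip())
rows = [
    (book.get("id"), book.findtext("title"), float(book.findtext("price")))
    for book in root.iter("book")
]

# Push the rows into SQLite; ":memory:" keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id TEXT PRIMARY KEY, title TEXT, price REAL)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", rows)
conn.commit()

for row in conn.execute("SELECT id, title, price FROM books ORDER BY id"):
    print(row)
```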

Languages

Language:Jupyter Notebook 100.0%