Jiashuo-Sun / DoubanBookList

Use a web crawler to scrape book information from Douban Book and rank the books with the highest review scores into a book list.

Douban Book List

Douban Book is the largest book review website in China. Each book's page lists the book name, author, International Standard Book Number (ISBN), average review score, and number of reviews.

In this project, I use a web crawler to grab book information from Douban Book, then clean the data and merge the different editions of the same book. The result is a ranked list of the top 500 highest-scoring books.
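The clean and merge step could look roughly like the sketch below. It is not the project's actual code: the function name `merge_editions`, the row keys (`name`, `author`, `avg`, `num`), the default threshold, the matching rule (same normalized name and author), and the review-count-weighted average are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_editions(rows, min_reviews=100):
    """Drop sparsely reviewed rows, then merge editions of the same book.

    Assumptions (not taken from the project): editions are matched on a
    normalized (name, author) key, and the merged score is the
    review-count-weighted mean of the editions' scores.
    """
    groups = defaultdict(list)
    for row in rows:
        # Step 1: drop rows without enough reviews.
        if row["num"] < min_reviews:
            continue
        # Step 2: group the remaining rows by a normalized key.
        key = (row["name"].strip().lower(), row["author"].strip().lower())
        groups[key].append(row)

    merged = []
    for editions in groups.values():
        total = sum(e["num"] for e in editions)
        score = sum(e["avg"] * e["num"] for e in editions) / total
        # Keep the display name of the most-reviewed edition.
        top = max(editions, key=lambda e: e["num"])
        merged.append({
            "name": top["name"],
            "author": top["author"],
            "avg": round(score, 2),
            "num": total,
        })
    return merged
```

Weighting by review count keeps a lightly reviewed edition from pulling the merged score as hard as a heavily reviewed one; a plain mean would be a simpler alternative.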

Project Structure

You can check some book list results in BookList.

Main datasets

You can find these datasets in data.

  1. Book Link Set

Only records book links.

  2. Book Info Set (Main)

The main dataset for this project. It contains Book Name, Author, ISBN, Average Review Score, Review Number, URL, and Update Date.

  3. Clean Book Info

Drops books without enough reviews, cleans the book information from BookInfoSet.csv, and merges the different editions of the same book.

  4. Book Lists

All the filtered book lists are stored in the format "BookList-MinAvg-MinNum-Date.csv". The books are sorted by average review score and number of reviews. Given a minimum average review score and a minimum review number, booklist.py generates the corresponding book list using a MySQL database.

  5. History data of Clean Book Info

Since I'm still scraping data from the website daily, I update the CleanBookInfo dataset every day and save it in history_data for future filtering.
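The filtering behind the Book Lists described above can be sketched as a single SQL query. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module in place of MySQL so that it runs without a server, and the table name `clean_book_info`, its column names, the function names, and the ISO date format in the file name are all hypothetical rather than taken from booklist.py.

```python
import sqlite3
from datetime import date


def build_book_list(books, min_avg, min_num, limit=500):
    """Return (name, avg_score, review_num) rows that pass both thresholds.

    `books` is an iterable of (name, avg_score, review_num) tuples.
    SQLite stands in for MySQL here; the schema is an assumption.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clean_book_info "
        "(name TEXT, avg_score REAL, review_num INTEGER)"
    )
    conn.executemany("INSERT INTO clean_book_info VALUES (?, ?, ?)", books)
    rows = conn.execute(
        "SELECT name, avg_score, review_num FROM clean_book_info "
        "WHERE avg_score >= ? AND review_num >= ? "
        # Sort by score first, then break ties by number of reviews.
        "ORDER BY avg_score DESC, review_num DESC LIMIT ?",
        (min_avg, min_num, limit),
    ).fetchall()
    conn.close()
    return rows


def book_list_filename(min_avg, min_num, day):
    """Build a name following the BookList-MinAvg-MinNum-Date.csv pattern.

    The ISO date format (YYYY-MM-DD) is an assumption.
    """
    return f"BookList-{min_avg}-{min_num}-{day.isoformat()}.csv"
```

For example, `build_book_list(books, 9.0, 100)` would keep only books scoring at least 9.0 with at least 100 reviews, and `book_list_filename(9.0, 100, date.today())` would produce a matching file name.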

I exported one book list as a PDF file; you can find a completed one here.

Happy Reading!

(I'm still working on this! Keep Updating~~)

Languages

Language:Python 100.0%