Douban Book is the largest book review website in China. Each book's page lists the book name, author, International Standard Book Number (ISBN), average review score, and number of reviews.
In this project, I use a web crawler to collect book information from Douban Book, then clean the data and merge records of the same book across different editions, ranking the top 500 highest-scored books.
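The crawling step boils down to fetching a book page and pulling out the fields listed above. A minimal parsing sketch is below; the HTML patterns are illustrative assumptions, not Douban's actual markup, and a real crawler would use an HTML parser plus request throttling.

```python
import re

def parse_book_page(html: str) -> dict:
    """Extract book fields from a Douban-style book page.

    NOTE: the tag patterns below are hypothetical stand-ins for
    Douban's real markup, used only to show the shape of the step.
    """
    def grab(pattern):
        m = re.search(pattern, html, re.S)
        return m.group(1).strip() if m else None

    return {
        "name": grab(r"<h1>(.*?)</h1>"),
        "author": grab(r"作者:\s*(.*?)<"),
        "isbn": grab(r"ISBN:\s*([\dXx-]+)"),
        "avg_score": grab(r'rating_num">\s*([\d.]+)'),
        "review_num": grab(r"(\d+)\s*人评价"),
    }

sample = '''<h1>活着</h1> 作者: 余华<br/>
ISBN: 9787506365437 <span class="rating_num"> 9.4</span> 745000人评价'''
info = parse_book_page(sample)
```

Each extracted record then becomes one row of the raw dataset.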
You can check some book list results in BookList.
You can find these datasets in data.
Records only the book links.
The main dataset for this project. It contains Book Name, Author, ISBN, Average Review Score, Number of Reviews, URL, and Update Date.
Drops books without enough reviews, cleans the book information from BookInfoSet.csv, and merges records of the same book from different editions.
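One way to merge editions is to group records by book identity and combine their scores, weighting each edition's average by its review count. A sketch under the assumption that (name, author) identifies a book; the real pipeline may also match on ISBN or use fuzzier rules.

```python
from collections import defaultdict

def merge_editions(books):
    """Merge records of the same book published in different editions.

    `books` is a list of dicts with keys: name, author, score, num.
    Grouping on (name, author) is a simplifying assumption.
    """
    groups = defaultdict(list)
    for b in books:
        key = (b["name"].strip().lower(), b["author"].strip().lower())
        groups[key].append(b)

    merged = []
    for (name, author), editions in groups.items():
        total = sum(e["num"] for e in editions)
        # Weight each edition's score by its review count.
        avg = sum(e["score"] * e["num"] for e in editions) / total
        merged.append({"name": name, "author": author,
                       "score": round(avg, 2), "num": total})
    return merged

books = [
    {"name": "To Live", "author": "Yu Hua", "score": 9.4, "num": 700},
    {"name": "To Live", "author": "Yu Hua", "score": 9.0, "num": 300},
]
result = merge_editions(books)
```

The weighted average keeps a heavily reviewed edition from being diluted by a niche one with few reviews.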
All the filtered book lists are stored in the format "BookList-MinAvg-MinNum-Date.csv". The books are sorted by average review score and number of reviews. Given a minimum average review score and a minimum review count, booklist.py generates the corresponding book list using a MySQL database.
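The filtering step described above can be sketched as a threshold-plus-sort; the function and the embedded SQL use hypothetical column names (`avg_score`, `review_num`) rather than the project's actual schema.

```python
def build_booklist(books, min_avg, min_num):
    """Filter and rank books: drop anything below the thresholds,
    then sort by average score and review count, both descending,
    keeping the top 500. An equivalent MySQL query (assumed schema):

        SELECT * FROM clean_book_info
        WHERE avg_score >= %s AND review_num >= %s
        ORDER BY avg_score DESC, review_num DESC
        LIMIT 500;
    """
    kept = [b for b in books
            if b["score"] >= min_avg and b["num"] >= min_num]
    kept.sort(key=lambda b: (-b["score"], -b["num"]))
    return kept[:500]

books = [
    {"name": "A", "score": 9.5, "num": 2000},
    {"name": "B", "score": 8.0, "num": 50},
    {"name": "C", "score": 9.5, "num": 5000},
]
top = build_booklist(books, min_avg=9.0, min_num=1000)
```

The review-count tiebreaker means that among equally rated books, the more widely read one ranks higher.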
Since I'm still scraping data from the website daily, I update the CleanBookInfo dataset every day and save it in history_data for future filtering.
I export one book list as a PDF file; you can find a completed one here
Happy Reading!
(I'm still working on this! Keep Updating~~)