JeremyLoh / web-crawler

A web crawler used to obtain information about books from https://www.bookdepository.com/

web-crawler

A web crawler used to obtain information about books from https://www.bookdepository.com/ when a barcode (isbn13) is given.

The input and output files are located in the data/ directory. The input file (inputIsbn13.txt) contains ISBN-13 barcodes, one per line. The output file (output.txt) contains all of the results retrieved from https://www.bookdepository.com/.
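As a rough illustration of the input format, the sketch below splits newline-separated text into individual barcodes and separates well-formed ISBN-13 values from malformed ones using the standard ISBN-13 check digit. The function names (`parseIsbnList`, `isValidIsbn13`) are hypothetical and not taken from the repository; reading data/inputIsbn13.txt from disk is left to the caller.

```javascript
// Hypothetical helpers; not part of the crawler's actual source.

// Returns true when `code` is 13 digits with a correct ISBN-13 check digit.
function isValidIsbn13(code) {
  if (!/^\d{13}$/.test(code)) return false;
  // Weights alternate 1, 3, 1, 3, ... and the weighted sum must be divisible by 10.
  const sum = code
    .split("")
    .reduce((acc, ch, i) => acc + Number(ch) * (i % 2 === 0 ? 1 : 3), 0);
  return sum % 10 === 0;
}

// Splits newline-separated text into trimmed, non-empty barcodes,
// then sorts them into valid and invalid groups.
function parseIsbnList(text) {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return {
    valid: lines.filter((line) => isValidIsbn13(line)),
    invalid: lines.filter((line) => !isValidIsbn13(line)),
  };
}

// Blank lines and surrounding whitespace are ignored.
const sample = "9780241200131\n\n 9780241200132 \r\nnot-a-barcode\n";
console.log(parseIsbnList(sample));
```

Rejecting malformed barcodes before any page is requested would avoid a wasted lookup, though whether the crawler does this is not stated here.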

Frameworks/packages used:

Setup:

  1. Ensure that Node.js is installed

    Install v12.16.1 or higher, as this is the oldest active LTS version. Only releases that are, or will become, an LTS release are officially supported.

  2. Clone the repo

  3. Navigate to the cloned repo and run the following command in the terminal:

    npm i
    npm start
    
  4. The current web crawler works with the Chrome browser.

Storage of information

The following information is stored in JSON format:

  1. Barcode (isbn13)
  2. Format
  3. Dimensions
  4. Publication Date
  5. Publisher
  6. Imprint
  7. Publication Country
  8. Language
  9. Edition Statement
  10. isbn10
  11. isbn13
  12. Bestseller Rank
  13. Description
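To make the stored shape concrete, here is a hedged sketch of what one record might look like when serialized. The key names are assumptions chosen for illustration (the actual keys in output.txt may differ), and storing trimmed values is also an assumption. The values are copied from the individual parsing examples in this README.

```javascript
// Hypothetical record shape; the key names are assumptions,
// not necessarily the keys the crawler writes.
const record = {
  barcode: "9780241200131",
  format: "560 pages",
  dimensions: "129 x 198 x 24mm | 383g",
  publicationDate: "01 Sep 2015",
  publisher: "Penguin Books Ltd",
  imprint: "PENGUIN CLASSICS",
  publicationCountry: "London, United Kingdom",
  language: "English",
  editionStatement: "UK ed.",
  isbn10: "024120013X",
  isbn13: "9780241200131",
  bestsellerRank: "7,918",
  description: null, // omitted here for brevity
};

// Serialize with two-space indentation, as one might before writing to data/output.txt.
const serialized = JSON.stringify(record, null, 2);
console.log(serialized);
```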

Example of how information is parsed from https://www.bookdepository.com/ using regex

Format:

format = 'Format Paperback | 560 pages'
format.match(new RegExp(/\d+\s.+/))
>> Array [ "560 pages" ]

Dimensions:

dimensions = 'Dimensions 129 x 198 x 24mm | 383g'
dimensions.match(new RegExp(/\d+.+/))
>> Array [ "129 x 198 x 24mm | 383g" ]
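If numeric values are needed rather than a single string, the matched text could be broken down further. This is a minimal sketch, assuming the "A x B x Cmm | Ng" layout shown above (other listings may omit a measurement or the weight). `parseDimensions` is a hypothetical helper, and because the README does not say which measurement is width, height, or depth, the sizes are kept as an ordered list.

```javascript
// Hypothetical helper; not part of the crawler's actual source.
// Returns { sizesMm: [a, b, c], weightG } or null when the text does not match.
function parseDimensions(text) {
  const match = text.match(
    /(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*mm(?:\s*\|\s*(\d+(?:\.\d+)?)\s*g)?/
  );
  if (!match) return null;
  return {
    // The three measurements in the order they appear on the page.
    sizesMm: [Number(match[1]), Number(match[2]), Number(match[3])],
    // The weight is optional in this sketch; null when absent.
    weightG: match[4] === undefined ? null : Number(match[4]),
  };
}

console.log(parseDimensions("Dimensions 129 x 198 x 24mm | 383g"));
// { sizesMm: [ 129, 198, 24 ], weightG: 383 }
```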

Publication Date

publicationDate = 'Publication date 01 Sep 2015'
publicationDate.match(new RegExp(/\d{2} \w{3} \d{4}/))
>> Array [ "01 Sep 2015" ]
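Should a sortable date be preferable to the display string, the matched value could be converted to ISO 8601 form. The sketch below assumes English three-letter month abbreviations, as in the example above. `toIsoDate` is a hypothetical name, and the conversion is done with plain string handling to avoid relying on implementation-dependent `Date` parsing.

```javascript
// Hypothetical helper; not part of the crawler's actual source.
const MONTHS = [
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
];

// Converts text containing "DD Mon YYYY" into "YYYY-MM-DD", or returns null.
function toIsoDate(text) {
  const match = text.match(/(\d{2}) (\w{3}) (\d{4})/);
  if (!match) return null;
  const monthIndex = MONTHS.indexOf(match[2]);
  if (monthIndex === -1) return null; // unknown month abbreviation
  const month = String(monthIndex + 1).padStart(2, "0");
  return `${match[3]}-${month}-${match[1]}`;
}

console.log(toIsoDate("Publication date 01 Sep 2015")); // "2015-09-01"
```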

Publisher

publisher = 'Publisher Penguin Books Ltd'
publisher.match(new RegExp(/[^Publisher].+/))
>> Array [ " Penguin Books Ltd" ]

Imprint

imprint = 'Imprint PENGUIN CLASSICS'
imprint.match(new RegExp(/[^Imprint].+/i))
>> Array [ " PENGUIN CLASSICS" ]

Publication Country

publicationCountry = 'Publication City/Country London, United Kingdom'
publicationCountry.match(new RegExp(/[^(?!Publication City\/Country )].+/))
>> Array [ "London, United Kingdom" ]

Language

language = 'Language English'
language.match(new RegExp(/[^(?!language)].+/i))
>> Array [ " English" ]

Edition Statement

editionStatement = 'Edition Statement UK ed.'
editionStatement.match(new RegExp(/[^(?!edition statement)].+/i))
>> Array [ "UK ed." ]

isbn10

isbn10 = 'ISBN10 024120013X'
isbn10.match(new RegExp(/[^(?!isbn10)].+/i))
>> Array [ " 024120013X" ]

isbn13

isbn13 = 'ISBN13 9780241200131'
isbn13.match(new RegExp(/[^(?!isbn13)].+/i))
>> Array [ " 9780241200131" ]

Bestseller Rank

bestsellerRank = 'Bestsellers rank 7,918'
bestsellerRank.match(new RegExp(/[^(?!bestsellers rank)].+/i))
>> Array [ "7,918" ]
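Note that patterns such as /[^Publisher].+/ and /[^(?!language)].+/i are negated character classes, not negative lookaheads: they skip any leading characters that appear in the bracketed set, so they can also swallow the start of the value. The examples above happen to produce the expected output, but a value beginning with letters from the label does not. A possibly more robust alternative, sketched here with a hypothetical `stripLabel` helper that is not part of the repository, anchors on the literal label instead.

```javascript
// The existing pattern treats the bracketed text as a set of characters to skip,
// so "Pari" is skipped as well because P, a, r and i all occur in the label.
const legacyPattern = /[^(?!Publication City\/Country )].+/;
const legacy = "Publication City/Country Paris, France".match(legacyPattern);
console.log(legacy[0]); // "s, France"

// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: removes a leading label (case-insensitive) and
// surrounding whitespace. If the label is absent, the trimmed text is returned.
function stripLabel(text, label) {
  const pattern = new RegExp("^\\s*" + escapeRegExp(label) + "\\s*", "i");
  return text.replace(pattern, "").trim();
}

console.log(stripLabel("Publication City/Country Paris, France", "Publication City/Country"));
// "Paris, France"
console.log(stripLabel("ISBN10 024120013X", "ISBN10"));
// "024120013X"
```

Unlike several of the examples above, this approach returns values without a leading space, so no separate trim step is needed.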

Description

Collapsing runs of whitespace (including multiple newlines) into a single newline and removing the trailing "show more" text:

  • description.replace(/[\r\n\s]{2,}/g,"\n")
results.forEach(result => {
   result.description = result.description.replace(/[\r\n\s]{2,}/g, "\n");
   result.description = result.description.replace(/[\r\n\s]*(show more)[\r\n\s]*$/, "");
   result.description = result.description.trim();
});
description
>> "
                Description


                    With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture.
                    show more

            ";

description.match(new RegExp(/[^(?!\n\sdescription)].+/i))
>> Array [ "With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture.    " ]
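The description steps above can be combined into a single function. This sketch, with a hypothetical `cleanDescription` name, differs from the snippets above in ways that are assumptions about the desired output: it anchors on the leading "Description" heading, and it collapses runs of spaces into one space, so that a double space inside a sentence is not turned into a newline as it would be by /[\r\n\s]{2,}/g. The sample text is a shortened placeholder, not the real description.

```javascript
// Hypothetical helper; not part of the crawler's actual source.
function cleanDescription(raw) {
  return raw
    .replace(/^\s*Description\s*/i, "") // drop the leading heading
    .replace(/\s*show more\s*$/i, "")   // drop the trailing "show more" text
    .replace(/[ \t]*\r?\n\s*/g, "\n")   // collapse newline runs (and indentation) to one newline
    .replace(/[ \t]{2,}/g, " ")         // collapse runs of spaces or tabs to one space
    .trim();
}

// Placeholder text laid out like the raw description shown above.
const rawSample =
  "\n                Description\n\n\n" +
  "                    First sentence.  Second sentence.\n" +
  "                    show more\n\n            ";

console.log(JSON.stringify(cleanDescription(rawSample)));
// "First sentence. Second sentence."
```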



Languages

Language:JavaScript 100.0%