web-crawler

A web crawler that retrieves information about a book from https://www.bookdepository.com/ when its barcode (isbn13) is given.

The input and output files are located in the data/ directory. The input file (inputIsbn13.txt) contains isbn13 barcodes, one per line. The output file (output.txt) contains all of the results of the web search from https://www.bookdepository.com/.
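
As a minimal sketch of the input format described above, the barcode list could be read as follows. The function names `parseIsbnList` and `readIsbnFile` are illustrative assumptions and are not taken from the repo; only the path data/inputIsbn13.txt comes from this README.

```javascript
const fs = require("fs");

// Split raw file contents into one barcode per line,
// trimming whitespace and dropping blank lines.
function parseIsbnList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Hypothetical helper: read the input file named in this README.
function readIsbnFile(path = "data/inputIsbn13.txt") {
  return parseIsbnList(fs.readFileSync(path, "utf8"));
}

// Blank lines and stray spaces are ignored.
console.log(parseIsbnList("9780241200131\r\n\n  9780241200131  \n"));
```

Keeping the parsing step separate from the file read makes it easy to check the line handling without touching the disk.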

Frameworks/packages used:

Setup:

Ensure that Node.js is installed

Install v12.16.1 or higher, as this is the oldest active LTS version. Only releases that are, or will become, an LTS release are officially supported.
Clone the repo
Navigate to the cloned repo and run the following commands in the terminal:
```
npm i
npm start
```
The current web crawler works with the Chrome browser.

Storage of information

The following information is stored in JSON format:

Barcode (isbn13)
Format
Dimensions
Publication Date
Publisher
Imprint
Publication Country
Language
Edition Statement
isbn10
isbn13
Bestseller Rank
Description
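
For illustration, one stored record might look like the object below. The key names are assumptions made for this sketch (this README lists the fields but not the exact keys), and the values are adapted from the parsing examples later in this README, with the description left abbreviated.

```javascript
// Hypothetical record: key names are assumed, values come from this README's examples.
const exampleRecord = {
  barcode: "9780241200131",
  format: "560 pages",
  dimensions: "129 x 198 x 24mm | 383g",
  publicationDate: "01 Sep 2015",
  publisher: "Penguin Books Ltd",
  imprint: "PENGUIN CLASSICS",
  publicationCountry: "London, United Kingdom",
  language: "English",
  editionStatement: "UK ed.",
  isbn10: "024120013X",
  isbn13: "9780241200131",
  bestsellerRank: "7,918",
  description: "With its astounding hardcover reviews ...",
};

// Serialise with two-space indentation, as one might before writing to data/output.txt.
const exampleJson = JSON.stringify(exampleRecord, null, 2);
console.log(exampleJson);
```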

Examples of how information is parsed from https://www.bookdepository.com/ using regular expressions

Format:

format = 'Format Paperback | 560 pages'
format.match(new RegExp(/\d+\s.+/))
>> Array [ "560 pages" ]

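The pattern above keeps only the page count. If the binding is wanted as well, a sketch along the following lines could capture both in one pass. `parseFormat` is a hypothetical helper, and it assumes the line always has the shape `Format <binding> | <n> pages`.

```javascript
// Capture the binding and the page count using named groups.
const FORMAT_PATTERN = /^Format\s+(?<binding>[^|]+?)\s*\|\s*(?<pages>\d+)\s+pages$/;

// Hypothetical helper, not part of the repo.
function parseFormat(text) {
  const match = text.trim().match(FORMAT_PATTERN);
  if (!match) return null; // unexpected layout: let the caller decide what to do
  return {
    binding: match.groups.binding,
    pages: Number(match.groups.pages),
  };
}

console.log(parseFormat("Format Paperback | 560 pages"));
// { binding: 'Paperback', pages: 560 }
```

Returning `null` for an unrecognised layout avoids the `TypeError` that indexing a failed `match` would raise.
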
Dimensions:

dimensions = 'Dimensions 129 x 198 x 24mm | 383g'
dimensions.match(new RegExp(/\d+.+/))
>> Array [ "129 x 198 x 24mm | 383g" ]

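Should numeric values be needed rather than the raw string, the matched text could be split further. This is a sketch under the assumption that three measurements in millimetres and a weight in grams are always present. The helper name `parseDimensions` and the output keys are hypothetical, and the measurements are kept in the order listed because this README does not say which is width, height or depth.

```javascript
// Three measurements in mm followed by a weight in g, e.g. "129 x 198 x 24mm | 383g".
const NUM = "(\\d+(?:\\.\\d+)?)";
const DIMENSIONS_PATTERN = new RegExp(
  `${NUM}\\s*x\\s*${NUM}\\s*x\\s*${NUM}\\s*mm\\s*\\|\\s*${NUM}\\s*g`
);

// Hypothetical helper, not part of the repo.
function parseDimensions(text) {
  const match = text.match(DIMENSIONS_PATTERN);
  if (!match) return null; // some measurements missing or laid out differently
  const [first, second, third, weight] = match.slice(1).map(Number);
  return { sizeMm: [first, second, third], weightG: weight };
}

console.log(parseDimensions("Dimensions 129 x 198 x 24mm | 383g"));
// { sizeMm: [ 129, 198, 24 ], weightG: 383 }
```
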
Publication Date

publicationDate = 'Publication date 01 Sep 2015'
publicationDate.match(new RegExp(/\d{2} \w{3} \d{4}/))
>> Array [ "01 Sep 2015" ]

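If the date needs to be sorted or compared, the matched text could be converted to ISO 8601 form. The sketch below assumes English three-letter month abbreviations, as in the example above; `toIsoDate` is a hypothetical helper.

```javascript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Hypothetical helper: "Publication date 01 Sep 2015" -> "2015-09-01".
function toIsoDate(text) {
  const match = text.match(/(\d{2}) (\w{3}) (\d{4})/);
  if (!match) return null;
  const [, day, monthName, year] = match;
  const monthIndex = MONTHS.indexOf(monthName);
  if (monthIndex === -1) return null; // unknown month abbreviation
  const month = String(monthIndex + 1).padStart(2, "0");
  return `${year}-${month}-${day}`;
}

console.log(toIsoDate("Publication date 01 Sep 2015")); // 2015-09-01
```
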
Publisher

publisher = 'Publisher Penguin Books Ltd'
publisher.match(new RegExp(/[^Publisher].+/))
>> Array [ " Penguin Books Ltd" ]

Imprint

imprint = 'Imprint PENGUIN CLASSICS'
imprint.match(new RegExp(/[^Imprint].+/i))
>> Array [ " PENGUIN CLASSICS" ]

Publication Country

publicationCountry = 'Publication City/Country London, United Kingdom'
publicationCountry.match(new RegExp(/[^(?!Publication City\/Country )].+/))
>> Array [ "London, United Kingdom" ]

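The bracketed patterns in these examples, such as `[^(?!Publication City\/Country )]`, are negated character classes rather than lookaheads: the match starts at the first character that is not listed inside the brackets. When the brackets contain a space, leading characters of the value can be skipped too. A more defensive sketch, which anchors on the label text instead, might look like this. `stripLabel` is a hypothetical helper, not part of the repo, and the "Paris, France" line is an invented input used only to show the difference.

```javascript
// Hypothetical helper: remove a known label from the start of a scraped line,
// ignoring case, and trim the remainder.
function stripLabel(text, label) {
  // Escape regex metacharacters so the label is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return text.replace(new RegExp(`^\\s*${escaped}\\s*`, "i"), "").trim();
}

// Invented input whose value starts with letters that appear inside the brackets.
const sampleLine = "Publication City/Country Paris, France";

// Character-class pattern from the example above: "P", "a", "r", "i" are skipped.
console.log(sampleLine.match(/[^(?!Publication City\/Country )].+/)[0]); // s, France

// Label-anchored alternative keeps the whole value.
console.log(stripLabel(sampleLine, "Publication City/Country")); // Paris, France
```

The same helper also removes the leading space that patterns such as `[^Publisher].+` leave in their output.
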
Language

language = 'Language English'
language.match(new RegExp(/[^(?!language)].+/i))
>> Array [ " English" ]

Edition Statement

editionStatement = 'Edition Statement UK ed.'
editionStatement.match(new RegExp(/[^(?!edition statement)].+/i))
>> Array [ "UK ed." ]

isbn10

isbn10 = 'ISBN10 024120013X'
isbn10.match(new RegExp(/[^(?!isbn10)].+/i))
>> Array [ " 024120013X" ]

isbn13

isbn13 = 'ISBN13 9780241200131'
isbn13.match(new RegExp(/[^(?!isbn13)].+/i))
>> Array [ " 9780241200131" ]

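Once extracted, the value can be sanity-checked against the standard ISBN-13 check digit rule (digits weighted alternately by 1 and 3 must sum to a multiple of 10). This check is not described in the repo; `isValidIsbn13` is a hypothetical helper shown as a sketch.

```javascript
// Hypothetical helper: validate an ISBN-13 string using its check digit.
function isValidIsbn13(value) {
  const digits = value.trim(); // the match above keeps a leading space
  if (!/^\d{13}$/.test(digits)) return false;
  const sum = [...digits].reduce(
    (total, ch, index) => total + Number(ch) * (index % 2 === 0 ? 1 : 3),
    0
  );
  return sum % 10 === 0;
}

console.log(isValidIsbn13(" 9780241200131")); // true
console.log(isValidIsbn13("9780241200132"));  // false
```
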
Bestseller Rank

bestsellerRank = 'Bestsellers rank 7,918'
bestsellerRank.match(new RegExp(/[^(?!bestsellers rank)].+/i))
>> Array [ "7,918" ]

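The rank is matched as text with a thousands separator. If a number is preferred for sorting, a conversion along these lines could be applied; `parseRank` is a hypothetical helper and assumes commas are the only separator used.

```javascript
// Hypothetical helper: "Bestsellers rank 7,918" -> 7918.
function parseRank(text) {
  const match = text.match(/\d[\d,]*/); // first run of digits and commas
  if (!match) return null;
  return Number(match[0].replace(/,/g, ""));
}

console.log(parseRank("Bestsellers rank 7,918")); // 7918
```
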
Description

Collapse runs of two or more whitespace characters (including repeated newlines) into a single newline:

description.replace(/[\r\n\s]{2,}/g,"\n")

results.forEach(result => {
   result.description = result.description.replace(/[\r\n\s]{2,}/g, "\n");
   result.description = result.description.replace(/[\r\n\s]*(show more)[\r\n\s]*$/, "");
   result.description = result.description.trim();
});

description
>> "
                Description


                    With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture.
                    show more

            ";

description.match(new RegExp(/[^(?!\n\sdescription)].+/i))
>> Array [ "With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture.    " ]
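
The clean-up steps above can be combined into a single function, sketched below. `cleanDescription` is a hypothetical name, and the sketch assumes the scraped text starts with a "Description" heading and may end with "show more", as in the example. Since `\s` already matches `\r` and `\n`, the class `[\r\n\s]` is written here simply as `\s`.

```javascript
// Hypothetical helper combining the description clean-up steps.
function cleanDescription(raw) {
  return raw
    .replace(/\s{2,}/g, "\n")            // collapse whitespace runs into one newline
    .replace(/^\s*Description\s*/i, "")  // drop the leading heading
    .replace(/\s*show more\s*$/i, "")    // drop the trailing link text
    .trim();
}

// Shortened stand-in for the scraped text shown above.
const rawDescription =
  "\n        Description\n\n\n            Some text here.\n            show more\n\n    ";

console.log(cleanDescription(rawDescription)); // Some text here.
```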