A web crawler that retrieves information about a book from https://www.bookdepository.com/ when given its barcode (isbn13).
The output and input files are located in the data/ directory.
The input file (inputIsbn13.txt) contains isbn13 barcodes, one per line.
The output file (output.txt) contains all of the results of the web search from https://www.bookdepository.com/.
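As a rough illustration of the input format, the sketch below splits newline-separated text into barcodes and keeps only those with a valid ISBN-13 check digit. The helper names `parseIsbnList` and `isValidIsbn13` are hypothetical and are not taken from this repo. The checksum step is an assumption about useful input hygiene, not something the crawler is known to do.

```javascript
// Hypothetical helper: true when `code` is 13 digits with a valid ISBN-13 check digit.
function isValidIsbn13(code) {
  if (!/^\d{13}$/.test(code)) return false;
  // Weights alternate 1, 3, 1, 3, ... over all 13 digits; the total must be a multiple of 10.
  const sum = [...code].reduce(
    (acc, ch, i) => acc + Number(ch) * (i % 2 === 0 ? 1 : 3),
    0
  );
  return sum % 10 === 0;
}

// Hypothetical helper: split file contents into trimmed, non-empty lines.
function parseIsbnList(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.length > 0);
}

// Sample contents in the shape of data/inputIsbn13.txt (the second code has a wrong check digit).
const sample = "9780241200131\n\n9780241200132\r\n";
const codes = parseIsbnList(sample);
const valid = codes.filter(isValidIsbn13);
console.log(valid);
```

In a real run the text would come from the input file, for example via `fs.readFileSync` with the `"utf8"` encoding, instead of the inline sample string.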
Frameworks/packages used:
- Ensure that Node.js is installed. Install v12.16.1 or higher, as this is the oldest active LTS version. Only releases that are or will become an LTS release are officially supported.
- Clone the repo.
- Navigate to the cloned repo and run the following commands in the terminal:
  npm i
  npm start
- The current web crawler works with the Chrome browser.
The following information is stored in JSON format:
- Barcode (isbn13)
- Format
- Dimensions
- Publication Date
- Publisher
- Imprint
- Publication Country
- Language
- Edition Statement
- isbn10
- isbn13
- Bestseller Rank
- Description
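For orientation, one stored record might look roughly like the object below. The key names are assumptions chosen for illustration, so the actual keys in output.txt may differ. The values are the ones used in the parsing examples in this README, with the description shortened.

```javascript
// Hypothetical record shape; key names are illustrative, not confirmed by the repo.
const record = {
  barcode: "9780241200131",
  format: "560 pages",
  dimensions: "129 x 198 x 24mm | 383g",
  publicationDate: "01 Sep 2015",
  publisher: "Penguin Books Ltd",
  imprint: "PENGUIN CLASSICS",
  publicationCountry: "London, United Kingdom",
  language: "English",
  editionStatement: "UK ed.",
  isbn10: "024120013X",
  isbn13: "9780241200131",
  bestsellerRank: "7,918",
  description:
    "An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture."
};

// Serialize with two-space indentation, as one might before writing to the output file.
const json = JSON.stringify(record, null, 2);
console.log(json);
```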
Examples of how each field is parsed from https://www.bookdepository.com/ using regex:
format = 'Format Paperback | 560 pages'
format.match(new RegExp(/\d+\s.+/))
>> Array [ "560 pages" ]
dimensions = 'Dimensions 129 x 198 x 24mm | 383g'
dimensions.match(new RegExp(/\d+.+/))
>> Array [ "129 x 198 x 24mm | 383g" ]
publicationDate = 'Publication date 01 Sep 2015'
publicationDate.match(new RegExp(/\d{2} \w{3} \d{4}/))
>> Array [ "01 Sep 2015" ]
publisher = 'Publisher Penguin Books Ltd'
publisher.match(new RegExp(/[^Publisher].+/))
>> Array [ " Penguin Books Ltd" ]
imprint = 'Imprint PENGUIN CLASSICS'
imprint.match(new RegExp(/[^Imprint].+/i))
>> Array [ " PENGUIN CLASSICS" ]
publicationCountry = 'Publication City/Country London, United Kingdom'
publicationCountry.match(new RegExp(/[^(?!Publication City\/Country )].+/))
>> Array [ "London, United Kingdom" ]
language = 'Language English'
language.match(new RegExp(/[^(?!language)].+/i))
>> Array [ " English" ]
editionStatement = 'Edition Statement UK ed.'
editionStatement.match(new RegExp(/[^(?!edition statement)].+/i))
>> Array [ "UK ed." ]
isbn10 = 'ISBN10 024120013X'
isbn10.match(new RegExp(/[^(?!isbn10)].+/i))
>> Array [ " 024120013X" ]
isbn13 = 'ISBN13 9780241200131'
isbn13.match(new RegExp(/[^(?!isbn13)].+/i))
>> Array [ " 9780241200131" ]
bestsellerRank = 'Bestsellers rank 7,918'
bestsellerRank.match(new RegExp(/[^(?!bestsellers rank)].+/i))
>> Array [ "7,918" ]
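Several of the patterns above rely on a negated character class (`[^...]`), which excludes individual characters rather than the literal label. They produce the outputs shown, but a value that begins with letters also found in the label can lose its first characters. The sketch below shows one possible alternative that anchors on the label as a literal prefix. `escapeRegExp` and `stripLabel` are hypothetical helpers, not part of this repo, and the "Paris, France" input is a made-up sample.

```javascript
// Hypothetical helper: escape regex metacharacters so a label is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the text after `label`, or null if the line does not start with it.
function stripLabel(line, label) {
  const pattern = new RegExp("^\\s*" + escapeRegExp(label) + "\\s+(.+)$", "i");
  const match = line.match(pattern);
  return match ? match[1].trim() : null;
}

const publisher = stripLabel("Publisher Penguin Books Ltd", "Publisher");
const city = stripLabel(
  "Publication City/Country Paris, France",
  "Publication City/Country"
);

// For comparison: the character-class pattern skips "P", "a", "r" and "i",
// because each of those characters also appears inside the brackets.
const truncated = "Publication City/Country Paris, France".match(
  /[^(?!Publication City\/Country )].+/
)[0];

console.log(publisher); // "Penguin Books Ltd"
console.log(city);      // "Paris, France"
console.log(truncated); // "s, France"
```

Trimming inside `stripLabel` also removes the leading space that some of the patterns above keep in their results.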
Replacing multiple newlines with a single newline:
description.replace(/[\r\n\s]{2,}/g,"\n")
results.forEach(result => {
result.description = result.description.replace(/[\r\n\s]{2,}/g, "\n");
result.description = result.description.replace(/[\r\n\s]*(show more)[\r\n\s]*$/, "");
result.description = result.description.trim();
});
description
>> "
Description
With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture.
show more
";
description.match(new RegExp(/[^(?!\n\sdescription)].+/i))
>> Array [ "With its astounding hardcover reviews Richard Zenith's new complete translation of THE BOOK OF DISQUIET has now taken on a similar iconic status to ULYSSES, THE TRIAL or IN SEARCH OF LOST TIME as one of the greatest but also strangest modernist texts. An assembly of sometimes linked fragments, it is a mesmerising, haunting 'novel' without parallel in any other culture. " ]
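The description steps above could be combined into a single function along these lines. This is a minimal sketch, not the repo's implementation: `cleanDescription` is a hypothetical name, and the second paragraph in the sample input is made up to show the blank-line handling.

```javascript
// Hypothetical helper: normalise a scraped description block.
function cleanDescription(raw) {
  return raw
    .replace(/^\s*Description\s*/i, "") // drop the leading "Description" heading
    .replace(/\s*show more\s*$/i, "")   // drop the trailing "show more" link text
    .replace(/\s*[\r\n]+\s*/g, "\n")    // collapse each run of line breaks into one newline
    .trim();
}

const raw =
  "\nDescription\nAn assembly of sometimes linked fragments.\n\n\nA second paragraph.\nshow more\n";
const cleaned = cleanDescription(raw);
console.log(JSON.stringify(cleaned));
// "An assembly of sometimes linked fragments.\nA second paragraph."
```

One difference from `/[\r\n\s]{2,}/g`: that pattern also turns two consecutive spaces inside a sentence into a newline, whereas the pattern here only rewrites whitespace runs that contain a line break.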