ndmen / scrapingbakery

This repository contains the implementation of a web scraping API designed to retrieve product information from a specified URL. The API is built using NestJS and employs asynchronous processing to handle requests efficiently.

A progressive Node.js framework for building efficient and scalable server-side applications.

Description

This repository contains the implementation of a web scraping API designed to retrieve product information from a specified URL. The API is built using NestJS and employs asynchronous processing to handle requests efficiently.

Features

  • Accepts a request containing a product ID and starts processing it asynchronously.
  • Responds immediately with an HTTP 200 status code and a unique process identifier.
  • Scrapes the target URL and transforms the website data into a unified JSON format.
  • Applies a 10-second delay to simulate data processing.
  • Returns a "not ready" status if queried with the process identifier before processing finishes.
  • Returns the final result from the same endpoint once processing is complete.
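
The request flow described above can be sketched in plain TypeScript. This is a minimal illustration under stated assumptions, not the repository's actual code: the names `jobs`, `submit`, `getResult`, and `scrapeProduct` are hypothetical, a `Map` stands in for the cache, the scraper is a stub, and the "failed" status is an addition for error handling that the feature list does not mention.

```typescript
import { randomUUID } from "node:crypto";

// Possible states of a scraping job. "failed" is an assumption added for
// error handling; the feature list only names "not ready" and the result.
type Job =
  | { status: "not ready" }
  | { status: "done"; result: Record<string, unknown> }
  | { status: "failed" };

// Hypothetical in-memory store standing in for the project's cache.
const jobs = new Map<string, Job>();

// Hypothetical scraper stub; a real one would fetch and parse the product page.
async function scrapeProduct(productId: string): Promise<Record<string, unknown>> {
  return { productId };
}

// Registers a job and returns its process identifier right away.
// The scrape itself starts only after `delayMs` (10 seconds by default).
function submit(productId: string, delayMs = 10_000): string {
  const processId = randomUUID();
  jobs.set(processId, { status: "not ready" });
  setTimeout(() => {
    scrapeProduct(productId)
      .then((result) => jobs.set(processId, { status: "done", result }))
      .catch(() => jobs.set(processId, { status: "failed" }));
  }, delayMs);
  return processId;
}

// Looks up a job by its process identifier; undefined if the ID is unknown.
function getResult(processId: string): Job | undefined {
  return jobs.get(processId);
}
```

Calling `submit` returns at once, and `getResult` reports "not ready" until the delayed scrape has stored its result under the same identifier.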

Note

  • This project uses NestJS with a cache to track processing state and hold results. In a real-world scenario, Redis would be used for processing and PostgreSQL for storing results.

Data Retrieval Methods

To retrieve product information as per the requirements outlined in the task, the following methods were considered:

  1. Open Graph in Meta Tags: Parsing meta tags with Open Graph protocol to extract product information.

  2. Schema Parsing: Extracting product details from structured data using schema markup.

  3. HTML Markup Parsing: Parsing HTML markup to identify and extract product information.

  4. Script Tag Parsing: Extracting data from JavaScript scripts embedded within the HTML.

Script Tag Parsing was chosen for this task because the data embedded in the page's scripts contained the product identifiers and specifications required for further processing.
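
As a rough illustration of script tag parsing, the sketch below collects every `<script>` body that happens to be valid JSON. It is not the repository's implementation: the function name `extractJsonFromScripts` and the sample HTML are hypothetical, and the regular expression is only suitable for simple, well-formed pages. A real scraper would more likely use an HTML parser and target a specific script element.

```typescript
// Returns the parsed contents of all <script> tags whose body is a JSON
// object or array. Scripts containing ordinary JavaScript are skipped.
function extractJsonFromScripts(html: string): unknown[] {
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  const results: unknown[] = [];
  for (const match of html.matchAll(scriptPattern)) {
    const body = (match[1] ?? "").trim();
    if (body === "") continue;
    try {
      const parsed: unknown = JSON.parse(body);
      // Keep only objects and arrays; bare numbers or strings are not useful here.
      if (typeof parsed === "object" && parsed !== null) {
        results.push(parsed);
      }
    } catch {
      // Not JSON, so this is regular JavaScript; ignore it.
    }
  }
  return results;
}

// Hypothetical page fragment: one plain script and one JSON payload.
const sampleHtml = `
  <html><head>
    <script>window.dataLayer = [];</script>
    <script type="application/json">{"product":{"id":"sku-123","title":"Sample shoe"}}</script>
  </head></html>`;

const found = extractJsonFromScripts(sampleHtml);
console.log(JSON.stringify(found));
```

The extracted objects would then be mapped onto the unified JSON format. That mapping depends on the target site's data layout and is not shown here.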

Installation

$ npm install

Running the app

# watch mode
$ npm run start:dev

Using the documentation

Open the Swagger UI at http://localhost:3000/swagger/#/scraper/ScraperController_scrapeProduct and send a POST request with the following body:

{
  "productId": "air-presto-mens-shoes-JlLlWz"
}
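
Because the result only becomes available once processing finishes, a client has to query again with the returned process identifier. The helper below is a hedged sketch of such a polling loop. The exact route and response shape are not documented here, so the HTTP call is left to a caller-supplied `check` function. `pollUntilReady`, `PollResult`, and the default interval and attempt count are all hypothetical.

```typescript
// Outcome of a single status check: either still pending or ready with a value.
type PollResult<T> = { ready: false } | { ready: true; value: T };

// Calls `check` repeatedly until it reports a ready result, waiting
// `intervalMs` between attempts. Rejects after `maxAttempts` checks.
async function pollUntilReady<T>(
  check: () => Promise<PollResult<T>>,
  intervalMs = 1_000,
  maxAttempts = 15,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outcome = await check();
    if (outcome.ready) return outcome.value;
    if (attempt < maxAttempts) {
      // Wait before the next attempt.
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Result not ready after ${maxAttempts} attempts`);
}
```

In practice `check` would wrap the request to the scraper endpoint and map a "not ready" response to `{ ready: false }`. With the defaults above, the loop covers the 10-second processing delay with some margin.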

Support

Nest is an MIT-licensed open source project. It can grow thanks to the sponsors and support by the amazing backers. If you'd like to join them, please read more here.

License

Nest is MIT licensed.

Languages

  • TypeScript: 93.2%
  • JavaScript: 6.8%