BatuhanW / ruby-scraper

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Tiny Restful API for parsing h1, h2, h3 and links for given url.

Endpoints

GET 'v1/indexes'

POST 'v1/indexes', url: 'https://www.google.com'

Index Model

default: <Index id: nil, url: "", content: {"h1"=>"", "h2"=>"", "h3"=>"", "links"=>[]}, status: "pending", created_at: nil, updated_at: nil>

indexed: <Index id: 583916550, url: "https://www.wikipedia.com", content: {"h1"=>"Heading 1", "h2"=>"Heading 2", "h3"=>"Heading 3", "links"=>["https://www.google.com", "https://www.wikipedia.com"]}, status: "indexed", created_at: "2017-01-26 06:58:56", updated_at: "2017-01-26 06:58:56">

Flow

  • Posting indexes url, will create Index record on the database with given url.

  • Index model will trigger IndexerJob to parse URL's elements, save to content json and update the status.

  • Indexer Job will parse html with ParserService.

About


Languages

Language:Ruby 99.3%Language:HTML 0.7%