gioele-antoci / polo-crawler

Crawler that collects statistics on the distribution of HTML tags in a webpage, as well as the maximum DOM depth and other useful data.


polo-crawler

This repository serves as initial research for a web-based game called Polo.

Technology used

  • Node v7.0 with the --harmony flag to enable async/await (see the sketch below)
  • TypeScript as the programming language
  • Firebase as a schemaless database
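
A minimal sketch of the async/await style this setup enables; crawlSite is a hypothetical name, not the repo's actual API:

// On Node v7.0, async/await only works behind the harmony flag, e.g.:
//   node --harmony dist/index.js   (after compiling the TypeScript)
declare function crawlSite(url: string): Promise<void>; // hypothetical crawl step

async function main(): Promise<void> {
    // await suspends main() without blocking the event loop
    await crawlSite("https://example.com");
}

main().catch(err => console.error(err));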

What it does

Through third-party libraries (a BIG shout out to himalaya) and Node's require, I crawl websites and store some information about their HTML content. The data is stored in Firebase, a real-time, Google-powered database.
The data has the following structure:

type siteAnalytics = {
    website: string,
    maxDepth: number,
    elements: { [tag: string]: number },
    hrefObjs: string[],
    childrenCount: { [count: number]: number },
    isDeadEnd: boolean
};
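
For illustration, a populated record might look like this. The values, project URL, and key format are invented; the write itself uses the standard Firebase client SDK (ref().set()):

import * as firebase from "firebase";

// Hypothetical project config: databaseURL is all the database API needs here.
firebase.initializeApp({ databaseURL: "https://polo-crawler.firebaseio.com" });

const sample: siteAnalytics = {
    website: "example.com",
    maxDepth: 12,
    elements: { html: 1, body: 1, div: 85, a: 40, p: 22 },
    hrefObjs: ["https://example.com/about", "https://blog.example.com/"],
    childrenCount: { 0: 90, 1: 30, 2: 15, 3: 9 },
    isDeadEnd: false
};

// Store the record under a key derived from the site (key format is an assumption).
firebase.database().ref("sites/example-com").set(sample);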

maxDepth represents the deepest hierarchical level of the DOM for a website.
elements is a dictionary whose keys are HTML tags (e.g. a, div, ...) and whose values are the number of occurrences of each tag in the page. Only W3C-valid tags are included.
childrenCount is another dictionary. Its keys are the number of children a node has; its values are how many nodes have x children (with x being the key).
hrefObjs is an array of the hrefs found in the page. These will come in handy in the game to be developed.
isDeadEnd is true if no new hrefs are found on the page.
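
To make these definitions concrete, here is a sketch of how all three metrics could be computed in a single recursive walk over himalaya's JSON output; this is illustrative, not the repo's actual implementation:

import { parse } from "himalaya";

type metrics = {
    maxDepth: number,
    elements: { [tag: string]: number },
    childrenCount: { [count: number]: number }
};

function analyze(html: string): metrics {
    const m: metrics = { maxDepth: 0, elements: {}, childrenCount: {} };

    // himalaya returns an array of nodes; element nodes carry
    // type === "element", a tagName and a children array.
    const walk = (nodes: any[], depth: number): void => {
        for (const node of nodes) {
            if (node.type !== "element") continue;
            m.maxDepth = Math.max(m.maxDepth, depth);
            // the real crawler additionally filters to W3C-valid tags here
            m.elements[node.tagName] = (m.elements[node.tagName] || 0) + 1;
            const kids = node.children.filter((c: any) => c.type === "element").length;
            m.childrenCount[kids] = (m.childrenCount[kids] || 0) + 1;
            walk(node.children, depth + 1);
        }
    };

    walk(parse(html), 1);
    return m;
}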

New hrefs are stored in an array in the application scope. When we crawl a website we look at all the hrefs found; if the subdomain + domain combination has not been seen yet, we add the href to the array of hrefs yet to crawl. Otherwise it is dismissed.
Crawling continues as long as there are new hrefs to parse.
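
A sketch of that frontier logic, deduplicating on subdomain + domain with a Set; crawlPage is a hypothetical stand-in for the fetch-parse-store step:

import * as url from "url";

declare function crawlPage(href: string): Promise<string[]>; // returns hrefs found

const seen = new Set<string>();   // subdomain + domain combinations already queued
const frontier: string[] = [];    // hrefs yet to crawl

function enqueue(href: string): void {
    const host = url.parse(href).host;    // e.g. "blog.example.com"
    if (!host || seen.has(host)) return;  // dismissed: host already seen
    seen.add(host);
    frontier.push(href);
}

async function crawlAll(start: string): Promise<void> {
    enqueue(start);
    while (frontier.length > 0) {         // keep parsing until no new hrefs
        const hrefs = await crawlPage(frontier.shift()!);
        hrefs.forEach(enqueue);
    }
}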
