NhokCrazy199 / prowac

Progressive Web App Crawler

Description

PROWAC, the Progressive Web App Crawler, crawls a specified set of sites and checks each one for indicators that it is a Progressive Web App. The technologies we check for include the Web App Manifest, Service Workers, the W3C Push API, and others.
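As a rough illustration, a probe might fetch a page and scan its HTML for these indicators. The sketch below is illustrative only; the function name and the specific checks are assumptions, not the project's actual probe API.

    // Illustrative sketch only -- not the project's actual probe API.
    const https = require('https');

    // Fetch a page and check for two common PWA indicators: a Web App
    // Manifest <link> tag and an inline service worker registration.
    function probeForPwaIndicators(url) {
      return new Promise((resolve, reject) => {
        https.get(url, (res) => {
          let html = '';
          res.on('data', (chunk) => { html += chunk; });
          res.on('end', () => resolve({
            hasManifestLink: /<link[^>]+rel=["']?manifest/i.test(html),
            registersServiceWorker: /serviceWorker\.register\(/.test(html),
          }));
        }).on('error', reject);
      });
    }

    probeForPwaIndicators('https://example.com').then(console.log);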

Design

The project is separated into three pieces: the crawler, the dashboard, and the data store. The source directory contains three corresponding subdirectories: crawler, dashboard, and dataStore.

Crawler - top/crawler

  • main.js - The main driver for the crawler. It uses the urlJobPopulator module to populate its list of jobs, uses the urlJobProcessor module to process each of those jobs, and stores the output of each job using the dataStore module (see the wiring sketch after this list).
  • urlJobPopulator.js - This module is responsible for supplying the website URLs that the crawler should examine. It selects and loads a backend from the urlJobPopulatorBackends/ directory and delegates its work to that backend.
  • urlJobPopulatorBackends/ - This directory contains the backends available for populating URL jobs. The most important is alexa.js, which fetches the Alexa top 1 million sites list and populates the URL job list from that data.
  • urlJobProcessor.js - This module is responsible for fetching site data and running probes on each URL job. The results are passed back to the caller. See the Description above for examples of what probes might check for.
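For orientation, the driver's control flow might look roughly like the following. The module paths follow the layout above, but the method names (populate, processJob, store) are assumptions for illustration, not the project's actual exports.

    // Rough sketch of the main.js control flow; method names are assumed.
    const urlJobPopulator = require('./urlJobPopulator');
    const urlJobProcessor = require('./urlJobProcessor');
    const dataStore = require('../dataStore/dataStore');

    async function run() {
      // Populate the job list, e.g. from the Alexa top sites backend.
      const jobs = await urlJobPopulator.populate();
      for (const job of jobs) {
        // Fetch site data and run the PWA probes on this URL.
        const results = await urlJobProcessor.processJob(job);
        // Persist the probe output for the dashboard to read later.
        await dataStore.store(job.url, results);
      }
    }

    run().catch((err) => console.error(err));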

Dashboard - top/dashboard

This section is in progress.

DataStore - top/dataStore

  • dataStore.js - This module provides an API for storing the output of processed URL jobs and for retrieving that data for the dashboard. It loads a backend and delegates data storage and retrieval to that backend.
  • backend-*.js - These are the backends available for the main data store module to load. Each of these files implements the same interface using a different storage technology, such as storing data in memory or in a redis DB (a minimal in-memory sketch follows this list).
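Assuming the shared interface is roughly a store/retrieve pair (the actual method names in backend-*.js may differ), an in-memory backend could be as small as:

    // Minimal in-memory backend sketch; method names are assumptions.
    const data = new Map();

    module.exports = {
      // Persist the probe results for a URL.
      store(url, results) {
        data.set(url, results);
        return Promise.resolve();
      },
      // Retrieve previously stored results, e.g. for the dashboard.
      retrieve(url) {
        return Promise.resolve(data.get(url));
      },
    };

Keeping the interface promise-based, even for the in-memory case, lets an asynchronous backend such as redis swap in without changing callers.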

Building and Running

npm install in the source directory should get you all the npm modules you need to get started.

npm run build will transpile/copy all the necessary files into the dist/ directory.

npm start will start both the crawler and the dashboard, using appropriate defaults.

npm test will run the project test suite.

Note: you'll want to install redis and start the redis server before you run the crawler or the dashboard with their default settings. The crawler uses kue to keep track of URL job tasks, and kue requires a running redis DB. The default backend for the dataStore module is the redis backend, which also requires a redis server.
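If your redis server isn't on the default host and port, kue's documented createQueue options accept redis connection settings. A brief sketch (the 'url job' type name here is just an example, not necessarily the name the crawler uses):

    const kue = require('kue');

    // Point kue at a specific redis server (these are the defaults).
    const queue = kue.createQueue({
      redis: { host: '127.0.0.1', port: 6379 },
    });

    // Enqueue a job...
    queue.create('url job', { url: 'https://example.com' }).save();

    // ...and process jobs of that type.
    queue.process('url job', (job, done) => {
      console.log('processing', job.data.url);
      done();
    });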

./node_modules/kue/bin/kue-dashboard will start a server that you can connect to in a browser to see the status of any current URL jobs. If you're using alternative redis settings, you'll have to pass them to kue-dashboard on the command line, for example:
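The port (-p) and redis connection string (-r) flags below come from kue's own documentation; adjust the values to your setup.

    ./node_modules/kue/bin/kue-dashboard -p 3050 -r redis://127.0.0.1:6379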


License

Mozilla Public License 2.0


Languages

JavaScript 92.6%, HTML 7.2%, CSS 0.2%