Allows you to "download" a page. This is useful in a few cases:
- there's no access to the "admin panel" for whatever reason,
- an old page is about to vanish,
- the resources need to be accessible offline,
- in the worst case, the malicious idea of copying/reusing other people's hard work :(
The main export is an asynchronous function `pageRipper`, which requires a configuration object:
- `dataPath` should be a folder in which folders for each page will be created,
- `dbPath` should be a folder in which the sqlite db file will be stored,
- `parsePost` should be a function that extracts information from the page html,
- `startingPages` optional string array with the urls of the first pages to crawl,
- `requestPause` optional number describing the pause between page requests.
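As a rough illustration of the options above, a configuration object might look like the sketch below. The folder names and the starting URL are placeholders, the unit of `requestPause` is assumed to be milliseconds (the source does not state it), and the `parsePost` stub returns an empty object only to keep the example short.

```javascript
// Sketch of a configuration object; all concrete values are placeholders.
const config = {
  // Folder in which a sub-folder is created for each page.
  dataPath: './data',
  // Folder in which the sqlite db file is stored.
  dbPath: './db',
  // Required: extracts information from the page html.
  // All returned properties are optional, so an empty object is accepted here.
  parsePost: ($, postUrl, rawHtml) => ({}),
  // Optional: urls of the first pages to crawl (placeholder url).
  startingPages: ['https://example.com/blog/first-post'],
  // Optional: pause between page requests (ASSUMED to be milliseconds).
  requestPause: 1000,
};
```

How `config.js` is expected to expose this object, and how it is handed to `pageRipper`, depends on the repo and is not shown here.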
`parsePost` is a function that must be passed in the configuration. It is used to extract information from the retrieved post html.
It accepts three arguments:
- `cheerio` instance with the loaded post body,
- post url,
- raw response html body.
It must return an object with optional properties:
- `id`, uniquified if it already exists by appending `__(count)` to it,
- `folderName`, name of the dir in which assets will be stored; if it's missing, no assets will be downloaded,
- `nextUrls`, array of urls that will be added to the queue and persisted in the database,
- `imageUrls`, array of urls for images.
The object can contain any other properties; they will all be persisted in the database.
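A minimal sketch of such a parser is shown below. The selectors (`h1`, `a[rel="next"]`, `img`), the slug-based `id`, and the extra `title` property are assumptions made for illustration; a real page will need its own selectors. Relative links are resolved against the post url with the standard `URL` class.

```javascript
// Hypothetical parsePost: $ is the cheerio instance, postUrl the page url,
// rawHtml the raw response body.
function parsePost($, postUrl, rawHtml) {
  // Resolve relative links against the url of the current post.
  const toAbsolute = (href) => new URL(href, postUrl).href;

  const title = $('h1').first().text().trim();
  // Lower-case slug built from the title (an assumed convention).
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');

  const nextHref = $('a[rel="next"]').attr('href');
  const imageUrls = $('img')
    .toArray()
    .map((el) => $(el).attr('src'))
    .filter(Boolean)
    .map(toAbsolute);

  return {
    id: slug,
    folderName: slug,
    nextUrls: nextHref ? [toAbsolute(nextHref)] : [],
    imageUrls,
    // Any extra property is persisted as well.
    title,
    htmlLength: rawHtml.length,
  };
}
```

Leaving out `folderName` in the returned object would skip the asset download for that post, as described above.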
- Download this repo.
- Install dependencies: `npm i`.
- In the file `config.js`, provide a valid config.
- Execute `start`: `npm run start`.
This app is currently not available in the NPM registry.
- Move all normalizations of urls to filenames to the crawler, instead of relying on a valid `parsePost` result.
- Decide what should be the unique key in the posts array, `url` or `id`?
  - if `id`, then how can it be forced/ensured to be unique by the parser?
  - if `url`, then how should query parameters, ports, and protocols be treated? Which part of the url should be "unique"?
  - maybe both fields should be combined into a unique property?
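To make the `url` option more concrete, here is one possible normalization, offered only as a sketch and not as a decision: it ignores the protocol and the fragment, drops default ports (the standard `URL` class does this on its own), strips trailing slashes, and sorts the query parameters. The helper name `urlKey` is hypothetical.

```javascript
// Hypothetical helper: derive a comparison key from a url.
function urlKey(rawUrl) {
  const url = new URL(rawUrl);
  // Sort query parameters so their order does not matter.
  url.searchParams.sort();
  const query = url.searchParams.toString();
  // Strip trailing slashes, but keep the root path as "/".
  const path = url.pathname.replace(/\/+$/, '') || '/';
  // url.host is already lower-cased and omits the protocol's default port.
  return url.host + path + (query ? '?' + query : '');
}
```

With this choice, `https://Example.com:443/a/?b=2&a=1#top` and `http://example.com/a?a=1&b=2` map to the same key, while a non-default port such as `:8080` keeps the urls distinct. Whether the protocol or the query string should be ignored is exactly the open question above.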