CaptainScraper is a Node.js web-scraping framework. It lets developers build simple or complex scrapers in very little time. Take a moment to discover its features!
Install the following:
- Node.js (>= 5)
- MongoDB
- TypeScript (via npm)
- ts-node (via npm)
Clone the repository and install the required modules:
git clone git@github.com:andrewdsilva/CaptainScraper.git
cd vendor/
npm install
cd ..
Install the following:
- Docker: https://docs.docker.com/engine/installation/
- Docker Compose: https://docs.docker.com/compose/install/
Build an image of CaptainScraper from the Dockerfile:
At the command line, make sure the current directory is the root of the CaptainScraper project, where the Dockerfile is located.
docker build -t captainscraper:2.0 .
Now you can open a terminal in the container with Docker Compose:
docker-compose run app bash
# Manually start the MongoDB database (if you are not using Docker)
bash app/startDatabase.sh
# Execute a script located at /src/Sample/Allocine/Controller/AllocineCinemas.ts
ts-node app/console script:run Sample/Allocine/AllocineCinemas
# Equivalent
ts-node app/console script:run Sample/Allocine/Controller/AllocineCinemas
# Execute a script using docker-compose
docker-compose run app ts-node app/console script:run Sample/Allocine/AllocineCinemas
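The two equivalent commands above suggest a convention: the Controller folder segment may be omitted from the script path. A hypothetical resolver for that convention could look like this (illustrative only, not the framework's actual code):

```typescript
// Hypothetical sketch of the path convention shown above:
// "Sample/Allocine/AllocineCinemas" and
// "Sample/Allocine/Controller/AllocineCinemas" name the same script.
function resolveScriptPath(name: string): string {
    const parts = name.split('/');
    // If the path has no Controller segment yet, insert one
    // just before the final (script) segment.
    if (!parts.includes('Controller')) {
        parts.splice(parts.length - 1, 0, 'Controller');
    }
    return parts.join('/');
}

console.log(resolveScriptPath('Sample/Allocine/AllocineCinemas'));
// Both forms resolve to Sample/Allocine/Controller/AllocineCinemas
```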
A controller is a class with an execute function that contains the main logic of your program. Every scraper has a controller. Here is an example of a controller declaration:
import { Controller } from '../../../../vendor/captainscraper/framework/Controller/Controller';

class MyFirstController extends Controller {

    public execute(): void {
        console.log( 'Hello world!' );
    }

}

export { MyFirstController };
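Stripped of the framework import, the pattern is an abstract base class exposing an execute hook. A self-contained sketch (illustrative only; the real Controller base class lives in vendor/ and provides more):

```typescript
// Self-contained sketch of the controller pattern; the framework's
// actual base class offers extra facilities such as module access.
abstract class Controller {
    public abstract execute(): void;
}

class MyFirstController extends Controller {
    public execute(): void {
        console.log('Hello world!');
    }
}

new MyFirstController().execute(); // prints "Hello world!"
```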
A parser is a function you create that reads information from a web page. There are several kinds of parsers; for example, HtmlParser lets you parse the page with cheerio, a jQuery-like library.
import { HtmlParser } from '../../../../vendor/captainscraper/modules/Scraper/Parser/HtmlParser';

class MyFirstParser extends HtmlParser {

    public name: string = 'MyFirstParser';

    public parse( $: any, parameters: any ): void {

        /* Finding users on the page */
        $( 'div.user' ).each(function() {
            console.log( 'User found: ' + $( this ).text() );
        });

    }

}

export { MyFirstParser };
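You do not need cheerio to see the idea: the parser above simply extracts data from markup. The same extraction is shown here with a plain regular expression, purely for illustration (the framework hands you a cheerio $ instead):

```typescript
// Illustrative only: a crude regex stand-in for what the cheerio-based
// parser above does, extracting the text of each <div class="user">.
function findUsers(html: string): string[] {
    const users: string[] = [];
    const pattern = /<div class="user">([^<]*)<\/div>/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(html)) !== null) {
        users.push(match[1]);
    }
    return users;
}

const page = '<div class="user">Alice</div><div class="user">Bob</div>';
console.log(findUsers(page)); // logs the two user names, Alice and Bob
```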
To load a page, use the addPage function of the Scraper module. In a controller, you can get a module like this:
let scraperModule : any = this.get('Scraper');
In a parser, you can get the Scraper module through the parent attribute of the class. This attribute references the Scraper instance that called the parser.
let scraperModule : any = this.parent;
Then call the addPage function with some parameters. This operation is queued!
let listPageParameters: any = {
    url    : 'https://www.google.fr',
    parser : MyParser
};

scraperModule.addPage( listPageParameters );
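The queueing behaviour can be pictured as a simple FIFO: addPage records the request, and pages are fetched later in the order they were added. A minimal illustrative sketch, not the framework's actual implementation:

```typescript
// Illustrative FIFO page queue; the real Scraper module's queue differs.
interface PageRequest {
    url: string;
    parser: string; // parser name, simplified for this sketch
}

class PageQueue {
    private items: PageRequest[] = [];

    // addPage only records the request; nothing is fetched yet.
    addPage(request: PageRequest): void {
        this.items.push(request);
    }

    // Later, requests are dequeued in the order they were added.
    next(): PageRequest | undefined {
        return this.items.shift();
    }
}

const queue = new PageQueue();
queue.addPage({ url: 'https://www.google.fr', parser: 'MyParser' });
queue.addPage({ url: 'https://www.google.fr/search', parser: 'MyParser' });
console.log(queue.next()?.url); // the first queued URL comes out first
```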
Here is a suggested way to organize your project, with separate folders for controllers and parsers.
captainscraper/
├─ app/
├─ src/
│  └─ MyProject/
│     ├─ Controller/
│     │  └─ MyController.ts
│     └─ Parser/
│        ├─ MyFirstParser.ts
│        └─ MySecondParser.ts
└─ vendor/
Coming soon...