WIP.
Crawl websites to capture the errors and browser reports they produce.
When using the default API endpoint, you can view the results at https://dash.trackx.app/projects/trackx-crawler or https://demo.trackx.app/projects/trackx-crawler.
It has 2 modes:
- Run against an imported list of websites, loading the home page and optionally deeper pages
- Crawl all pages within one or more target websites
TODO: Also need to create the config; show copy/paste-able sample
pnpx trackx-crawler --help
pnpx trackx-crawler run --max 100
- Install dependencies:
pnpm install
- Build the app:
pnpm run build
You can use any list of websites. This example uses the top 10 million most visited domains from Open Page Rank, an excellent free resource that's updated roughly every 3 months.
- Download the most visited websites CSV from Open Page Rank (direct link to zip).
- Unzip the file.
- Convert the CSV into a newline-separated list of websites using the sqlite3 CLI tool:
sqlite3 --noheader --cmd ".eqp off"
Then, in the SQLite shell, run each command:
sqlite> .import --csv top10milliondomains.csv site
sqlite> .output ./site-list.txt
sqlite> .mode list
sqlite> SELECT Domain FROM site LIMIT 100000;
sqlite> .exit
- Import the line-delimited list into a new crawler database:
txc import site-list.txt
- Edit crawler.config.json as necessary.
- Run the crawler:
txc run --max 100
To see all txc run options, run:
txc run --help
Or, to see all available actions, run:
txc --help
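If sqlite3 isn't available, the CSV-to-list step above can be approximated with awk. This is only a sketch under stated assumptions: `extract_domains` is a hypothetical helper (not part of the crawler), the CSV has a header row containing a `Domain` column (as the SQL query above implies), and no field contains an embedded comma.

```shell
# Hypothetical helper: print the "Domain" column of a CSV, one value per line.
# Assumes a header row and no commas inside any field.
extract_domains() {
  csv_file=$1
  limit=${2:-100000}   # same row limit as the SQL query above
  awk -F',' -v limit="$limit" '
    # Drop CR line endings and quote characters; fields are re-split afterwards.
    { gsub(/\r/, ""); gsub(/"/, "") }
    NR == 1 {
      # Find the index of the Domain column in the header row.
      for (i = 1; i <= NF; i++) if ($i == "Domain") col = i
      if (!col) { print "no Domain column found" > "/dev/stderr"; exit 1 }
      next
    }
    NR - 1 > limit { exit }   # stop once the limit of data rows is reached
    { print $col }
  ' "$csv_file"
}
```

For example, `extract_domains top10milliondomains.csv 100000 > site-list.txt` would produce a file that can be passed to `txc import` in the same way as the sqlite3 output.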
Normally a TrackX API instance runs behind a reverse proxy server like Nginx or HAProxy for rate limiting. While this is ideal for real-world use cases, the crawler is a special exception where we don't want rate limiting.
The crawler normally generates a lot of requests to the TrackX API, so we need a way around the rate limiting proxy. One secure way to achieve this is SSH port forwarding.
To forward port 8000 from your remote server to port 8888 on your local machine:
ssh -L 8888:localhost:8000 user@yourserver
Then update crawler.config.json:
"API_ENDPOINT": "http://127.0.0.1:8888/v1/xxxxxxxxxxx",
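When switching between the proxied and forwarded endpoints often, the config edit above could be scripted. The following is a minimal sketch, not part of the crawler: `set_api_endpoint` is a hypothetical helper, and it assumes the `API_ENDPOINT` key and its string value sit on a single line of crawler.config.json, as in the snippet above.

```shell
# Hypothetical helper: rewrite the API_ENDPOINT value in a JSON config file.
# Assumes the key and its string value are on one line of the file.
set_api_endpoint() {
  config_file=$1
  endpoint=$2
  tmp_file=$(mktemp) || return 1
  # "|" is the sed delimiter, so the endpoint must not contain "|", "&", or "\".
  sed "s|\(\"API_ENDPOINT\"[[:space:]]*:[[:space:]]*\)\"[^\"]*\"|\1\"$endpoint\"|" \
    "$config_file" > "$tmp_file" && cat "$tmp_file" > "$config_file"
  status=$?
  rm -f "$tmp_file"
  return $status
}
```

For example, `set_api_endpoint crawler.config.json "http://127.0.0.1:8888/v1/xxxxxxxxxxx"` would point the crawler at the forwarded port while leaving the rest of the file untouched.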
Report any bugs you encounter on the GitHub issue tracker.
MIT license. See LICENSE.
© 2022 Max Milton