This project is a boilerplate for scraping both JSON APIs and webpages. It stores the scraped mortgage information in a SQLite database and builds on Playwright, X-Ray, node-schedule, and winston.
Ensure you have the following installed on your machine:
- Node.js (version 14.x or above)
- pnpm (package manager)
- Clone the repository:
    git clone https://github.com/yourusername/webscrape-boilerplate.git
    cd webscrape-boilerplate
- Install dependencies:
    pnpm install
- Create the SQLite database:
    pnpm createDB
To run the scraping tasks immediately, use the start:now script:
    pnpm start:now
This will compile the TypeScript code and execute the scraping tasks right away.
By default, the scraping tasks are scheduled to run at specific intervals using node-schedule:
- JSON scraping task: scheduled to run every day at midnight.
- Webpage scraping task: scheduled to run every day at 1 AM.
To start the service with its scheduling:
    pnpm start
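The two daily slots above can be illustrated without the scheduler itself. The sketch below is not taken from src/index.ts and does not call node-schedule. It only computes when a job set for a given hour would next fire in local time. The names `DAILY_TASKS` and `nextDailyRun` are hypothetical, and the cron strings are an assumption about how the times might be written in cron syntax.

```typescript
// Returns the next local time at which a daily job set for `hour`:00 fires.
function nextDailyRun(hour: number, from: Date): Date {
  const next = new Date(from.getTime());
  next.setHours(hour, 0, 0, 0);
  // If today's slot is not strictly in the future, move to tomorrow.
  if (next.getTime() <= from.getTime()) {
    next.setDate(next.getDate() + 1);
  }
  return next;
}

// Hypothetical description of the two tasks; the cron strings are assumed.
const DAILY_TASKS = [
  { name: "JSON scraping", cron: "0 0 * * *", hour: 0 },
  { name: "Webpage scraping", cron: "0 1 * * *", hour: 1 },
];

const now = new Date();
for (const task of DAILY_TASKS) {
  const next = nextDailyRun(task.hour, now);
  console.log(`${task.name} (${task.cron}) next runs at ${next.toString()}`);
}
```

If the service is started at 00:30, for example, the webpage task would run 30 minutes later, while the JSON task would wait until the following midnight.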
- src/database.ts: Sets up the SQLite database connection and provides functions to insert and retrieve mortgage information.
- src/createDB.ts: Initializes and exports the SQLite database instance.
- src/index.ts: Schedules the scraping tasks using node-schedule and provides an option to run them immediately.
- src/logger.ts: Configures the winston logger for logging messages and errors.
- src/scraper-json.ts: Fetches mortgage information from a JSON API, retries on failure, and inserts the data into the SQLite database.
- src/scraper-webpage.ts: Scrapes mortgage information from a webpage using Playwright and X-Ray, then inserts the data into the SQLite database.
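The retry-on-failure behaviour mentioned for the JSON scraper could look roughly like the following. This is a minimal sketch, not the code in src/scraper-json.ts: the helper name `withRetry`, the default of three attempts, and the linearly growing delay are all assumptions made for illustration.

```typescript
// Hypothetical helper: retries an async operation up to `maxAttempts` times.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Wait before the next attempt; the delay grows with each failure.
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

A scraper could wrap its network call in such a helper, for example `withRetry(() => loadRates())`, where `loadRates` stands in for whatever function performs the request.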
The database is stored as db/sqlite3.db by default. Logging is configured to write error logs to error.log and all logs to combined.log.
If you need to scrape different data or use another API, you can modify the following:
- API URL: Change the API_URL in src/scraper-json.ts to point to your desired API.
- Webpage URL: Change the PAGE_URL in src/scraper-webpage.ts to point to your target webpage.
- Data Structure: Adjust the data extraction fields within the fetchMortgageInfo() and fetchMortgageInfoFromWebpage() functions to match the structure of your source data.
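Adapting the data structure usually comes down to mapping each raw record onto the fields you store. The sketch below shows one way to do that. The `MortgageInfo` and `RawRecord` shapes, their field names, and the `toMortgageInfo` function are hypothetical and will not match the boilerplate's actual schema.

```typescript
// Hypothetical shape of one stored row; the real columns may differ.
interface MortgageInfo {
  lender: string;
  product: string;
  ratePercent: number;
}

// Hypothetical shape of one record as returned by the source.
interface RawRecord {
  bank_name?: string;
  product_name?: string;
  interest_rate?: string | number;
}

// Parses values such as 4.25 or "4.25%"; returns null when unusable.
function parseRate(value: string | number | undefined): number | null {
  if (value === undefined) return null;
  const text = String(value).replace("%", "").trim();
  if (text === "") return null;
  const rate = Number(text);
  return Number.isFinite(rate) ? rate : null;
}

// Maps a raw record to a row, or null if a required field is missing.
function toMortgageInfo(raw: RawRecord): MortgageInfo | null {
  const rate = parseRate(raw.interest_rate);
  if (!raw.bank_name || !raw.product_name || rate === null) return null;
  return { lender: raw.bank_name, product: raw.product_name, ratePercent: rate };
}
```

Returning null for incomplete records lets the caller skip them before inserting, rather than writing partial rows to the database.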
The project uses winston for logging:
- Errors are logged to error.log.
- All logs are combined and saved to combined.log.
Logs are also output to the console when the environment is not set to production.
Feel free to contribute to this project by submitting issues or pull requests on the GitHub repository.
This project is licensed under the ISC License. See the LICENSE file for details.
This should get you started with the web scraping boilerplate. Customize it as needed, and happy scraping!