Cydral / FFspider

Multi-threaded web crawler and data extraction tool.

FFspider

FFspider is a powerful web crawling and data extraction tool written in C++ using the Boost library. It allows you to efficiently crawl websites, extract valuable information, and perform various data processing tasks. Whether you need to scrape data, monitor websites for changes, or build your own web spider, FFspider provides a flexible and customizable solution.

Description

FFspider is a multi-threaded web crawler and data extraction tool designed for iterative site discovery. It is an all-in-one file that starts crawling a site from a URL you provide. Please note that this crawler is relatively simple and does not qualify as a "polite crawler": for instance, it does not manage the load it places on the target sites.
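To illustrate what multi-threaded, iterative discovery from a seed URL can look like, here is a minimal sketch that uses only the C++ standard library. It is not FFspider's actual code: the names `crawl` and `LinkFetcher` are hypothetical, and the page download and HTML parsing are abstracted behind a callback. Like the crawler described above, it applies no rate limiting.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical callback: given a URL, return the links found on that page.
// A real crawler would download the page and parse its HTML here.
using LinkFetcher = std::function<std::vector<std::string>(const std::string&)>;

// Crawl from `seed` with `num_threads` workers; returns every URL discovered.
std::set<std::string> crawl(const std::string& seed, const LinkFetcher& fetch_links,
                            std::size_t num_threads) {
    assert(num_threads > 0);
    std::mutex mtx;
    std::condition_variable cv;
    std::queue<std::string> frontier;  // URLs discovered but not yet fetched
    std::set<std::string> visited;     // URLs already queued or fetched
    std::size_t in_flight = 0;         // pages currently being fetched

    frontier.push(seed);
    visited.insert(seed);

    auto worker = [&] {
        for (;;) {
            std::string url;
            {
                std::unique_lock<std::mutex> lock(mtx);
                // Sleep until there is work, or until no work can ever arrive.
                cv.wait(lock, [&] { return !frontier.empty() || in_flight == 0; });
                if (frontier.empty()) return;  // drained and nobody is fetching
                url = std::move(frontier.front());
                frontier.pop();
                ++in_flight;
            }
            // Fetch outside the lock so workers run in parallel.
            std::vector<std::string> links = fetch_links(url);
            {
                std::lock_guard<std::mutex> lock(mtx);
                for (auto& link : links) {
                    // Queue each URL at most once.
                    if (visited.insert(link).second) frontier.push(std::move(link));
                }
                --in_flight;
            }
            cv.notify_all();
        }
    };

    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < num_threads; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return visited;
}
```

A caller supplies the seed URL, a fetcher, and a thread count, for example `crawl("https://example.com/", my_fetcher, 4)`, where `my_fetcher` is a placeholder for whatever downloads and parses a page. The `in_flight` counter lets idle workers tell "the queue is empty for now" apart from "the crawl is finished".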

FFspider was initially created for internal needs: building image databases for AI research projects, and creating CNN models and evaluating their relevance.

Disclaimer

Please exercise caution and adhere to privacy and legal regulations when using web scraping capabilities. Respect the terms of service and privacy policies of the websites you crawl, and ensure that you have the necessary permissions and rights to access and process the data. It is your responsibility to use this tool in a responsible and ethical manner.

Features

  • Multi-threaded crawling for improved performance.
  • Flexible data extraction using CSS selectors and XPath.
  • TODO: Support for handling JavaScript-rendered pages using headless browsers.
  • Extensible architecture for adding custom data processing and storage options.
  • In-memory object database system to maximize performance during the crawling and processing of images.
  • Automatic storage of images in a local cache directory during crawling, for future reuse.
  • Configurable options for controlling crawling behavior.
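As an illustration of the local image cache mentioned in the list above, the following sketch derives a stable cache file name from an image URL. This is an assumed scheme, not FFspider's documented behavior: the helper names `fnv1a64`, `url_extension`, and `cache_file_name` are hypothetical. It uses the FNV-1a hash rather than `std::hash` because a persistent cache needs names that are identical from one run to the next.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// 64-bit FNV-1a hash: deterministic across runs and platforms.
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;                // FNV prime
    }
    return h;
}

// Return the extension (for example ".jpg") of the last path segment of a URL,
// ignoring any query string or fragment. Returns "" when there is none.
std::string url_extension(const std::string& url) {
    const std::size_t npos = std::string::npos;
    std::size_t scheme = url.find("://");
    std::size_t host_start = (scheme == npos) ? 0 : scheme + 3;
    std::size_t end = url.find_first_of("?#", host_start);
    if (end == npos) end = url.size();
    std::size_t slash = url.rfind('/', end);
    if (slash == npos || slash < host_start) return "";  // URL has no path
    std::size_t dot = url.rfind('.', end);
    if (dot == npos || dot < slash) return "";           // no dot in last segment
    return url.substr(dot, end - dot);
}

// Build a cache file name: 16 hex digits of the URL hash plus the extension.
std::string cache_file_name(const std::string& url) {
    char hex[17];
    int n = std::snprintf(hex, sizeof hex, "%016llx",
                          static_cast<unsigned long long>(fnv1a64(url)));
    assert(n == 16);
    (void)n;  // silence unused-variable warnings when assertions are disabled
    return std::string(hex) + url_extension(url);
}
```

Before downloading an image, a crawler could check whether a file with this name already exists in the cache directory and reuse it if so. Hashing the full URL keeps names short and safe for the file system, at the cost of names that are not human-readable.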

Installation

Prerequisites

The FFspider program has several external dependencies that need to be installed before use:

  • Gumbo: Gumbo is a library used for parsing HTML and extracting information from web pages.
  • Sqlite_orm: Sqlite_orm is a lightweight header-only C++ library for easy object-relational mapping (ORM) with SQLite.
  • Boost: Boost provides various libraries for C++ programming, including utilities, algorithms, and data structures.
  • Dlib: Dlib is a general-purpose cross-platform C++ library that includes machine learning algorithms and tools for image processing.
  • Cpr: Cpr is a C++ library for making HTTP requests.

Please make sure to install these dependencies before proceeding with the FFspider program.

Building FFspider

FFspider is primarily designed to run on the Microsoft Windows platform (Windows 10 and above). The compilation process has been tested using Microsoft Visual Studio 2022. To build FFspider, please follow these steps:

  1. Clone the FFspider repository from GitHub.
  2. If you haven't already, install the vcpkg package manager: git clone https://github.com/Microsoft/vcpkg.git
  3. Integrate vcpkg with Visual Studio: vcpkg integrate install
  4. Install the required packages using vcpkg, one command per package. For example, to install Gumbo: vcpkg install gumbo:x64-windows
  5. Create or open the project in Microsoft Visual Studio 2022.
  6. It is recommended to compile FFspider in x64 mode for optimal performance and compatibility.
  7. Build the FFspider project.

After a successful build, you can run FFspider to crawl websites and to process and store images on a local machine.

License

FFspider is released under the MIT License.

Contact

For any questions or feedback, feel free to reach out to us through the GitHub repository.

Happy crawling with FFspider!
