Zeiver

A Scraper, Downloader, & Recorder for open directories

Zeiver is designed to scrape and download content recursively from ODs (open directories). It also provides a means of recording links and scanning ODs for content.

*Zeiver does not download the entire OD itself, only the files.

For ease of use, check out the Zeiver configurator.


Features

Zeiver currently has 4 major modules:

  • Grabber (HTTP)
    • Grabs content from the internet (webpages, files, etc.)
  • Scraper
    • Recursively grabs all links from an OD.
  • Downloader
    • Downloads content retrieved from the Scraper (or from a file)
  • Recorder
    • Saves a record of all files that were found in the OD
    • Records are saved to a file called URL_Records.txt. The name can be changed using --output-record
    • Creates stat files (JSON files containing statistical data about what was retrieved)

All components can be used independently.

Normal Workflow

The Grabber module repeatedly grabs a webpage for the Scraper to parse (based on parameters). The Scraper takes each webpage and recursively scrapes the links from it. Afterwards, the links are sent to the Recorder (disabled by default), enabled with:

  • --record-only
  • --record

AND/OR to the Downloader (enabled by default). The Downloader uses the Grabber to download each file's data from the internet, then writes that data to newly created files.
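
For instance, recording & downloading can be combined in one run by enabling the Recorder alongside the (default-enabled) Downloader. The URL below is a placeholder:

Ex: zeiver --record -o "./downloads" example.com/xms/imgs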

More

  1. Uses an asynchronous runtime.
  2. Random & fixed delays between HTTP requests.
  3. Ability to customize which files are retrieved.
  4. Scans an OD for content while transparently displaying the traversal process.

Open Directory Support

Supported ODs can be found in OD.md.

Installation

  1. Install Rust.

    • If Rust is not installed, please follow the instructions here
  2. Once Rust is installed, open a CLI & type cargo install --branch main --git https://github.com/ZimCodes/Zeiver

    • This will install Zeiver from Github
  3. And that's it! To use Zeiver, start each command with zeiver.

Sample

The following command downloads files from example.com/xms/imgs, saves them in a local directory called Cool_Content, & sends requests with an ACCEPT-LANGUAGE header.

zeiver -H "accept-language$fr-CH, en;q=0.8, de;q=0.7" -o "./Cool_Content" example.com/xms/imgs

Commands

Positional

URLs...

Link(s) to the OD(s) you would like to download content from. *This is not needed if you are using -i, --input-file.


Options

General

-h, --help

Prints help information.

-V, --version

Prints version information.

-v, --verbose

Enables verbose output.

--test

Run a scrape test without downloading or recording.

--scan

Scan ODs

Scans ODs, displaying their content in the terminal. A shortcut for activating --verbose & --test.
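
For instance, to see what an OD contains without downloading or recording anything (placeholder URL):

Ex: zeiver --scan example.com/xms/imgs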


Download

-d, --depth

Specify the maximum depth for recursive scraping. Can also be used to traverse subpages (ODs with previous & next buttons). Default: 20. A depth of 1 is the current directory.
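
For instance, since a depth of 1 is the current directory, a depth of 3 scrapes the starting directory plus two levels of subdirectories (placeholder URL):

Ex: zeiver -d 3 example.com/xms/imgs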

-A, --accept

Files to accept for scraping

Using Regex, specify which files to accept for scraping. Only the files that match the regex will be acceptable for download. *This option takes precedence over -R, --reject.

Ex: zeiver -A "(mov|mp3|lunchbox_pic1\.jpg|(pic_of_me.gif))"

-R, --reject

Files to reject for scraping

Using Regex, specify which files to reject for scraping. Only the files that match the regex will be rejected for download. *-A, --accept takes precedence over this option.

Ex: zeiver -R "(jpg|png|3gp|(pic_of_me.gif))"


Recorder

--record

Activates the Recorder

Enables the Recorder, which saves the scraped links to a file. *This option cannot be used with --record-only.

--record-only

Save the links only

After scraping, instead of downloading the files, save the links to them. *The Downloader is disabled when this option is active; the Recorder is enabled instead.
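
For instance, to save the links found in an OD to the default URL_Records.txt without downloading anything (placeholder URL):

Ex: zeiver --record-only example.com/xms/imgs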

--output-record

Changes the name of the record file. This file is where the Recorder will store the links. Default: URL_Records.txt
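
For instance, a hypothetical run that records links under a custom file name instead of the default:

Ex: zeiver --record-only --output-record "Image_Links.txt" example.com/xms/imgs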

--no-stats

Prevents Recorder from creating _stat_ files.

The Recorder will no longer create stat files (Ex: stat_URL_Record.txt) when saving scraped links to a file. Default: false

--no-stats-list

Prevent Recorder from writing file names to stat files

Stat files include the names of all files in alphabetical order alongside the number of file extensions. This option prevents the Recorder from adding file names to stat files.


File/Directory

-i, --input-file

Read URLs from a file to be sent to the Scraper. *Each line represents a URL to an OD.

Ex: zeiver -i "./dir/urls.txt"

--input-record

Reads URLs from an input file which contains links to other files, and creates a stats file based on the results. This option is for those who have a file full of random, unorganized links to a bunch of other files and want to take advantage of Zeiver's Recorder module. *Each line represents a URL to a file. Activates the Recorder. Valid with --verbose, --output, & --output-record.
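
For instance, assuming ./dir/links.txt is a file whose every line is a URL to some file (the path is hypothetical):

Ex: zeiver --input-record "./dir/links.txt"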

-o, --output

Save Directory.

The local directory path to save files. Files saved by the Recorder are also stored here. Default: ./

Ex: zeiver -o "./downloads/images/dir"

-c, --cuts

Prevents a specified number of leading remote directories from being created locally. *Only available when downloading. Default: 0

Ex: URL: example.org/pub/xempcs/other/pics

Original Save Location: ./pub/xempcs/other/pics

zeiver --cuts 2 www.example.org/pub/xempcs/other/pics

New Save Location: ./other/pics

--no-dirs

Do not create a hierarchy of directories structured the same as the URL the file came from. All files will be saved to the current output directory instead.

*Only available when downloading.
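
For instance, the following hypothetical run saves every file directly into ./Cool_Content instead of recreating pub/xempcs/other/pics locally:

Ex: zeiver --no-dirs -o "./Cool_Content" example.org/pub/xempcs/other/pics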


Grabber

--print-headers

Prints all Response Headers to the terminal

Prints all available Response Headers received from each request to the terminal. This option takes precedence over all other options!

--print-header

Prints a Response Header to the terminal

Prints a specified Response Header to the terminal for each URL. This option takes precedence over all other options.

--https-only

Use HTTPS only

Restricts Zeiver to sending all requests through HTTPS connections only.

-H, --headers

Sets the default headers to use for every request. *Must use the 'header$value' format. Each header must also be separated by a comma.

Ex: zeiver -H content-length$128,"accept$ text/html, application/xhtml+xml, image/webp"

-U

The User Agent header to use. Default: Zeiver/VERSION

-t, --tries

The number of times to retry a failed connection/request. Default: 20

-w, --wait

Wait a specified number of seconds between each scraping & download request.

--retry-wait

The time (in seconds) to wait between each failed request. Default: 10
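
For instance, a hypothetical run that retries a failed request up to 5 times, waiting 30 seconds between attempts (placeholder URL):

Ex: zeiver -t 5 --retry-wait 30 example.com/xms/imgs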

--random-wait

Wait a random amount of seconds between each request.

The time between requests will vary from 0.5 × the --wait, -w value (inclusive) to 1.5 × the --wait, -w value (exclusive).
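
For instance, with -w 10 each delay is drawn from 5 seconds (0.5 × 10, inclusive) up to 15 seconds (1.5 × 10, exclusive). Placeholder URL:

Ex: zeiver -w 10 --random-wait example.com/xms/imgs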

-T, --timeout

Adds a request timeout for a specified number of seconds.
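
For instance, to give up on any request that takes longer than 30 seconds (placeholder URL):

Ex: zeiver -T 30 example.com/xms/imgs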

-r, --redirects

Maximum redirects to follow. Default: 10

--proxy

The proxy to use.

Ex: zeiver --proxy "socks5://192.168.1.1:9000"

--proxy-auth

The basic authentication needed to use the proxy. *Must use the 'username:password' format.
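
For instance, combining it with --proxy (the address & credentials below are placeholders):

Ex: zeiver --proxy "socks5://192.168.1.1:9000" --proxy-auth "username:password"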

--all-certs

Accepts all certificates (Beware!)

Accepts all certificates, even invalid ones. Use this option at your own risk!


Extra Info

URL is too long

Having trouble entering a long URL in the terminal? Place it inside an input file and use --input-file instead.

Can't access an OD because of certificates

Try using the --all-certs option, but be wary of it.

Content from OD exists, however Zeiver isn't scraping/recording/downloading/scouting any of them

Some ODs will send Zeiver HTML documents without any content (files/folders) from the OD. This is because Zeiver retrieves HTML documents without executing JavaScript, & some ODs will not work without it.


License

Zeiver is licensed under the MIT and Apache 2.0 Licenses.

See the MIT and Apache 2.0 licenses for more details.
