Wpull is a Wget-compatible web downloader and crawler. It can be thought of as a Wget remake, clone, replacement, or alternative.
Features:
- Written in Python: lightweight & robust
- Familiar Wget options and behavior
- Graceful stopping and resuming
- Python & Lua scripting support
- Modular, extensible, & asynchronous API
- PhantomJS integration
Wpull is currently beta quality. Some features are not yet implemented, and the API is not considered stable.
Requires:
- Python 2.6, 2.7, 3.2, 3.3 (or newer)
- Tornado
- Toro
- lxml
- chardet
- BeautifulSoup4
- SQLAlchemy
- Lunatic Python (bastibe version) (optional for Lua support)
- PhantomJS (optional)
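As a quick sanity check before installing, the sketch below reports which of the Python requirements above can be found in the current environment. It is only an illustration: the helper name `check_requirements` is hypothetical, the import names (`tornado`, `toro`, `lxml`, `chardet`, `bs4`, `sqlalchemy`) are the usual ones for these packages, and it assumes Python 3.4 or newer for `importlib.util.find_spec`. Lunatic Python is left out because its module name depends on the build, and PhantomJS is looked up as an executable named `phantomjs` on the PATH, which is also an assumption.

```python
import importlib.util
import shutil

# Usual import names for the required Python packages (assumed, not taken
# from the Wpull documentation).
REQUIRED_MODULES = ["tornado", "toro", "lxml", "chardet", "bs4", "sqlalchemy"]


def check_requirements():
    """Return a dict mapping each requirement name to True if it was found."""
    report = {}
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    # PhantomJS is optional and is an external program, not a Python module.
    report["phantomjs"] = shutil.which("phantomjs") is not None
    return report


for name, found in sorted(check_requirements().items()):
    print("{0}: {1}".format(name, "found" if found else "missing"))
```

Anything reported as missing (other than the optional PhantomJS) would need to be installed before Wpull can run.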
Once you install the requirements, install Wpull from PyPI using pip:
pip3 install wpull
For detailed installation instructions, please see http://wpull.readthedocs.org/en/master/install.html.
To download the About page of Google.com:
wpull google.com/about
To archive a website:
wpull billy.blogsite.example --warc-file blogsite-billy \
    --no-check-certificate \
    --no-robots --user-agent "InconspicuousWebBrowser/1.0" \
    --wait 0.5 --random-wait --waitretry 600 \
    --page-requisites --recursive --level inf \
    --span-hosts --domains blogsitecdn.example,cloudspeeder.example \
    --hostnames billy.blogsite.example \
    --reject-regex "/login\.php" \
    --tries inf --retry-connrefused --retry-dns-error \
    --delete-after --database blogsite-billy.db \
    --quiet --output-file blogsite-billy.log
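If you archive several sites with similar settings, it may help to assemble the command from a script. The sketch below builds a shortened version of the archive command above as an argument list. The function `build_archive_command` and its parameters are hypothetical names introduced for illustration; it uses only a subset of the options shown above and does not use the Wpull Python API.

```python
import shlex


def build_archive_command(host, warc_name, cdn_domains):
    """Build a wpull argument list for archiving one site (illustrative only)."""
    return [
        "wpull", host,
        "--warc-file", warc_name,
        "--no-robots",
        "--wait", "0.5", "--random-wait",
        "--page-requisites", "--recursive", "--level", "inf",
        # Allow page requisites hosted on the listed CDN domains.
        "--span-hosts", "--domains", ",".join(cdn_domains),
        # Restrict the crawl itself to the site's own hostname.
        "--hostnames", host,
        "--database", warc_name + ".db",
        "--output-file", warc_name + ".log",
    ]


command = build_archive_command(
    "billy.blogsite.example",
    "blogsite-billy",
    ["blogsitecdn.example", "cloudspeeder.example"],
)
# Print the command in a copy-and-paste friendly form.
print(" ".join(shlex.quote(arg) for arg in command))
# To actually run it, pass the list to subprocess.check_call(command).
```

Passing the arguments as a list avoids shell quoting problems with values such as regular expressions or user agent strings.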
To see all options:
wpull --help
Documentation is located at http://wpull.readthedocs.org/.
Need help? Please see our Help page, which contains frequently asked questions and support information.
The issue tracker is located at https://github.com/chfoo/wpull/issues.
Contributions and feedback are greatly appreciated.
Copyright 2013-2014 by Christopher Foo. License GPL v3.
This project contains third-party source code licensed under different terms:
- backport
- wpull.backport.argparse
- wpull.backport.collections
- wpull.backport.functools
- wpull.backport.tempfile
- wpull.backport.urlparse
- wpull.thirdparty.robotexclusionrulesparser
- wpull.thirdparty.tornado
We would like to acknowledge the authors of GNU Wget, as Wpull uses algorithms from Wget.