Crawly
Inspired by webmagic, Crawly is an open source web crawler framework for Java that provides a fine-grained component structure. With it, you can easily set up a web crawler for your production server.
The project is still under rapid development and may change on a daily basis. The source code has not been adequately tested, so you may run into a large number of bugs when using it.
Use it with caution.
Contribute
The Future Milestones section lists the features planned for upcoming releases. If you have other amazing features in mind, please post them on the Issues page of this repository; I'll reply ASAP.
Future Milestones
- Restart worker threads of ConcurrentCrawler when an exception occurs.
- Add test cases.
- Add support for proxies (HTTP and SOCKS).
- Add support for rate limiting.
- Add support for SMTP, POP3 and IMAP.
- Change the LICENSE.
- Release 0.1.
- Add more example crawlers and refactor the framework based on usage experience.
- Release 0.2.
- Distributed crawler.