An easy web crawler with Java and Python
An easy web crawler written in Java.
Language: Java
Library: Jsoup
Current latest version: v2.0
Key points:
- Use Jsoup to request the page, with a delay on the access to its content
- Then use Jsoup to parse the result of the request into a Document object
- Navigate the page through the Document API, much as JavaScript navigates the DOM
Article introduction: Java crawler: getting the content of a specified node in a page
Language: Python
Libraries: urllib (standard library), beautifulsoup4 (third-party)
If you want to crawl all the a tags of a URL, you can try this beautifulsoup4-based crawler project I wrote.
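As a rough illustration of the idea (not the project's actual SmartWebCrawler.py source), collecting the href of every a tag with beautifulsoup4 might look like the sketch below. The function names extract_links and crawl are hypothetical, and the sketch assumes Python 3 with bs4 installed as described in the steps that follow.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the href of every <a> tag in html, resolved against base_url."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute (e.g. named anchors).
    for a_tag in soup.find_all("a", href=True):
        # urljoin turns relative links such as "/python/" into absolute URLs.
        links.append(urljoin(base_url, a_tag["href"]))
    return links


def crawl(url):
    """Download url and return the links found on that page (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        html = response.read()
    return extract_links(html, url)
```

Keeping the parsing in extract_links, separate from the download in crawl, lets the link extraction be tried on a plain HTML string without any network access.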
- Install Python
https://www.python.org/downloads/
- Download the BeautifulSoup Python library
https://www.crummy.com/software/BeautifulSoup/bs4/download/4.6/
- Update pip
python -m pip install --upgrade pip
- Change into the beautifulsoup folder and run
pip install bs4
- Run the program
- Usage one:
Request the default URL written in the code, and record all the URLs that the a tags' href attributes on that page point to.
python SmartWebCrawler.py
- Usage two:
Pass the URL on the command line, and record all the URLs that the a tags' href attributes on that page point to.
python SmartWebCrawler.py http://www.runoob.com/
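The two usages differ only in where the URL comes from. A minimal sketch of that choice is shown below; the pick_url helper and the DEFAULT_URL value are hypothetical placeholders, since the real default is whatever is written in SmartWebCrawler.py.

```python
import sys

# Placeholder only: the actual default URL lives in SmartWebCrawler.py.
DEFAULT_URL = "http://www.example.com/"


def pick_url(argv):
    """Return the URL given on the command line, or DEFAULT_URL if none was given."""
    # argv[0] is the script name, so a user-supplied URL is argv[1].
    if len(argv) > 1:
        return argv[1]
    return DEFAULT_URL


if __name__ == "__main__":
    print("Crawling:", pick_url(sys.argv))
```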
For more details, please see the article below:
Article introduction: Python crawler: getting the hyperlink URLs from all the a tags of a web page