geekxingyun / smart-web-crawler

A simple web crawler written in Java and Python.

SmartWebCrawler

A simple web crawler with one implementation in Java and one in Python.

Easy Web Crawler With Java

A simple web crawler written in Java.

Language: Java
Library: Jsoup
Latest version: v2.0

Key Points:

  1. Use Jsoup to request the content of the page, with a delay

  2. Use Jsoup to parse the response into a Document object

  3. Navigate the page through the Document API, much as JavaScript does with the DOM

Article introduction: Using a Java crawler to get the content of a specified node in a page

Easy Web Crawler With Python

Language: Python

Libraries: urllib (standard library), beautifulsoup4 (third-party)

If you want to collect every <a> tag on a page, you can try this beautifulsoup4-based crawler project I wrote.

Get Started

  1. Install Python
https://www.python.org/downloads/
  2. Download the BeautifulSoup Python library
https://www.crummy.com/software/BeautifulSoup/bs4/download/4.6/
  3. Upgrade pip
python -m pip install --upgrade pip
  4. Go into the beautifulsoup folder and run
pip install bs4
  5. Run the program
  • Usage one:

Request the default URL hard-coded in the script and record every URL that the <a href> tags on that page point to.

python SmartWebCrawler.py 
  • Usage two:

Pass the URL on the command line and record every URL that the <a href> tags on that page point to.

python SmartWebCrawler.py http://www.runoob.com/
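The two usages above can be sketched roughly as follows. This is not the code of SmartWebCrawler.py, only a minimal illustration of the same idea with urllib and beautifulsoup4. The names DEFAULT_URL, pick_url, extract_links and main are hypothetical, and the default URL is a placeholder, since the URL hard-coded in the real script is not shown here.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Placeholder default (assumption); the real script hard-codes its own URL.
DEFAULT_URL = "http://example.com/"


def pick_url(argv, default=DEFAULT_URL):
    """Return the URL given as the first command-line argument, else the default."""
    return argv[1] if len(argv) > 1 else default


def extract_links(html, base_url):
    """Return the absolute target of every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href; urljoin resolves relative links.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def main(argv):
    """Fetch the chosen page and print one link per line."""
    url = pick_url(argv)
    with urlopen(url, timeout=10) as response:
        html = response.read()
    for link in extract_links(html, url):
        print(link)
```

Calling main(sys.argv) at the bottom of the file (after import sys) would make the sketch behave like the two usages: with no argument it requests the default URL, and with one argument it requests that URL instead.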

For more details, see the article below:

Article introduction: Using a Python crawler to get the hyperlink URLs from all <a> tags in a web page

Languages

Java 81.3%, Python 18.7%