davide-ceretti / web-crawler

A web crawler that generates a sitemap for a given domain

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Alt text Alt text

Assignment

Create a web crawler that generates a sitemap for a given domain.

Create an application that takes a domain name as a parameter (e.g. google.com) and crawls the site producing an XML sitemap.

Requirements:

  • Written in Ruby, Python or Clojure.
  • Visit each page only once.
  • Extract links from ‘href’ attributes on ‘a’ tags.
  • Ignore links to other domains.
  • Ignore links to subdomains other than ‘www’.
  • Produce an output file called ‘sitemap.xml’ that contains one entry per crawled page. (see http://www.sitemaps.org/protocol.html).
  • Each entry should contain a and element.
  • The tag should contain the absolute URL to the page.
  • The tag should contain how many times the page was linked to by other pages, scaled to fit in the range 0 to 1 (rounding to 2 decimal places).

Example Output

<?xml version="1.0" encoding="UTF­8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.google.com/</loc>
        <priority>1</priority>
    </url>
    <url>
        <loc>http://www.google.com/about</loc>
        <priority>0.97</priority>
    </url>
    .....
</urlset>

No 3rd party libraries are allowed for the main functionality (web­crawling, html parsing, writing XML).

System requirements

  • Python 2.7

Installation

pip install -r requirements.txt

Run

From the directory that contains this README.md:

  • python -m crawler --help
  • python -m crawler <domain> > sitemap.xml

Example:

python -m crawler http://proxybay.info/ > /tmp/sitemap.xml

Tests

python -m crawler.tests

About

A web crawler that generates a sitemap for a given domain

License:MIT License


Languages

Language:Python 100.0%