saji89 / phpcrawler

A website crawler implemented in PHP

PHPCrawler

A project that crawls a website and counts the <img> tags on each page linked within the site.
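
The per-page counting step described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function name `countImgTags` is hypothetical, and it assumes the page HTML has already been fetched into a string. It uses PHP's bundled DOM extension.

```php
<?php
// Hypothetical helper (not taken from the project source):
// counts the <img> elements in an HTML string.
function countImgTags($html)
{
    // DOMDocument::loadHTML() rejects an empty string, so handle it up front.
    if (trim($html) === '') {
        return 0;
    }

    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->getElementsByTagName('img')->length;
}

$html = '<html><body><img src="a.png"><p>text <img src="b.png" alt=""></p></body></html>';
echo countImgTags($html), PHP_EOL;
```

Using a DOM parser rather than a regular expression keeps the count tolerant of attribute order, casing and broken markup.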

Requirements

  • PHP > 5.4
  • MySQL
  • PHPUnit

Setup

  • Install the required dependencies using Composer:
$ composer install
  • Create a new database on the MySQL server and run setupdb.sql in it to set up the required tables.
  • Update the database settings in the config.ini file.
  • Set the desired debug level in the verbosity section:
    • 0 - Don't show any errors.
    • 1 - Show only errors.
    • 2 - Show errors and warnings.
  • In the url section, set the URL to be crawled and the maximum number of URLs, out of all links retrieved and stored in the database, to be parsed for images.
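
To illustrate the configuration steps above, here is a sketch of what config.ini might contain and how PHP reads such a file. The section names `verbosity` and `url` come from this README; the `database` section name and every key name and value shown (`host`, `name`, `user`, `password`, `level`, `start`, `max_urls`) are assumptions for illustration only, so check the config.ini shipped with the project for the real ones.

```php
<?php
// Hypothetical config.ini contents; the real key names may differ.
$ini = <<<'INI'
[database]
host = "localhost"
name = "phpcrawler"
user = "crawler"
password = "secret"

[verbosity]
level = 1

[url]
start = "http://example.com/"
max_urls = 20
INI;

// The second argument (true) keeps the [section] grouping in the result.
// For a file on disk, parse_ini_file('config.ini', true) works the same way.
$config = parse_ini_string($ini, true);

// Values are parsed as strings by default, so cast numeric settings.
$verbosity = (int) $config['verbosity']['level'];
$maxUrls   = (int) $config['url']['max_urls'];

$labels = array(
    0 => 'no errors',
    1 => 'errors only',
    2 => 'errors and warnings',
);

echo 'Start URL: ', $config['url']['start'], PHP_EOL;
echo 'Max URLs:  ', $maxUrls, PHP_EOL;
echo 'Verbosity: ', $labels[$verbosity], PHP_EOL;
```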

How to run

$ php crawler.php

How to run the test cases

$ phpunit --bootstrap src/bootstrap.php src/Tests/CrawlerTest.php 

Future Ideas

  • Use Guzzle
  • Rewrite using some minimal PHP framework like Slim or Lumen
  • Implement better URL validation
  • Limit execution time
  • Limit the number of URLs crawled
  • Limit crawl depth (levels)
  • Allow links from other domains
  • Refactor code
  • Create a setup file to initialise the database.
  • Use the tput Linux command to get the dimensions of the console window
  • Implement proper unit tests, set up code coverage checks, etc., using Travis CI or another continuous integration tool.
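
As one possible shape for the time, URL-count and level limits listed above, the sketch below runs a breadth-first crawl with all three limits. It is only an illustration under stated assumptions: `crawlWithLimits` is a hypothetical name, and the `$links` array stands in for fetching a page and extracting its links, so no network access is involved.

```php
<?php
// Hypothetical sketch: $links maps each URL to the URLs it links to,
// standing in for "fetch the page and extract its links".
function crawlWithLimits(array $links, $startUrl, $maxUrls, $maxDepth, $maxSeconds)
{
    $deadline = microtime(true) + $maxSeconds;
    $queue    = array(array($startUrl, 0)); // pairs of (url, depth)
    $seen     = array($startUrl => true);   // URLs already queued
    $visited  = array();

    // Stop when the queue is empty, the URL limit is hit, or time runs out.
    while ($queue && count($visited) < $maxUrls && microtime(true) < $deadline) {
        list($url, $depth) = array_shift($queue);
        $visited[] = $url;

        // Level-based limit: visit this page but do not follow its links.
        if ($depth >= $maxDepth) {
            continue;
        }

        $children = isset($links[$url]) ? $links[$url] : array();
        foreach ($children as $child) {
            if (!isset($seen[$child])) {
                $seen[$child] = true;
                $queue[] = array($child, $depth + 1);
            }
        }
    }

    return $visited;
}

$links = array(
    '/'  => array('/a', '/b'),
    '/a' => array('/c'),
    '/b' => array('/c', '/d'),
    '/c' => array('/e'),
);

// At most 10 URLs, at most 1 level below the start page, at most 5 seconds.
print_r(crawlWithLimits($links, '/', 10, 1, 5.0));
```

Breadth-first order makes the depth limit straightforward to enforce, because every queued URL carries the level at which it was first discovered.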
