Cosa

Cosa is a simple web crawler that generates a database for use by other tools and reports.

Given a starting URL and domain name, Cosa parses the links returned by that URL and checks every link to web pages, images, CSS, and script files. Links to HTML pages on the same domain are parsed recursively until the queue is empty. The result for each URL and the link structure between URLs are stored in the database.
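The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not Cosa's actual code: the `SITE` hash is a hypothetical stand-in for fetching and parsing real pages, and the `crawl` method and its arguments are assumptions made for the example.

```ruby
require "set"
require "uri"

# Hypothetical stand-in for the network: maps a URL to the hrefs found on it.
# A real crawler would fetch each page over HTTP and parse its HTML instead.
SITE = {
  "http://www.example.com/" => ["/about", "/style.css", "http://other.example.org/"],
  "http://www.example.com/about" => ["/", "/logo.png"]
}.freeze

# Breadth-first crawl from start_url. Every link is recorded, but only URLs
# on `domain` are expanded into further links.
def crawl(start_url, domain, site = SITE)
  queue   = [start_url]
  seen    = Set.new([start_url])
  results = {} # url => absolute URLs that url links to

  until queue.empty?
    url = queue.shift
    on_domain = URI(url).host == domain
    hrefs = on_domain ? site.fetch(url, []) : []
    links = hrefs.map { |href| URI.join(url, href).to_s }
    results[url] = links
    # Set#add? returns nil when the link was already seen, so each URL
    # enters the queue at most once.
    links.each { |link| queue << link if seen.add?(link) }
  end

  results
end
```

Calling `crawl("http://www.example.com/", "www.example.com")` visits both pages on the domain and records the stylesheet, the image, and the external link without following the external site any further.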

Cosa only re-crawls a URL when its shelf life has expired or when you specifically request a re-fetch. The default shelf life is one day; this can be changed in the config.
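As a rough illustration of the shelf-life rule, the check might look like the sketch below. The method name `recrawl?`, the `force` flag, and the choice of seconds as the unit are assumptions for the example and may not match how Cosa stores or compares these values.

```ruby
ONE_DAY = 24 * 60 * 60 # default shelf life of one day, expressed in seconds

# Returns true when a URL should be fetched again.
# date_accessed: Time of the last fetch, or nil if the URL was never fetched.
# force: hypothetical flag standing in for an explicit re-fetch request.
def recrawl?(date_accessed, shelf_life: ONE_DAY, now: Time.now, force: false)
  return true if force || date_accessed.nil?

  now - date_accessed >= shelf_life
end

now = Time.at(1_000_000)
recrawl?(now - 3600, now: now)              # => false (fetched an hour ago)
recrawl?(now - 2 * ONE_DAY, now: now)       # => true  (shelf life expired)
recrawl?(now - 3600, now: now, force: true) # => true  (explicit re-fetch)
```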

Dependencies

Cosa relies on the following Ruby gems.

Typhoeus, a library for running HTTP requests
```
  gem install typhoeus
```
Sequel, a database toolkit
```
  gem install sequel
```
Trollop, a command line option parser
```
  gem install trollop
```
RMagick, an interface between Ruby and the ImageMagick and GraphicsMagick image processing libraries
```
  gem install rmagick
```

RMagick requires the ImageMagick library, which must be installed separately.

Nokogiri, an HTML parser
```
  gem install nokogiri
```

If you run into difficulty, see Installing Nokogiri.

Installing Cosa

After downloading Cosa, run `gem build cosa.gemspec` followed by `sudo gem install cosa-0.3.1.gem` in the directory where Cosa is located.

Running Cosa

First, rename sample_config.yaml to config.yaml and modify it to meet your needs.

You have three options when running Cosa.

  cosa crawl

Resume crawling from the first item in the queue.

  cosa crawl http://www.example.com [-options]

Cosa will start at this address, and crawl every page on the site.

  cosa crawl http://www.example.com/directory/ /directory/page/ [-options]

Cosa will start at 'http://www.example.com/directory/', and then only add links to the queue if they contain the pattern 'http://www.example.com/directory/page'.
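The pattern restriction can be pictured with the sketch below. It assumes a plain substring match against the absolute URL, which may differ from how Cosa actually compares patterns; `links_to_enqueue` is a hypothetical helper written for this example.

```ruby
require "uri"

# Resolve each href against the page it was found on and keep only the
# URLs that contain `pattern`. With no pattern, every link is kept.
def links_to_enqueue(page_url, hrefs, pattern = nil)
  absolute = hrefs.map { |href| URI.join(page_url, href).to_s }
  return absolute if pattern.nil?

  absolute.select { |url| url.include?(pattern) }
end

hrefs = ["page/one.html", "/other/two.html", "/directory/page/three.html"]
links_to_enqueue("http://www.example.com/directory/", hrefs, "/directory/page/")
# => ["http://www.example.com/directory/page/one.html",
#     "http://www.example.com/directory/page/three.html"]
```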

Because Cosa stores the queue in the database, you can quit the program at any time; when you restart it, Cosa will resume where it left off.

Using the data Cosa generates

Cosa uses a simple database with the following four tables:

urls - Each URL linked to from the site. Contains: url, date_accessed, content_type, content_length, status, response (the entire HTTP response body), validation_type, and valid.
links - Stores the relationships between URLs. Once the crawl is complete, you can query this table to find all URLs a given URL links to, and all URLs that link to a given URL.
queue - Working list of URLs that still need to be crawled.
meta - List of IDs that correspond to URLs in the urls table. Contains details about those links.
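The two lookups that the links table supports can be illustrated in plain Ruby. This is only an in-memory sketch: the `[from_url, to_url]` row shape is an assumption, and the real table may store IDs or use different column names.

```ruby
# Hypothetical rows, shaped like a simplified links table: [from_url, to_url].
LINKS = [
  ["/", "/about"],
  ["/", "/contact"],
  ["/about", "/contact"]
].freeze

# All URLs that the given URL links to (compare the --from option).
def links_from(url, links = LINKS)
  links.select { |from, _to| from == url }.map { |_from, to| to }
end

# All URLs that link to the given URL (compare the --to option).
def links_to(url, links = LINKS)
  links.select { |_from, to| to == url }.map { |from, _to| from }
end

links_from("/")        # => ["/about", "/contact"]
links_to("/contact")   # => ["/", "/about"]
```

Against the real database, the same questions would be expressed as queries on the links table.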

Help

Usage: cosa [options] crawl OR crawl [starting_url] OR crawl [starting_url pattern]
            [-i] [-n] [-b] [-u] [-q] [-e] [-S/-V] [-v] [-h]
            [-a url_one url_two] [-c config_file]
            [-l type] [-x exception] [-o /path/to/snapshot]
            [-g date] [-r seconds] [-t URL] [-f URL] [-I URL]

Commands:
crawl                   : Start the crawler. Look above for examples of usage.

Options:
--init, -i              : Command-line tool for creating and saving a config file.
--add, -a <s+>          : Add a URL (or multiple URLs, separated by spaces) to the queue.
--config, -c <s>        : Run Cosa with a given config file. Otherwise, Cosa will use the default config if it exists.
--broken, -b            : List all URLs that contain broken links, and their broken links.
--abandoned, -n         : List all pages that are no longer linked to.
--exception, -x <s>     : Add a regex exception to the config file given with the -c flag.
--info, -I <s+>         : Get information about the given url(s).
--list, -l <s>          : List all URLs of the given type.
--age, -g <s>           : List all URLs that are older than the given date.
--queue, -q             : List all URLs in the queue.
--clear-queue, -e       : Empty the queue.
--response-time, -r <f> : List all URLs that took longer than <seconds> to respond.
--unresponsive, -u      : List all URLs that were not responsive.
--to, -t <s>            : List all URLs that link to the given URL.
--from, -f <s>          : List all URLs that the given URL links to.
--silent, -S            : Silence all output.
--snapshot, -o <s>      : Export the entire site from Cosa as an HTML snapshot to the given full path.
--verbose, -V           : Verbose output.
--version, -v           : Print version and exit.
--help, -h              : Show this message.

Cosa currently supports SQLite and MySQL.

Juan de la Cosa

We named Cosa after Juan de la Cosa.

He made the earliest extant European world map to incorporate the territories of the Americas that were discovered in the 15th century, sailed with Christopher Columbus on his first three voyages, and was the owner/captain of the Santa María.