Pa11ycrawler

pa11ycrawler is a Scrapy spider that runs a Pa11y check on every page of an Open edX installation, to audit it for accessibility purposes. It will store the result of each page audit in a data directory as a set of JSON files, which can be transformed into a beautiful HTML report.
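Put together, a typical end-to-end run with the defaults described in the sections below looks like this:

make install
scrapy crawl edx
pa11ycrawler-html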

Installation

pa11ycrawler requires Python 2.7+ and Node.js to be installed. You must also install phantomjs, which Pa11y uses to run accessibility tests against each page of the website.

make install
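If phantomjs is not already on your PATH, one common way to install it (a general suggestion, not a command from this repository's Makefile) is via npm:

npm install -g phantomjs-prebuilt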

Usage

scrapy crawl edx

There are several options for this spider that you can configure using Scrapy's -a flag.

Option Default Example
domain localhost scrapy crawl edx -a domain=edx.org
port 8000 scrapy crawl edx -a port=8003
email None scrapy crawl edx -a email=staff@example.com -a password=edx
password None (see above)
http_user None scrapy crawl edx -a http_user=grace -a http_pass=hopper
http_pass None (see above)
course_key course-v1:edX+Test101+course scrapy crawl edx -a course_key=org/course/run
pa11y_ignore_rules_file None scrapy crawl edx -a pa11y_ignore_rules_file=/tmp/ignore.yaml
pa11y_ignore_rules_url None scrapy crawl edx -a pa11y_ignore_rules_url=https://...
data_dir data scrapy crawl edx -a data_dir=~/pa11y-data
single_url None scrapy crawl edx -a single_url=http://localhost:8003/courses/

These options can be combined by specifying the -a flag multiple times. For example, scrapy crawl edx -a domain=courses.edx.org -a port=80.

If an email and password are not specified, then pa11ycrawler will use the "auto auth" feature in Open edX to create a staff user, and crawl as that user. Note that this assumes that the "auto auth" feature is enabled -- if not, the crawler won't be able to crawl without an email and password set.
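For example, to crawl a non-local installation as an existing staff user rather than relying on auto auth (the domain and credentials below are placeholders):

scrapy crawl edx -a domain=courses.example.com -a port=80 -a email=staff@example.com -a password=edx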

The http_user and http_pass arguments are used for HTTP Basic Auth. If either of these is unset, pa11ycrawler will not attempt to use HTTP Basic Auth.

The pa11y_ignore_rules_file and pa11y_ignore_rules_url arguments allow you to specify a YAML file, or the URL to a YAML file, containing pa11y ignore rules. These rules are used to indicate that certain output from pa11y has been manually checked, and can be safely ignored.

The data_dir option determines where the crawler saves its output. pa11ycrawler runs each page of the site through pa11y, encodes the result as JSON, and saves it as a file in this directory. The data directory is "data" by default, which means a directory named "data" is created in whatever directory you run the crawler from. Whichever directory you specify will be created automatically if it does not yet exist. The crawler never deletes data from the data directory, so if you want to clear it out between runs, that's your responsibility. There is a make clean-data task in the Makefile, which just runs rm -rf data.
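For example, to start each run from an empty data directory:

make clean-data
scrapy crawl edx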

The single_url option allows the spider to crawl only a single web page. The result is still processed through the pipeline, but the spider will not continue crawling afterwards.

Transform to HTML

This project comes with a script that can transform the data in the data directory into a pretty HTML table. The script is installed as pa11ycrawler-html and it accepts two optional arguments: --data-dir and --output-dir, which default to "data" and "html", respectively.
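For example, to read audit data from a custom location and write the report somewhere other than ./html (the paths below are placeholders, and the flag syntax assumes standard long options):

pa11ycrawler-html --data-dir ~/pa11y-data --output-dir ~/pa11y-report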

You can also run the script with the --help argument to get more information.

Cleaning Data & HTML

This project comes with a Makefile that has a clean-data task and a clean-html task. The former deletes the data directory in the current working directory, and the latter deletes the html directory in the current working directory; these are the default locations for pa11ycrawler's data and HTML. However, if you configure pa11ycrawler to output data and/or HTML to a different location, these tasks have no way of knowing where that is, and will not be able to remove it for you.

To remove data from the default location, run:

make clean-data

To remove HTML from the default location, run:

make clean-html

Running Tests

This project has tests for the pipeline functions, where the main functionality of this crawler lives. To run those tests, run py.test or make test. You can also run scrapy check edx to test that the scraper is scraping data correctly.
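For example, to run the unit tests and then the spider's contract checks:

make test
scrapy check edx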

License

Apache License 2.0

