crawler_operations

A module that represents an instance of the Crawler class as a NetworkX and Graphviz weighted digraph. These objects (instances of Crawler class) are generated by the 'crawler_f2.py' script, accessible here: https://github.com/mabhishetty/CrawlerFirstRelease

Motivation:

Although instances of the Crawler class contain all the information necessary to understand the network, reading the raw text alone is not very illustrative. Particularly when visiting large numbers of websites, you might want to see which links are common to several visited sites. Working this out by reading the lists of links from each visited site would be cumbersome, which is why a graphical representation is useful. This has been included as a separate repository because the function stands alone: if other scripts generate instances of Crawler, they too may use this module to draw the graphs.

Structure and terminology:

For this script, websites fall into two categories: 'sites_visited' and 'links'.

  • sites_visited: websites that the script visits and crawls
  • links: websites found as links on crawled sites. These websites are not themselves visited or crawled by the program.
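
To make the distinction concrete, here is a minimal sketch using only the standard library. The crawler_data dictionary is a hypothetical stand-in for a Crawler instance (the real attribute names in crawler_f2.py may differ); it also shows the multiplicity and 'repeated' bookkeeping described in the structure summary below:

```python
from collections import Counter

# Hypothetical stand-in for a Crawler instance: each site_visited maps
# to the raw (possibly repeated) list of links found on that page.
crawler_data = {
    "https://site-a.example": ["https://x.example", "https://y.example",
                               "https://x.example"],
    "https://site-b.example": ["https://y.example", "https://z.example"],
}

# Unique links per site_visited, with multiplicity (how many times each
# link appears on that page).
unique_links = {site: Counter(links) for site, links in crawler_data.items()}

# 'repeated': links reachable from more than one of the sites_visited.
seen_on = Counter(link for links in unique_links.values() for link in links)
repeated = {link for link, n in seen_on.items() if n > 1}

print(unique_links["https://site-a.example"]["https://x.example"])  # → 2
print(repeated)  # → {'https://y.example'}
```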

The script itself is quite simple; a summary of its structure:

  • Change the format of the crawler data: make a list of unique links from each site_visited and record each link's multiplicity
  • Create a set, 'repeated', of links that can be reached from more than one of the site_visited websites
  • Add site_visited nodes
  • Begin main loop
    • Take a visited site
    • Check that it doesn't already have a node
      • If it does, add only an edge
    • Check that the site isn't in 'repeated'
      • If it is, give it a special colour
    • Add nodes for this site - with appropriate colour, shapes and tooltip links
    • Add directed edges between the site_visited and each of those links
  • Repeat loop
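
The loop above can be sketched with NetworkX roughly as follows. The data layout, colour palette, and attribute names here are illustrative assumptions, not the module's actual implementation:

```python
import networkx as nx

# Hypothetical crawl data: site_visited -> {link: multiplicity}.
crawl = {
    "site-1": {"link-a": 2, "link-b": 1},
    "site-2": {"link-b": 1, "link-c": 3},
}
palette = {"site-1": "red", "site-2": "blue"}  # one colour per site_visited

# Links reachable from more than one site_visited get a special colour.
repeated = {link for link in set().union(*crawl.values())
            if sum(link in links for links in crawl.values()) > 1}

G = nx.DiGraph()
for site, links in crawl.items():
    G.add_node(site, color=palette[site], shape="box")   # site_visited node
    for link, mult in links.items():
        if link not in G:                 # node exists already: add only an edge
            colour = "white" if link in repeated else palette[site]
            G.add_node(link, color=colour, tooltip=link)
        G.add_edge(site, link, color=palette[site], label=mult)

print(G.nodes["link-b"]["color"])  # 'white': reachable from both sites
```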

Usage:

To use this script, just import crawler_operations and call crawler_operations.crawler2networkx(crawler). The function takes an instance of the Crawler class and returns nothing, but it does save a '.svg' file containing the digraph.

In addition, a number of modules are required for the operation of this script:

  • matplotlib.pyplot : used by NetworkX for drawing
  • networkx : creates an empty DiGraph and provides operations on the graph
  • math : provides the 'floor' function, used to round down after division
  • numpy : element-wise multiplication
  • networkx.drawing.nx_agraph -> to_agraph : converts the NetworkX representation to a Graphviz AGraph
  • pygraphviz : helps with visualising the graph

To install these, I ran:

  • python3.6 -m pip install matplotlib
  • python3.6 -m pip install networkx

For pygraphviz, install Homebrew, then Graphviz, then pygraphviz. I ran:
  • /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)" (installing homebrew)
  • brew install graphviz
  • python3.6 -m pip install pygraphviz

If you are using Jupyter Notebook:

  • sudo ipython -m pip install pygraphviz

Example:

Here are two examples of the current output of the function:

The first:

The second:

  • Click the image to see a more detailed view.
  • These are in .png format, not .svg, to protect data.
    • As part of my crawling policy (explained here: https://www.mycustomcrawlerexplanations.com), I have tried not to publish any website-specific data.
    • Therefore the title is not displayed in the .png, and tooltips are not available to view.
  • When not used for examples, however, the script generates .svg files.

Features

The key features of the .svg graph are as follows:

  • A title gives the starting site and the number of sites visited
  • The key explains that:
    • For colours:
      • A white node represents a site that may be reached from more than one of the 'sites_visited'
      • Any other-coloured node is linked from only one of the sites_visited; the colour of the node matches the colour of that site_visited
      • Edges all take the colour of the site_visited from which they exit
    • For shapes:
    • For numbers:
      • Each of the sites_visited has an integer as its node label (the order in which it was seen)
      • The other node labels show the order in which links were first uniquely found.
        • As the picture shows, 1's links reach very high numbers, whereas 2's links reach much lower numbers despite 2 having a lot of links. This is because the majority of 2's links were already found when the crawler was on site 1. (This does not include the sites_visited, which are numbered before any other sites.)
  • Tooltips for:
    • Edges: hover over an edge to see its multiplicity, i.e. how many times that exact link is found on the site_visited, which saves repeating the same edge multiple times
    • Nodes: hover over a node to see the URL it represents
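
In NetworkX, such tooltips can be stored as ordinary node and edge attributes, which to_agraph then copies across to the Graphviz representation; Graphviz's 'tooltip' attribute becomes hover text in SVG output. A minimal sketch, with made-up attribute values:

```python
import networkx as nx

G = nx.DiGraph()
# NetworkX stores 'tooltip' as a plain attribute; when the graph is
# converted with networkx.drawing.nx_agraph.to_agraph, the attribute is
# passed through to Graphviz, which renders it as SVG hover text.
G.add_node(1, tooltip="https://site-1.example")   # node tooltip: the url
G.add_node(2, tooltip="https://link-a.example")
G.add_edge(1, 2, tooltip="multiplicity: 3")       # edge tooltip

print(G.nodes[1]["tooltip"])  # → https://site-1.example
print(G[1][2]["tooltip"])     # → multiplicity: 3
```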

Known Issues/To do:

  • There was a problem with node fill colours being overwritten when nodes have multiple connections (solved)
  • The same problem occurred with node labels, resulting in some odd numbering (solved)
  • Want to add a key to explain the symbols (solved - though not ideal)
  • Ideally there will be some interactivity in the plot where edges could be highlighted for easier understanding of the data (solved - but not ideal)
  • Could implement eventListeners to highlight subgraphs of particular sites_visited

Contact:

By email: manojabhishetty@gmail.com

Further reading:
