okwilkins / Web-Crawler

Python Web Crawler

Created by Oliver Wilkins

19/03/2018

This program will crawl through entire domains, exporting every link it can find into a txt file.

Installing/Running the Project

You will not need to download any libraries; just plug in and play by:

  • Downloading or cloning the repository
  • Running the main.py file
  • Links which the program saves can be found in the queued.txt and crawled.txt files in the projects folder, which already contains example projects with their own queued.txt and crawled.txt files

Important

  • This program works by reading a web page and extracting its links into the queued.txt file. When the program gets round to a queued link, it reads that page for further links and then moves the completed (crawled) URL into the crawled.txt file (a minimal sketch of this cycle follows this list)
  • You can try to trawl through massive domains with many links; however, this will take a VERY long time
  • Also note that you may need to change the NUMBER_OF_THREADS variable in the main.py file (line 12); the best value will depend on your operating system
NUMBER_OF_THREADS = 8
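
To make the cycle concrete, here is a minimal sketch of such a queue-and-crawl loop using only the Python standard library. It is not the repository's actual code: the seed URL, output paths, and helper names (LinkParser, fetch_links, worker) are illustrative assumptions, and the real main.py keeps its queue in queued.txt on disk rather than in memory.

import threading
from html.parser import HTMLParser
from queue import Queue
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

NUMBER_OF_THREADS = 8  # as in main.py; tune for your OS/hardware


class LinkParser(HTMLParser):
    """Collect the absolute URL of every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))


def fetch_links(url):
    """Download one page and return the links found on it."""
    with urlopen(url, timeout=10) as response:
        # errors="replace" sidesteps the decoding crashes noted below
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkParser(url)
    parser.feed(html)
    return parser.links


def worker(work, seen, crawled, domain, lock):
    while True:
        url = work.get()
        try:
            for link in fetch_links(url):
                with lock:
                    if urlparse(link).netloc == domain and link not in seen:
                        seen.add(link)  # queue each same-domain link once
                        work.put(link)
            with lock:
                crawled.add(url)
        except Exception:
            pass  # a bad URL must not shut down the thread
        finally:
            work.task_done()


def crawl(seed):
    domain = urlparse(seed).netloc
    work, lock = Queue(), threading.Lock()
    seen, crawled = {seed}, set()
    work.put(seed)
    for _ in range(NUMBER_OF_THREADS):
        threading.Thread(
            target=worker, args=(work, seen, crawled, domain, lock), daemon=True
        ).start()
    work.join()  # returns once every queued URL has been processed
    return seen, crawled


if __name__ == "__main__":
    seen, crawled = crawl("https://example.com/")  # illustrative seed
    # queued.txt holds links found but not yet crawled; crawled.txt the rest
    with open("queued.txt", "w") as f:
        f.write("\n".join(sorted(seen - crawled)))
    with open("crawled.txt", "w") as f:
        f.write("\n".join(sorted(crawled)))

Note how each worker swallows fetch errors instead of letting them propagate: an unhandled exception in a thread is exactly the failure mode listed under "Updates for the Future" below.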

Updates for the Future

  • Add a tree view for all the links found
  • Reduce the number of decoding errors
  • Fix some URLs completely shutting down threads and ultimately the whole program. This issue is described in detail here
  • Create a nicer output to the console + a GUI
