jjackson37 / WebMapper

Website mapper that takes a single URL.

URL Extraction - Filtering

jjackson37 opened this issue · comments

Once the URLs have been retrieved by the RegEx, they will most likely need to be filtered.
I think separating the filters out into their own classes would be the best approach here. I can then create objects of them in a collection and loop the URL collection through them all.
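As a rough illustration of that idea (one class per filter, held in a collection and applied in turn), here is a minimal Python sketch. The project's own language and class names are not shown in this thread, so `UrlFilter`, `DuplicateFilter`, `NonEmptyFilter`, and `run_filters` are all hypothetical names chosen for the example.

```python
from typing import Iterable, List, Protocol


class UrlFilter(Protocol):
    """Hypothetical interface: takes a list of URLs, returns the ones to keep."""

    def apply(self, urls: List[str]) -> List[str]:
        ...


class DuplicateFilter:
    """Drops repeated URLs, keeping the first occurrence and the original order."""

    def apply(self, urls: List[str]) -> List[str]:
        # dict keys preserve insertion order, so this de-duplicates stably.
        return list(dict.fromkeys(urls))


class NonEmptyFilter:
    """Drops empty or whitespace-only strings."""

    def apply(self, urls: List[str]) -> List[str]:
        return [url for url in urls if url.strip()]


def run_filters(urls: Iterable[str], filters: Iterable[UrlFilter]) -> List[str]:
    """Passes the URL collection through every filter, in order."""
    remaining = list(urls)
    for url_filter in filters:
        remaining = url_filter.apply(remaining)
    return remaining


if __name__ == "__main__":
    found = ["http://a.com/", "", "http://a.com/", "http://a.com/x"]
    print(run_filters(found, [NonEmptyFilter(), DuplicateFilter()]))
```

Because each filter only sees a list and returns a list, new filters can be added to the collection without touching the loop, and their order can be changed freely.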

I can think of two filters that we need so far:

  1. Error/incorrect URLs (this one might need to run before the incomplete URL building?)
  2. URLs that link to different domains
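The two filters above could look something like the following Python sketch. It is only an assumption about how "error" and "different domain" might be defined: here a URL counts as an error unless it is an absolute http(s) URL with a host, and "same domain" means an exact host match with the start URL. The names `host_of`, `MalformedUrlFilter`, and `SameDomainFilter` are hypothetical.

```python
from typing import List, Optional
from urllib.parse import urlsplit


def host_of(url: str) -> Optional[str]:
    """Returns the lower-cased host of an absolute http(s) URL, else None."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            return None
        return parts.hostname or None
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # for example an unclosed IPv6 bracket.
        return None


class MalformedUrlFilter:
    """Keeps only URLs for which a host could be extracted."""

    def apply(self, urls: List[str]) -> List[str]:
        return [url for url in urls if host_of(url) is not None]


class SameDomainFilter:
    """Keeps only URLs whose host exactly matches the start URL's host."""

    def __init__(self, start_url: str) -> None:
        self._host = host_of(start_url)

    def apply(self, urls: List[str]) -> List[str]:
        if self._host is None:
            return []
        return [url for url in urls if host_of(url) == self._host]


if __name__ == "__main__":
    found = ["http://example.com/a", "/about", "mailto:x@example.com",
             "http://[bad", "https://other.org/"]
    valid = MalformedUrlFilter().apply(found)
    print(SameDomainFilter("http://example.com/").apply(valid))
```

Note that this sketch treats a relative link such as `/about` as an error, so with this definition it would have to run after the incomplete URLs have been built, not before. It also treats `www.example.com` and `example.com` as different hosts.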
  • Create filter data structure and interface
  • Create media file filter
  • Create duplicate address filter
  • Create domain URL filter
  • Create error URLs filter
  • Create 404 filter (not sure about this one yet)

  1. Filter out unnecessary media content such as images, videos, and sound files.

The list of types could be managed in a config for now?
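One way the config-driven media filter might work is sketched below in Python. The config format is an assumption made for the example (a JSON object with a `media_extensions` list); `EXAMPLE_CONFIG`, `MediaFileFilter`, and `media_filter_from_config` are hypothetical names, and the extension list is only illustrative.

```python
import json
from pathlib import PurePosixPath
from typing import Iterable, List
from urllib.parse import urlsplit

# Assumed config shape; in practice this text would be read from a file.
EXAMPLE_CONFIG = '{"media_extensions": [".jpg", ".png", ".gif", ".mp4", ".mp3"]}'


class MediaFileFilter:
    """Drops URLs whose path ends in one of the configured file extensions."""

    def __init__(self, extensions: Iterable[str]) -> None:
        self._extensions = {ext.lower() for ext in extensions}

    def apply(self, urls: List[str]) -> List[str]:
        kept = []
        for url in urls:
            # Look only at the path so query strings and fragments are ignored.
            suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
            if suffix not in self._extensions:
                kept.append(url)
        return kept


def media_filter_from_config(config_text: str) -> MediaFileFilter:
    """Builds the filter from the JSON config text."""
    return MediaFileFilter(json.loads(config_text)["media_extensions"])


if __name__ == "__main__":
    media_filter = media_filter_from_config(EXAMPLE_CONFIG)
    print(media_filter.apply(["http://a.com/logo.PNG?size=2",
                              "http://a.com/page.html"]))
```

Matching on the extension is cheap but approximate: media served from a URL without a file extension would pass through this filter.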

  1. Filter out pages that are unresponsive (resulting in a 404, etc.), though this could be an expensive process.
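A possible shape for such a filter is sketched below in Python, under a few assumptions: a HEAD request is used instead of a full download to keep the cost down, each URL is checked at most once per filter instance, and any status of 400 or above (or no answer at all) counts as unresponsive. `head_status` and `ResponsiveFilter` are hypothetical names. Some servers reject HEAD requests, so a real implementation might need a GET fallback.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Optional


def head_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Returns the HTTP status of a HEAD request, or None if nothing answered."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions that carry the status code.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        return None


class ResponsiveFilter:
    """Keeps URLs that answer with a status below 400, caching each lookup."""

    def __init__(self, get_status: Callable[[str], Optional[int]] = head_status) -> None:
        self._get_status = get_status
        self._cache: Dict[str, Optional[int]] = {}

    def apply(self, urls: List[str]) -> List[str]:
        kept = []
        for url in urls:
            if url not in self._cache:
                self._cache[url] = self._get_status(url)
            status = self._cache[url]
            if status is not None and status < 400:
                kept.append(url)
        return kept
```

The status lookup is passed in as a function so the filter can be exercised without network access, for example `ResponsiveFilter(lambda url: 404)` rejects everything. Running this filter last, after the cheaper filters have shrunk the list, would also limit the number of requests.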

The above commit 450be63 actually references #2, not this issue.

Pushed duplicate and media removal filters under 5706d93

We need configs to store the media types to remove.