jjackson37 / WebMapper

Website mapper that takes a single URL.

URL Extraction - Build incomplete URLs

jjackson37 opened this issue · comments

Need a way to convert incomplete URLs into usable ones, e.g. "/Foo/Bar" to "example.com/Foo/Bar".
The process needs to identify these URLs and convert them without affecting URLs that are already complete.
One option is to separate this out into classes, each using a RegEx to identify one of these URL types.
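
A minimal sketch of the conversion step, written in Python for illustration only (the project itself may use a different language). It assumes the URL of the page being mapped is available as a base and includes its scheme; `build_url` is a hypothetical name, not part of WebMapper.

```python
from urllib.parse import urljoin, urlsplit


def build_url(base_url: str, found_url: str) -> str:
    """Resolve a URL found on a page against the page's own URL.

    Hypothetical helper: URLs that already carry a scheme are treated as
    complete and returned untouched; everything else is joined to the base.
    """
    if urlsplit(found_url).scheme:
        # Already complete (http, https, mailto, ...): leave it alone.
        return found_url
    # urljoin handles root-relative ("/Foo/Bar"), page-relative ("c.html")
    # and protocol-relative ("//host/path") forms.
    return urljoin(base_url, found_url)
```

For example, `build_url("https://example.com/index.html", "/Foo/Bar")` gives `"https://example.com/Foo/Bar"`, while a URL that is already complete is passed through unchanged.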

  • HTTP/HTTPS check
  • Top-level domain check
  • Incomplete URL check
  • Data structure / interface for checks
  • Source for top-level domain check
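
One possible shape for the checks listed above, sketched in Python for illustration. The names `UrlKind` and `classify` and both regular expressions are assumptions, not existing WebMapper code. Each rule pairs a RegEx with the URL type it identifies, and anything that matches no rule is treated as incomplete.

```python
import re
from enum import Enum


class UrlKind(Enum):
    """Hypothetical URL types returned by the checks."""
    COMPLETE = "complete"
    MISSING_SCHEME = "missing_scheme"
    INCOMPLETE = "incomplete"


# Ordered (pattern, kind) pairs; the first pattern that matches wins.
_RULES = [
    # Starts with http:// or https://
    (re.compile(r"^https?://", re.IGNORECASE), UrlKind.COMPLETE),
    # Looks like a dotted host name followed by a path, port, query or nothing
    (re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+(?=[/:?#]|$)", re.IGNORECASE),
     UrlKind.MISSING_SCHEME),
]


def classify(url: str) -> UrlKind:
    """Return the first matching kind, or INCOMPLETE if nothing matches."""
    for pattern, kind in _RULES:
        if pattern.match(url):
            return kind
    return UrlKind.INCOMPLETE
```

A pattern alone cannot tell a host such as "example.com" from a relative file name such as "bar.html"; this is where the top-level domain check would have to decide.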

We also need to handle URLs that are missing HTTP or HTTPS.

This is mainly for QoL reasons, but we could also extend it to detect and fix errors in URLs such as mistyped schemes; "htp//" for example.
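
A rough Python sketch of how a missing or mistyped scheme might be repaired. `ensure_scheme`, the mistype pattern and the "https" default are all assumptions made for illustration; the pattern only covers simple slips such as "htp//" or "htps:/" and will not catch every typo.

```python
import re

# A correctly formed http/https prefix.
_VALID = re.compile(r"^https?://", re.IGNORECASE)

# Hypothetical pattern for scheme-like prefixes with dropped or doubled
# letters and a damaged separator, e.g. "htp//", "htps:/", "http:/".
# Group 1 records whether an "s" was present.
_MISTYPED = re.compile(r"^h+t+p+(s?)[:;]?/+", re.IGNORECASE)


def ensure_scheme(url: str, default: str = "https") -> str:
    """Return url with a well-formed http(s) scheme at the front."""
    url = url.strip()
    if _VALID.match(url):
        return url  # already fine, do not touch it
    match = _MISTYPED.match(url)
    if match:
        scheme = "https" if match.group(1) else "http"
        return f"{scheme}://{url[match.end():]}"
    # No scheme at all: prepend the assumed default.
    return f"{default}://{url}"
```

Whether the default should be "http" or "https" is an open design choice; a site that only serves one of them would fail with the other.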

Need to check that the input has a top-level domain (.com, .uk, .gov, etc.).

@jjackson37 how should we store these? New domains are added often, so the list would need to live in a config file or, eventually, a DB.
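
As a sketch of the config option, the check could read the known domains from a plain-text list, one per line, so the list can be updated without a code change. This is Python for illustration only; `load_tlds`, `has_known_tld` and the file format (blank lines and "#" comments ignored) are assumptions.

```python
from urllib.parse import urlsplit


def load_tlds(lines):
    """Build a set of lower-case TLDs from an iterable of text lines."""
    tlds = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        tlds.add(entry.lower().lstrip("."))
    return tlds


def has_known_tld(url: str, tlds) -> bool:
    """True if the host part of url ends in one of the known TLDs."""
    # urlsplit only recognises a host after "//", so add it when the
    # input has neither a scheme separator nor a leading "//".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname or ""
    parts = host.rsplit(".", 1)
    return len(parts) == 2 and parts[1] in tlds
```

Usage might look like `tlds = load_tlds(open("tlds.txt"))` followed by `has_known_tld("example.com/Foo/Bar", tlds)`, where "tlds.txt" is a hypothetical config file. A DB-backed source could later replace `load_tlds` without changing the check itself.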

A RetrieveFile method has been added to WebClientPageRetriever. It takes a file URL and the directory in which to store the file, then downloads the file into that directory.