lanmaster53 / recon-ng

Open Source Intelligence gathering tool aimed at reducing the time spent harvesting information from open sources.

Comparison of valid TLDs for hosts_to_domains

elreydetoda opened this issue · comments

First, thanks for an awesome tool! I really appreciate using it, and I've been learning with and about it since I was in college! 🙂 Also, I wanted to mention up front that I'll happily do a PR myself (I've already done this locally and will post my code below), but I wanted to submit an issue first to check my logic and see whether you'd like to have the PR.

So, when using the recon/hosts-domains/migrate_hosts module, one thing I've noticed is that when I run across customers that have hosts in other countries (e.g. in the UK), they'll have something like .com.uk at the end of their domain. When running the aforementioned module, it'll actually make the domain entry com.uk in the domains table (which means it didn't insert my client's domain, which was prepended to that). Then, if I use a recon/domain-* module, it'll run against the whole of the com.uk domain when their domain is really example.com.uk.

So, my thought was to pull the list of valid TLDs from here (directed to that page from here), and then check how many valid TLDs are at the end of the domain. Once you do that, you can add some more conditions to check whether only one element (or part of the domain) is left in front of the valid TLDs, and if so add the domain to the list of domains as is, i.e. example.com.uk instead of com.uk.

So, this is the area of code that I modified (the original first, then my version with a helper function added above it, plus one small addition to the __init__ function):

def hosts_to_domains(self, hosts, exclusions=[]):
    domains = []
    for host in hosts:
        elements = host.split('.')
        # recursively walk through the elements
        # extracting all possible (sub)domains
        while len(elements) >= 2:
            # account for domains stored as hosts
            if len(elements) == 2:
                domain = '.'.join(elements)
            else:
                # drop the host element
                domain = '.'.join(elements[1:])
            if domain not in domains + exclusions:
                domains.append(domain)
            del elements[0]
    return domains
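
To make the problem concrete, here is a minimal standalone sketch of the stock logic above. It is the same loop with `self` dropped so it runs outside the framework; the host names are made-up examples.

```python
def hosts_to_domains(hosts, exclusions=()):
    """Standalone copy of the stock logic, for illustration only."""
    domains = []
    for host in hosts:
        elements = host.split('.')
        while len(elements) >= 2:
            if len(elements) == 2:
                # a two-label name is kept whole
                domain = '.'.join(elements)
            else:
                # otherwise the leftmost label is dropped
                domain = '.'.join(elements[1:])
            if domain not in domains and domain not in exclusions:
                domains.append(domain)
            del elements[0]
    return domains


# a client domain stored as a host: only the suffix survives
print(hosts_to_domains(['example.com.uk']))      # ['com.uk']
# with a host label in front, the suffix is still added next to the real domain
print(hosts_to_domains(['www.example.com.uk']))  # ['example.com.uk', 'com.uk']
```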

from typing import List

class BaseModule:

    def __init__(self, params):
        # added this so it can cache the TLDs after its first run and doesn't try to load them every time.
        #   Also, using a set, because that'll ensure there are no duplicates as well as make for fast lookups
        self._valid_tlds = set()

    def tld_nums(self, host_sections: List[str]) -> int:
        """
        this checks how many tlds the domain has and returns
        that amount, so the host doesn't end up only being tlds
        """
        if len(self._valid_tlds) == 0:
            with open('/usr/share/recon-ng/tlds-alpha-by-domain.txt') as tld_f:
                self._valid_tlds = { tld.lower().strip() for tld in tld_f }
        # count consecutive valid TLD sections, starting from
        #   the rightmost section of the host
        counter = 0
        for section in reversed(host_sections):
            if section.lower() not in self._valid_tlds:
                return counter
            counter += 1
        # every section was a valid TLD (e.g. the host is just com.uk)
        return counter

    def hosts_to_domains(self, hosts, exclusions=[]):
        domains = []
        for host in hosts:
            elements = host.split('.')
            tld_count = self.tld_nums(elements)
            # recursively walk through the elements
            # extracting all possible (sub)domains
            while (len(elements) >= 2) and (len(elements) > tld_count):
                # account for domains stored as hosts
                if (len(elements) == 2) or (len(elements) == (tld_count + 1)):
                    domain = '.'.join(elements)
                else:
                    # drop the host element
                    domain = '.'.join(elements[1:])
                if domain not in domains + exclusions:
                    domains.append(domain)
                del elements[0]
        return domains
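
As a quick check of the logic, here is a standalone sketch of the same idea. It is an illustration rather than the patch itself: `VALID_TLDS` is a tiny hardcoded stand-in for the contents of the TLD file, and `count_tlds` and `hosts_to_domains` are module-level versions of the methods above.

```python
from typing import Iterable, List

# Assumption: tiny stand-in for the full TLD list loaded from the .txt file.
VALID_TLDS = {'com', 'uk', 'org'}


def count_tlds(sections: List[str]) -> int:
    """Count consecutive valid TLD labels, reading from the right."""
    counter = 0
    for section in reversed(sections):
        if section.lower() not in VALID_TLDS:
            break
        counter += 1
    return counter


def hosts_to_domains(hosts: Iterable[str], exclusions=()) -> List[str]:
    domains = []
    for host in hosts:
        elements = host.split('.')
        tld_count = count_tlds(elements)
        # stop once only TLD labels would be left
        while len(elements) >= 2 and len(elements) > tld_count:
            if len(elements) == 2 or len(elements) == tld_count + 1:
                # one label in front of the TLDs: keep the name whole
                domain = '.'.join(elements)
            else:
                # drop the host element
                domain = '.'.join(elements[1:])
            if domain not in domains and domain not in exclusions:
                domains.append(domain)
            del elements[0]
    return domains


print(hosts_to_domains(['www.example.com.uk', 'example.com.uk']))  # ['example.com.uk']
print(hosts_to_domains(['mail.example.org']))                      # ['example.org']
print(hosts_to_domains(['com.uk']))                                # []
```

With this version, com.uk never reaches the domains table on its own, and a host that consists only of TLD labels yields nothing.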

If I were to do a PR, I'd be curious whether I should include the .txt of TLDs plus a way to refresh it (e.g. a requests.get to that URL), or whether that's something that happens when the package gets distributed?
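
A refresh could look something like the following rough sketch. Several things here are assumptions: the URL is IANA's published TLD list (assumed to be the same file, given the filename used above), `parse_tld_list` and `refresh_tld_file` are hypothetical helper names, and the download uses `urllib.request` from the standard library, though `requests.get` would do the same job.

```python
import urllib.request
from typing import Set

# Assumption: IANA's published list, matching the filename used above.
TLD_URL = 'https://data.iana.org/TLD/tlds-alpha-by-domain.txt'


def parse_tld_list(text: str) -> Set[str]:
    """Return lowercase TLDs, skipping blank lines and '#' comment lines."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith('#')
    }


def refresh_tld_file(dest: str, url: str = TLD_URL, timeout: float = 10.0) -> Set[str]:
    """Download the list, write it to dest, and return the parsed set."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode('utf-8')
    with open(dest, 'w', encoding='utf-8') as tld_f:
        tld_f.write(text)
    return parse_tld_list(text)
```

Keeping the parsing separate from the download means the parser can be tested without network access, and a copy of the file shipped with the package would go through the same code path.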

All module related issues, including dependencies, should be raised in the module repository.