Vector Space Model for DNS (for short, DNS-VSM) is a set of pre-trained vectors (a.k.a embeddings) for 40000 Internet domain names. These embeddings were built as part of the work presented in [1], [2] and [3].
Domain names in the DNS-VSM are represented by vectors where related domain names are mapped to nearby points in the high dimensional space. The DNS-VSM was built only using information of DNS queries (trained on a similar dataset to the one used in [1]) without any other previous knowledge about the content hosted in each domain.
DNS embeddings can be useful in many engineering activities, with practical application in many areas. Some examples include websites recommendations based on similar sites, competitive analysis, identification of fraudulent or risky sites, parental-control systems, UX improvements (based on recommendations, spell correction, etc.), click-stream analysis, representation and clustering of users navigation profiles, optimization of cache systems in recursive DNS resolvers, anomaly detection in network traffic analysis (among others).
It's important to note that many domains could have disappeared and many others could have been created in the last years. Also, since the DNS queries used for this project are not global (data used comes from only one ISP), then there could be some kind of bias on similar sites. For these two reasons, it's strongly recommended to use these vectors just for academic purposes and not for any production environment.
The pre-trained vectors for the DNS-VSM can be downloaded from here. Download and unzip the content to the models folder.
The folder structure should looks like:
-> models
-> ft
-> 21epoc_minn11_maxn17
-> 21epoc_minn11_maxn17.wv.syn0_ngrams.npy
This project was tested using python 3.6.4 and it requires gensim 3.1.0 (obs: newer versions of gensim may not work).
python 3.6.4
gensim 3.1.0
scipy 1.2.1
First, create the virtual environment
pyenv virtualenv 3.6.4 dns-vsm
Now, activate the new environment
pyenv activate dns-vsm
Once the environment is activated, install dependencies
pip install gensim==3.1.0
pip install scipy==1.2.1
Open a terminal, activate the dns-vsm virtual environment and type python to enter to the Python's interactive mode.
The DNS-VSM uses the gensim's wrapper for FastText, so in order to use the DNS-VSM you need to import the FastText wrapper as follows:
from gensim.models.wrappers.fasttext import FastText as ft
Now, you can load the pre-trained vectors in this way:
dns_embeddings = ft.load('models/ft/21epoc_minn11_maxn17')
Finally, you can use the DNS-VSM to query for similar domains:
dns_embeddings.most_similar('subrayado.com.uy', topn=12)
You should see the following output:
[('subrayado.com', 0.9160100221633911),
('diariolarepublica.net', 0.8355216979980469),
('eldiario.com.uy', 0.807044267654419),
('lr21.com.uy', 0.7994014024734497),
('teledoce.com', 0.7916869521141052),
('elecodigital.com.uy', 0.7739754915237427),
('causaabierta.com.uy', 0.7702589631080627),
('unoticias.com.uy', 0.7664780616760254),
('radiouruguay.com.uy', 0.7660830020904541),
('uypress.net', 0.742232084274292),
('sangregoriodepolancodigital.com.uy', 0.7303887605667114),
('vivomontevideo.com', 0.710254430770874)]
The following table analyzes the most similar sites to subrayado.com.uy (TV news).
Domain name | Type | Cosine distance | Observations |
---|---|---|---|
subrayado.com | Non existent domain | 0.961 | Same domain, but without country code ‘uy’ |
diariolarepublica.net | press, newspaper | 0.836 | Alias for larepublica.com.uy |
eldiario.com.uy | press, newspaper | 0.807 | |
lr21.com.uy | press, newspaper | 0.799 | |
teledoce.com | press, tv news | 0.792 | |
elecodigital.com.uy | press, newspaper | 0.774 | |
causaabierta.com.uy | - | 0.77 | Domain does not exist anymore (Jan-2019) |
unoticias.com.uy | press, newspaper | 0.766 | |
radiouruguay.com.uy | press, radio, newspaper | 0.766 | |
uypress.net | press, newspaper | 0.742 | |
sangregoriodepolancodigital.com.uy | press, newspaper | 0.73 | Domain does not exist anymore (Jan-2019) |
vivomontevideo.com | - | 0.71 | Domain does not exist anymore (Jan-2019) |
Most similar sites to subrayado.com.uy (TV news site)
Table above gives strong evidence about the model’s capability for capturing semantic information about domain names. Semantic similarity between domain names can be helpful in many scenarios, for example in the example above for recommending similar semantically related sites.
Other interesting use case where semantic similarity can be helpful could be for filtering adult content as part of a parental control system.
Suppose you know some content that you want to filter but not all of the them.
In that case you can find and filter contents that are similar to some specific sites like in the following example:
dns_embeddings.most_similar('pornhub.com', topn=10)
Domain name | Type | Cosine distance |
---|---|---|
youporn.com | adult website | 0.879 |
phncdn.com | adult website | 0.84 |
tube8.com | adult website | 0.795 |
youporn.com.es | adult website | 0.758 |
videospornhub.com | adult website | 0.708 |
xxxcupid.com | adult website | 0.696 |
german-youporn.com | adult website | 0.696 |
pornhubpremium.com | adult website | 0.693 |
genericlink.com | - | 0.687 |
youporngay.com | adult website | 0.68 |
Most similar sites to pornhub.com (an adult specific content site)
One of the most beautiful thing about word embeddings (in particular those embeddings that were trained using pedictive shallow neural network models) is analogical reasoning.
Analogical reasoning with word embeddings allows to apply simple arithmetic operations with the vector representations of words and find complex relationships between them.
There is a famous example when working with word embeddings where embedding("queen") is approximated as embedding("King") + embedding("Woman") - embedding("Man"). In other words, adding the vectors associated with the words king and woman while subtracting man is equal (or very similar) to the vector associated with queen.
This hidden linear structure of the vector space model for word embeddings can also be used in the DNS-VSM in a similar way to find complex relationships between domain names.
For example:
dns_embeddings.most_similar(positive=['atlantida.com.uy', 'maldonado.gub.uy'], negative=['canelones.gub.uy'], topn=3)
v1 | v2 | v3 | v1 + v2 - v3 |
---|---|---|---|
atlantida.com.uy (site related to Atlantida, the main resort in Canelones city) |
maldonado.gub.uy (site for the Maldonado city government) |
canelones.gub.uy (site for the Canelones city government) |
puntaweb.com puntadeleste.com (sites related to Punta del Este, the main resort in Maldonado city) |
Other example of analogical reasoning:
dns_embeddings.most_similar(positive=['puntashopping.com.uy', 'montevideo.gub.uy'], negative=['maldonado.gub.uy'], topn=3)
v1 | v2 | v3 | v1 + v2 - v3 |
---|---|---|---|
puntashopping.com.uy (site for a shopping center in Maldonado city) |
montevideo.gub.uy (site for the Montevideo city government) |
maldonado.gub.uy (site for the Maldonado city government) |
tiendasmontevideo.com montevideoshopping.com.uy (sites for shopping centers in Montevideo city) |
The previous examples show 2 of the 3 domain names nearest to the resulting vector v1 + v2 − v3. Analogical reasoning could be helpful for understanding complex relationships between domain names.
The DNS-VSM was built using character n-grams between 11 and 17 characters. When a domain name does not have a vector representation in the DNS-VSM but shares some of its n-grams with some other domain name that is part of the DNS-VSM, then the DNS-VSM can approximate its vector representation and use it for all common operations as if it were part of the original DNS-VSM. We can ilustrate this better through an example.
dns_embeddings.most_similar('samtanderuniversidades.con.uy', topn=9)
Domain name | Type | Cosine distance | Observations |
---|---|---|---|
santanderuniversidades.com.uy | banking | 0.995 | This is the real site |
bancamovilsantander.com.uy | banking | 0.953 | |
santander.com.uy | banking | 0.918 | |
multidiscount.net | banking | 0.811 | |
bcu.gub.uy | banking | 0.808 | |
discbank.com.uy | banking | 0.750 | |
browserforthebetter.com | - | 0.785 | Domain does not exist anymore (Feb-2019) |
brou.com.uy | banking | 0.751 | |
nbc.com.uy | banking | 0.749 | Domain does not exist anymore (Feb-2019) |
Most similar sites for an oov domain name.
Support for out-of-vocabulary (OOV) domain names, could be helpful for identifying domain names that for some reason are incorrect (like in the previous example), and also to find the correct match for them. A domain name could be bad formed because of many reasons, for example because it was typed incorrectly with a typo or because a harmful software shows a bad formed url intentionally (for example typosquatted domains or IDN homograph attacks) trying to deceive a user to redirect him/her to a website that looks identically to the original one but generally designed to steal user credentials, banking and credit card details (a.k.a phishing).
You can check this Jupyter Notebook with the code used in these examples and some others. You can also play with the DNS-VSM model in this Google Colab Notebook directly in the browser without any local environment configuration.
You can check this video (spanish) with the presentation of the DNS-VSM model during the Python's conference in Colombia on February 2020 (PyCon 2020).
If you use the DNS-VSM for any purpose, please cite the following works:
[1] W. Lopez, "Vector representation of Internet domain names using word embedding techniques," M.S. thesis, Instituto de Computación, Facultad de Ingenierı́a, Universidad de la República, Montevideo, Uruguay, 2019. (T)
[2] W. López, J. Merlino, P. Rodríguez-Bocca, "Learning semantic information from Internet Domain Names using word embeddings", Engineering Applications of Artificial Intelligence (ELSEVIER), Volume 94, 2020, 103823, ISSN 0952-1976.
[3] W. Lopez, J. Merlino and P. Rodriguez-Bocca, "Vector representation of internet domain names using a word embedding technique," 2017 XLIII Latin American Computer Conference (CLEI), Cordoba, 2017, pp. 1-8.
[4] J. Merlino and P. Rodríguez-Bocca, "Short-time prediction of DNS queries using deep learning and pre-trained word embedding," 2021 XLVII Latin American Computing Conference (CLEI), 2021, pp. 1-10, doi: 10.1109/CLEI53233.2021.9640221.