Detection of malicious domains via a large scale network analysis

This is my master's thesis, written jointly at the Department of Mathematics and Its Applications and the Department of Network and Data Science at Central European University in Budapest in 2018-2019 and supervised by Gerardo Iñiguez, PhD. The work was done in close partnership with the cybersecurity firm ESET. The resulting domain reputation model is called Kassiopea.

In this work, we try to identify malicious domains by propagating minimal ground-truth information through a large bipartite temporal network of domains and their hosts (often referred to as Passive Domain Name Server (PDNS)). The propagation is done by the stochastic process of the voter model [2, 8, 16, 18], in which each node is in one of three states: blacklisted, whitelisted, or unknown. Nodes that are initially blacklisted or whitelisted keep their state throughout the process, while the unknown nodes can flip into either of those two states. We call the fixed nodes zealots and the others susceptible. We run the voter model in multiple realizations and average the state each node holds when the process ends or saturates. These averages serve as the label for each domain.
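The propagation described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis code (which can't be shared): the update rule (a random susceptible node copies a random neighbor, ignoring neighbors that are still unknown), the fixed number of steps, the treatment of still-unknown nodes as not blacklisted, and all names (`voter_realization`, `blacklist_scores`, the toy domains and hosts) are assumptions made for the example.

```python
import random

BLACK, WHITE, UNKNOWN = 1, 0, None


def build_adjacency(edges):
    """Undirected adjacency lists from (domain, host) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def voter_realization(adj, zealots, n_steps, rng):
    """One run: zealots never change, susceptible nodes copy a neighbor."""
    state = {node: zealots.get(node, UNKNOWN) for node in adj}
    susceptible = [node for node in adj if node not in zealots]
    if not susceptible:
        return state
    for _ in range(n_steps):
        node = rng.choice(susceptible)
        neighbor = rng.choice(adj[node])
        # Assumption: an unknown neighbor carries no opinion to copy.
        if state[neighbor] is not UNKNOWN:
            state[node] = state[neighbor]
    return state


def blacklist_scores(edges, zealots, n_realizations=200, n_steps=2000, seed=0):
    """Fraction of realizations in which each node ends up blacklisted.

    Assumption: nodes still unknown at the end count as not blacklisted.
    """
    rng = random.Random(seed)
    adj = build_adjacency(edges)
    hits = {node: 0 for node in adj}
    for _ in range(n_realizations):
        final = voter_realization(adj, zealots, n_steps, rng)
        for node, s in final.items():
            if s == BLACK:
                hits[node] += 1
    return {node: hits[node] / n_realizations for node in adj}


# Toy bipartite network: domains on the left, hosts (IPs) on the right.
edges = [
    ("bad.example", "ip1"), ("new.example", "ip1"), ("mixed.example", "ip1"),
    ("good.example", "ip2"), ("other.example", "ip2"), ("mixed.example", "ip2"),
]
zealots = {"bad.example": BLACK, "good.example": WHITE}
scores = blacklist_scores(edges, zealots)
```

In this toy network, `new.example` shares a host with the blacklisted zealot and should receive a higher score than `other.example`, which shares a host with the whitelisted one.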

Abstract

In order to protect users from spam, financial scams, or malware, security companies such as ESET tend to block dangerous domains and Internet Protocol (IP) addresses. Many of them are chronically known for spreading malware and are thus blacklisted, while others are known to be clean and are whitelisted. However, most dangerous domains/IPs are unknown. The aim of this project is to assign a malware probability to domains/IPs using large-scale data on a temporal bipartite network. We model the associated reputation problem as a network inference and graph mining problem, where we construct layers of domains and IP addresses and seed the network with empirical ground truth on malware sources. Then we run the voter model of information spreading to estimate the marginal probabilities of domains/IPs being blacklisted. Our analysis provides an intuitive, scalable way of identifying previously unknown, dangerous sources online.

The entire thesis can be found here: thesis_matej_kerekrety.pdf.

Examples

Unfortunately, the exact code and data can't be shared, but results and sample examples are provided. We developed and tested a few variants of the model:
• Bipartite version: a network of domains and hosts
• Projected version: the initially bipartite network was projected onto a network of domains [1, 7]
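To illustrate what a projection onto the domain layer means, here is a minimal sketch using the simplest weighting, where two domains are linked with a weight equal to the number of hosts they share. This weighting is an assumption for the example; the thesis cites [1, 7], which discuss more refined projection schemes, and the function name `project_onto_domains` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_domains(domain_host_edges):
    """Map each pair of domains to the number of hosts they share."""
    domains_on_host = defaultdict(set)
    for domain, host in domain_host_edges:
        domains_on_host[host].add(domain)

    weights = defaultdict(int)
    for domains in domains_on_host.values():
        # Every pair of domains resolving to the same host gets one more
        # unit of weight; sorting keeps the pair keys canonical.
        for a, b in combinations(sorted(domains), 2):
            weights[(a, b)] += 1
    return dict(weights)


bipartite_edges = [
    ("a.example", "ip1"), ("b.example", "ip1"),
    ("a.example", "ip2"), ("b.example", "ip2"), ("c.example", "ip2"),
]
projected = project_onto_domains(bipartite_edges)
# projected == {("a.example", "b.example"): 2,
#               ("a.example", "c.example"): 1,
#               ("b.example", "c.example"): 1}
```

Note that hosts with many domains create a quadratic number of projected links, which is one reason weighted projection schemes matter at scale.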

In order to test the importance of initial conditions / ground information, we tested the model on two synthetic networks:
• Randomly shuffled links: We kept each node's susceptibility and label as in the original network, but shuffled the links at random while preserving each node's degree.
• Randomly shuffled initial information: We kept the network structure and topology as they are, with nodes and links in their original configuration. We also kept the susceptibility of nodes, but shuffled the zealots' labels at random.
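Both synthetic networks can be sketched as follows. The link shuffle below uses repeated double-edge swaps, a standard way to randomize links while keeping every node's degree; whether the thesis used this exact procedure is an assumption, as are the function names and the requirement that the input edge list has no duplicates.

```python
import random
from collections import Counter


def shuffle_links(edges, n_swaps, rng):
    """Degree-preserving randomization of a bipartite edge list.

    Picks two edges (d1, h1), (d2, h2) and rewires them to (d1, h2),
    (d2, h1), skipping swaps that would create a duplicate link.
    Assumes `edges` holds at least two unique (domain, host) pairs.
    """
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (d1, h1), (d2, h2) = edges[i], edges[j]
        new1, new2 = (d1, h2), (d2, h1)
        if new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges


def shuffle_zealot_labels(zealots, rng):
    """Keep the same zealot nodes but permute their labels at random."""
    nodes = list(zealots)
    labels = [zealots[node] for node in nodes]
    rng.shuffle(labels)
    return dict(zip(nodes, labels))


rng = random.Random(42)
original_edges = [
    ("a.example", "ip1"), ("a.example", "ip2"), ("b.example", "ip2"),
    ("c.example", "ip3"), ("d.example", "ip3"), ("d.example", "ip4"),
]
rewired = shuffle_links(original_edges, n_swaps=100, rng=rng)

original_zealots = {"a.example": 1, "c.example": 0, "d.example": 0}
relabeled = shuffle_zealot_labels(original_zealots, rng)
```

Running the voter model on `rewired` with the original labels, and on the original edges with `relabeled`, separates how much of the result comes from the network structure and how much from the ground information.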

Finally, we test the accuracy and the True Positive and False Positive Rates:
• General validation
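As a reminder of how these three quantities are computed from the averaged voter-model scores, here is a minimal sketch. The decision threshold of 0.5, the held-out ground-truth dictionary, and the function name `validation_rates` are assumptions for illustration, not values taken from the thesis.

```python
def validation_rates(scores, truth, threshold=0.5):
    """Accuracy, TPR and FPR of thresholded scores against known labels.

    scores: node -> estimated probability of being blacklisted
    truth:  node -> True if the node is actually malicious
    """
    tp = fp = tn = fn = 0
    for node, is_malicious in truth.items():
        predicted_malicious = scores[node] >= threshold
        if predicted_malicious and is_malicious:
            tp += 1
        elif predicted_malicious and not is_malicious:
            fp += 1
        elif not predicted_malicious and is_malicious:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
        "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
    }


held_out_scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.4, "e": 0.1}
held_out_truth = {"a": True, "b": False, "c": False, "d": True, "e": False}
rates = validation_rates(held_out_scores, held_out_truth)
# Here: 1 true positive, 1 false positive, 1 false negative, 2 true negatives.
```

Sweeping `threshold` from 0 to 1 and recording the (FPR, TPR) pairs gives the usual ROC curve for the model.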

References:

[1] Suman Banerjee, Mamata Jenamani, Dilip Kumar Pratihar: Algorithms for Projecting a Bipartite Network, (August 2017) https://www.researchgate.net/publication/323067832_Algorithms_for_projecting_a_bipartite_network
[2] Federico Vazquez, Víctor M Eguíluz: Analytical solution of the voter model on uncorrelated networks, (June 2008) https://iopscience.iop.org/article/10.1088/1367-2630/10/6/063011/pdf
[3] Justin Ma, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker: Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs, (2009) http://cseweb.ucsd.edu/jtma/papers/beyondbl-kdd2009.pdf
[4] Dhia Mahjoub, David Rodriguez: Beyond lexical and PDNS: using signals on graphs to uncover online threats at scale, (2017) https://www.virusbulletin.com/uploads/pdf/magazine/2017/VB2017-Mahjoub-Rodriguez.pdf
[5] Pratyusa K. Manadhata, Sandeep Yadav, Prasad Rao, and William Horne: Detecting Malicious Domains via Graph Inference, (2014) http://www.covert.io/research-papers/security/Detecting malicious domains via graph inference.pdf
[6] Leyla Bilge, Engin Kirda, Christopher Kruegel, and Marco Balduzzi: EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis, (February 2011) https://sites.cs.ucsb.edu/chris/research/doc/ndss11_exposure.pdf
[7] Tao Zhou, Jie Ren, Matus Medo, Yi-Cheng Zhang: How to project a bipartite network?, (Jul 2007) Physical Review E 76, 046115 https://arxiv.org/pdf/0707.0540.pdf
[8] Juan Fernández-Gracia, Krzysztof Suchecki, José J. Ramasco, Maxi San Miguel, Víctor M. Eguíluz: Is the Voter Model a model for voters?, (June 2014) https://arxiv.org/pdf/1309.1131.pdf
[9] Kevin P. Murphy: Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series), (2012) The MIT Press; 1 edition
[10] Mark Newman. Networks: An Introduction., (May 2010) Oxford University Press; 1 edition.
[11] Barabási, A. L., Pósfai, M. Network science, (2016) Cambridge: Cambridge University Press. ISBN: 9781107076266 1107076269
[12] Mark Felegyhazi, Christian Kreibich, Vern Paxson: On the Potential of Proactive Domain Blacklisting, (April 2010) https://www.usenix.org/legacy/event/leet10/tech/full_papers/Felegyhazi.pdf
[13] M. Mobilia, A. Petersen, S. Redner: On the Role of Zealotry in the Voter Model, (2 Aug 2007) https://arxiv.org/pdf/0706.2892.pdf
[14] Duen Horng Chau, Carey Nachenberg, Jeffrey Wilhelm, Adam Wright, Christos Faloutsos: Polonium: Tera-Scale Graph Mining and Inference for Malware Detection, (2011) https://www.cc.gatech.edu/~dchau/polonium/polonium_sdm2011.pdf
[15] Claudio Castellano, Santo Fortunato, Vittorio Loreto: Statistical physics of social dynamics, (2009) Reviews of Modern Physics 81, 591-646 https://arxiv.org/pdf/0710.3256.pdf
[16] Juan Fernández-Gracia: Updating rules and the voter model, (January 2011) http://digital.csic.es/bitstream/10261/46143/1/tesinaMaster.pdf
[17] Hung Le, Quang Pham, Doyen Sahoo, Steven C.H. Hoi: URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection, (March 2018) https://arxiv.org/pdf/1802.03162.pdf
[18] V. Sood, Tibor Antal and S. Redner: Voter models on heterogeneous networks, (2008) https://www.maths.ed.ac.uk/~antal/Mypapers/voter08.pdf
