DNS Log Parser

Overview

This Python script parses DNS log files and reports on the activity they record. For a given log file, it generates a summary of the number of records processed, a ranking of clients (IPs) by number of queries, and a ranking of the most queried hosts.

Features

  • Accepts a log file name as a parameter for analysis.
  • Generates a summary of the processed records.
  • Ranks clients based on the number of queries made.
  • Identifies the most queried hosts.
  • Presents rankings with both total hits and the percentage they represent of the total records analyzed.
  • Sends the parsed data to the Lumu API.

Usage

To use the script, first you'll need to install the required dependencies:

Setup Virtual Environment and Install Requirements

To ensure a clean and isolated environment for running the DNS Log Parser, it is recommended to use a virtual environment. Follow the steps below to set up a virtual environment and install the required dependencies:

  1. Create a Virtual Environment:
python -m venv venv

  2. Activate the Virtual Environment:

  • On Windows:
.\venv\Scripts\activate
  • On macOS/Linux:
source venv/bin/activate

  3. Install Dependencies:

  • On Windows:
pip install -r requirements.txt
  • On macOS/Linux:
pip3 install -r requirements.txt

Now, your virtual environment is set up, and the required dependencies are installed.

After the dependencies are installed, run the script with the desired DNS log file as a parameter. You can also pass the collector ID and API key as parameters, which sends the parsed data to the Lumu API; if these parameters are not provided, the script only generates the stats report. Both the collector ID and the API key are UUID strings. A sketch of how the arguments might be parsed follows the commands below.

  • On Windows:
python main.py -f <log_file_path> -c <collector_id> -k <api_key>
  • On macOS/Linux:
python3 main.py -f <log_file_path> -c <collector_id> -k <api_key>
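
Below is a minimal sketch of how these command-line options could be wired up with argparse. The short flags match the usage above; the long option names, help text, and function name are illustrative assumptions, not necessarily what main.py uses.

import argparse

def get_args():
    # Short flags mirror the usage shown above (-f, -c, -k); long names are assumed.
    parser = argparse.ArgumentParser(description="DNS log parser")
    parser.add_argument("-f", "--file", required=True, help="Path to the DNS log file")
    parser.add_argument("-c", "--collector-id", help="Lumu collector ID (UUID)")
    parser.add_argument("-k", "--key", help="Lumu API key (UUID)")
    return parser.parse_args()

# If -c and -k are omitted, only the stats report is generated.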

Sample Output

Upon execution, the script will provide output similar to the following:

Parsed File Statistics:
Total records: 16967

Client IPs Rank
---------------  ----  ------
111.90.159.121   3375  19.89%
45.231.61.2      1251  7.37%
187.45.191.2     1089  6.42%
190.217.123.244   738  4.35%
5.63.14.45        634  3.74%
---------------  ----  ------

Host Rank
--------------------------------------------  ----  ------
pizzaseo.com                                  4626  27.26%
sl                                            3408  20.09%
MNZ-efz.ms-acdc.office.com                      67  0.39%
global.asimov.events.data.trafficmanager.net    31  0.18%
www.google.com                                  30  0.18%
--------------------------------------------  ----  ------

Computational Complexity of the Implemented Algorithms

Parsing Algorithm

The algorithm used to parse and store the log file data is based on a dictionary, a data structure that provides average constant-time insertion and retrieval. The dictionary maps each unique host/client to its number of queries. The algorithm iterates over the log file and, for each line, extracts the needed information via a regex (O(len), where len is the length of the line). It then checks whether the host/client is already in the dictionary: if it is, it increments the query count; if it is not, it adds the host/client with a value of 1. As a result, the parsing algorithm processes O(n) lines, where n is the number of lines in the log file, at O(len) work per line.
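
A minimal sketch of this approach, assuming a BIND-style query log in which each line carries a client IP and a queried host. The regex and field layout are illustrative assumptions, not the exact pattern used by the script:

import re

# Illustrative pattern: capture the client IP and the queried host from a
# BIND-style line such as
#   "... client 111.90.159.121#53123 ... query: pizzaseo.com IN A ..."
LINE_RE = re.compile(r"client (?P<client>[\d.]+)#\d+.*query: (?P<host>\S+)")

def parse_log(path):
    clients, hosts = {}, {}                 # unique key -> query count
    with open(path) as log:
        for line in log:                    # O(n) lines
            match = LINE_RE.search(line)    # O(len) per line
            if not match:
                continue
            # Dictionary lookup/insert is O(1) on average
            client, host = match.group("client"), match.group("host")
            clients[client] = clients.get(client, 0) + 1
            hosts[host] = hosts.get(host, 0) + 1
    return clients, hosts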

Ranking Algorithm

The ranking algorithm is based on the collections.Counter class from the Python standard library, a subclass of dict that provides a convenient way to keep track of the number of occurrences of elements. Counter provides a most_common method that returns the most common elements and their respective counts. Called with no argument, it sorts all elements, which is O(n log n), where n is the number of elements in the counter. However, since we are only interested in the top 5 elements of each dict, we call most_common(k) with a fixed k; in that case it internally employs heapq.nlargest, which has a time complexity of O(n log k), where n is the number of elements in the dict and k is the number of elements to return.

In the case of the DNS Parser, k = 5, so ranking the clients costs O(c log 5), where c is the number of unique clients. Since this operation is performed twice (once for clients and once for hosts), the total time complexity of the ranking step is O(c log 5) + O(h log 5), where h is the number of unique hosts. Dropping the constant factor log 5, this is O(c) + O(h), which is linear in the number of unique keys.
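
As a small illustration of this step (the function name and dict contents below are assumptions for the example, not taken from the script):

from collections import Counter

def top_five(counts):
    # counts maps each unique client or host to its query count, as built
    # during parsing. most_common(5) uses heapq.nlargest internally, so this
    # runs in O(n log 5) over the n unique keys.
    return Counter(counts).most_common(5)

# Example (values taken from the sample output above):
# top_five({"pizzaseo.com": 4626, "sl": 3408, "www.google.com": 30})
# -> [('pizzaseo.com', 4626), ('sl', 3408), ('www.google.com', 30)]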

Total Time Complexity

The total time complexity of the DNS Parser is then O(n * len) + O(c) + O(h), where n is the number of lines in the log file, len is the length of a line, c is the number of unique clients, and h is the number of unique hosts. Since c and h can never exceed the number of lines in the log file, those terms are dominated by the parsing step, and the total time complexity of the DNS Parser is O(n * len).
