GiorgosNik / url-tracker

Multi-process application that tracks the creation of files in a directory and analyses the contents, aiming to showcase the use of various IPC methods

url-tracker

Description

This project aims to track the creation of new files in a specified directory, and to then analyse the web locations mentioned in these files. It is composed of two main parts, the sniffer and the finder.

The sniffer

This component monitors a directory (by default, the one the program is placed in) for new files. Whenever a new file is created, the sniffer finds the URLs in the file and counts the web locations they refer to. This information is then stored in a file that shares the name of the created file, with the ".out" extension. The sniffer is implemented in C++ and is itself composed of 3 main components:

  • The Manager is the main process of the program. It spawns the Listener (a fork call followed by an execl call) and the Workers (fork calls). It receives messages about new files in the directory from the Listener through a pipe, and then assigns the processing of each file to a Worker. If no Worker is available, a new one is created. Once a Worker is available, the name of the file to be examined is written to that Worker's named pipe. The named pipes are stored under /tmp/. The Manager is also responsible for the termination of the program. At startup, it creates two lists: the first keeps the availability, pid and numeric id of each Worker, and the second contains only the pids of all spawned processes. When it receives a SIGINT signal, the Manager terminates all the spawned processes and then exits.
  • The Listener is created at program startup with a fork call. It then sets up a pipe and redirects its standard output to it. After that, it executes an execl call, replacing itself with inotifywait, configured to monitor the selected directory.
  • The Workers are created by the Manager one at a time, whenever a new file is created and no existing Worker is available to process it. Workers are numbered, starting from 1. This id determines the named pipe through which a Worker receives the names of the files it will examine. When a Worker receives a file name, it reads the file's contents and extracts any URLs within. Each URL the Worker finds is stripped of any unnecessary information, so that only the location remains. For example, given the URL https://github.blog/changelog/, the information to keep is github.blog. A list of these locations is created and used to count how many times each location is referenced. After counting, the Worker creates a new file named [filename].out, where "filename" is the name of the file received from the Manager. These files are stored under /tmp/snifferOut/. Note that this folder is deleted before each new execution of the sniffer.
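The Worker's URL processing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names extractLocation and countLocations, the regular expression used to spot URLs, and the choice to treat everything up to the first "/", ":", "?" or "#" after the scheme as the location are all assumptions.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Reduce a URL to its location by dropping the scheme ("https://") and
// everything after the host (path, port, query, fragment).
// Hypothetical helper; the real Worker may strip more or less than this.
std::string extractLocation(const std::string& url) {
    std::size_t start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    std::size_t end = url.find_first_of("/:?#", start);
    if (end == std::string::npos) {
        return url.substr(start);
    }
    return url.substr(start, end - start);
}

// Find every http(s) URL in the given text and count how many times each
// location is referenced. The pattern is an assumption chosen for brevity.
std::map<std::string, int> countLocations(const std::string& text) {
    static const std::regex urlPattern(R"(https?://[^\s"'<>]+)");
    std::map<std::string, int> counts;
    for (std::sregex_iterator it(text.begin(), text.end(), urlPattern), last;
         it != last; ++it) {
        ++counts[extractLocation(it->str())];
    }
    return counts;
}
```

With these helpers, extractLocation("https://github.blog/changelog/") yields github.blog, matching the example above. A Worker would then write the resulting map to its [filename].out file in whatever format the finder expects.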

The finder shell script

The finder, given a number of domains, loops through them and, after examining all .out files generated by the sniffer, counts the total number of times each domain appears. It then displays this count for every given domain.

Creating and running the executable and shell script

  • To compile the program, use the included Makefile. To create the executable, use make all.
  • To clean the directory by removing the .o object files, use make clean.
  • To run the executable, use ./sniffer. By default, the program monitors the current directory.
  • To monitor a different directory, use ./sniffer -p [path], where path is the directory you want to monitor.
  • To use the shell script, use ./find.sh [domain names], where domain names are one or more domain names you wish to count.

Files

The C++ executable is created from the following files:

  • main.cpp: Forms the base of the program and contains the main loop of the Manager. It is also used to create the Listener and to perform general setup.
  • worker.cpp and worker.hpp: Include the main functions used by the Worker. The function workerMain loops continuously: it receives filenames from the Manager, opens each file and extracts the links. The function workerOutput, given the links in a list, counts them and writes the result to the output file.
  • manager.cpp and manager.hpp: Include the functions used by the Manager (other than the main loop, which is implemented in main.cpp). These are the function that assigns files to Workers, either by finding a free Worker or by creating a new one, and the signal handlers. The program uses two signal handlers: one for the SIGCHLD signals sent by the Workers when they stop or continue, and one for SIGINT. The first handler, called procStop, waits for status updates from any pid. After obtaining the pid that changed status, as well as the new status of that process, the handler updates the status of the Worker stored in workerList. The second handler loops through the children of the main process and propagates the SIGINT signal to them. After receiving confirmation that they have exited, the main process itself exits.
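The SIGCHLD handling described for procStop might look roughly like the sketch below. The names procStop and workerList come from the description above; the WorkerInfo struct, the updateWorker and installHandler helpers, and the rule that a stopped Worker counts as available while a continued one counts as busy are assumptions made for illustration.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical bookkeeping entry: availability, pid and numeric id.
struct WorkerInfo {
    pid_t pid;
    int id;
    bool available;
};

static std::vector<WorkerInfo> workerList;

// Record a status change for one pid.
// Assumption: a stopped Worker is idle, a continued Worker is busy.
void updateWorker(pid_t pid, int status) {
    for (WorkerInfo& w : workerList) {
        if (w.pid != pid) continue;
        if (WIFSTOPPED(status)) {
            w.available = true;
        } else if (WIFCONTINUED(status)) {
            w.available = false;
        }
    }
}

// SIGCHLD handler: collect every pending status change from any child
// without blocking, then update the matching entry in workerList.
void procStop(int /*signum*/) {
    int savedErrno = errno;  // waitpid may overwrite errno
    int status = 0;
    pid_t pid;
    while ((pid = waitpid(-1, &status,
                          WNOHANG | WUNTRACED | WCONTINUED)) > 0) {
        updateWorker(pid, status);
    }
    errno = savedErrno;
}

// Register procStop for SIGCHLD (hypothetical setup helper).
void installHandler() {
    struct sigaction sa {};
    sa.sa_handler = procStop;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  // restart interrupted reads on the pipe
    sigaction(SIGCHLD, &sa, nullptr);
}
```

The WUNTRACED and WCONTINUED flags are what make waitpid report stop and continue events rather than only exits. Under the assumption used here, a Worker would stop itself (for example with raise(SIGSTOP)) once it finishes a file, and the Manager would resume it with SIGCONT before writing the next filename to its named pipe.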

Considerations

Many functions of the sniffer rely on the C++ standard library (STL). Because of this, many STL-managed structures are not freed when the program exits after receiving a SIGINT signal.

License: MIT License


Languages

C++ 95.0%, Shell 2.6%, Makefile 2.4%