AkdenizKutayOcal / WebSearch-Dictionary

Author: Akdeniz Kutay Ocal

This project was developed for the CMPE382 course. The assignment was to build a dictionary
for web search query logs and write the queries into it. Each line of the dictionary holds
a unique query followed by the frequency of that query in the Data files.


Design overview:

    The project consists of several tasks, and the code is separated accordingly.
    Detailed explanations of what each task does can be found in the report.
    The data files used during execution are in the Data folder. The output
    dictionaries of each task are stored in the Resulting Dictionaries folder.

    Task 1: 

        task1.c contains the Task 1 implementation, which is the Trie data structure.
        Compile and run it from the console:

        $ gcc -o task1 task1.c
        $ ./task1
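
        As an illustration of the idea behind Task 1, here is a minimal trie sketch that
        counts how often each query is inserted and writes one "query, frequency" line per
        unique query. It is not the code from task1.c. The names (TrieNode, trie_insert,
        trie_count, trie_dump, trie_free), the 7-bit ASCII alphabet, the 255-character
        query limit, and the tab separator are all assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 128   /* assumption: queries contain only 7-bit ASCII bytes */
#define MAX_QUERY 256  /* assumption: queries are at most 255 characters long */

/* One node per prefix; count is the number of queries that end at this node. */
typedef struct TrieNode {
    struct TrieNode *child[ALPHABET];
    unsigned long count;
} TrieNode;

/* Allocate a zeroed node; returns NULL when memory runs out. */
TrieNode *trie_new(void) {
    return calloc(1, sizeof(TrieNode));
}

/* Insert one query and bump its frequency.
   Returns 0 on success, -1 on a too-long query, a non-ASCII byte,
   or an allocation failure. */
int trie_insert(TrieNode *root, const char *query) {
    assert(root != NULL && query != NULL);
    if (strlen(query) >= MAX_QUERY) return -1;
    TrieNode *node = root;
    for (const unsigned char *p = (const unsigned char *)query; *p; p++) {
        if (*p >= ALPHABET) return -1;
        if (node->child[*p] == NULL) {
            node->child[*p] = trie_new();
            if (node->child[*p] == NULL) return -1;
        }
        node = node->child[*p];
    }
    node->count++;
    return 0;
}

/* Frequency of a query; 0 if it was never inserted. */
unsigned long trie_count(const TrieNode *root, const char *query) {
    const TrieNode *node = root;
    for (const unsigned char *p = (const unsigned char *)query; *p; p++) {
        if (*p >= ALPHABET || node->child[*p] == NULL) return 0;
        node = node->child[*p];
    }
    return node->count;
}

/* Depth-first walk; buf holds the prefix that leads to the current node. */
static void trie_dump_rec(const TrieNode *node, char *buf, size_t depth, FILE *out) {
    if (node->count > 0) {
        buf[depth] = '\0';
        fprintf(out, "%s\t%lu\n", buf, node->count);
    }
    if (depth + 1 >= MAX_QUERY) return;
    for (int c = 0; c < ALPHABET; c++) {
        if (node->child[c] != NULL) {
            buf[depth] = (char)c;
            trie_dump_rec(node->child[c], buf, depth + 1, out);
        }
    }
}

/* Write every unique query and its frequency, one per line, in byte order. */
void trie_dump(const TrieNode *root, FILE *out) {
    char buf[MAX_QUERY];
    trie_dump_rec(root, buf, 0, out);
}

/* Release the whole trie. */
void trie_free(TrieNode *node) {
    if (node == NULL) return;
    for (int c = 0; c < ALPHABET; c++) trie_free(node->child[c]);
    free(node);
}
```

        A caller would create the root with trie_new(), call trie_insert() once per query
        line, and finish with trie_dump(root, stdout) and trie_free(root). With "weather"
        inserted twice and "web" once, trie_dump writes "weather", a tab, and 2 on the first
        line, then "web", a tab, and 1 on the second. Note that each node in this sketch
        holds 128 child pointers, so memory use grows quickly on large query logs.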

    Task 2: 

        task2.c contains the Task 2 implementation, which is Sequential Execution -
        One Query at a Time. Compile and run it from the console:

        $ gcc -o task2 task2.c
        $ ./task2
    
    Task 3: 

        task3.c contains the Task 3 implementation, which is Sequential Execution -
        Multiple Queries. Compile and run it from the console:

        $ gcc -o task3 task3.c
        $ ./task3
    
    Task 4: 

        task4.c contains the Task 4 implementation, which is the Threaded Execution.
        Compile and run it from the console:

        $ gcc -pthread -o task4 task4.c
        $ ./task4
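
        The sketch below shows one common way to arrange threaded execution with pthreads:
        one thread per data file, all threads updating a single shared dictionary guarded
        by a mutex. Whether task4.c is organized this way is an assumption; the report has
        the actual design. To keep the example short, the shared dictionary is a small
        linear table standing in for the trie, and every name in it (SharedDict, dict_init,
        run_workers, dict_count, MAX_ENTRIES, MAX_FILES) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 1024 /* assumption: small table, a stand-in for the trie */
#define MAX_QUERY 256    /* assumption: longer lines are split by fgets */
#define MAX_FILES 16     /* assumption: upper bound on worker threads */

/* Shared dictionary: every field is protected by lock. */
typedef struct {
    char query[MAX_ENTRIES][MAX_QUERY];
    unsigned long count[MAX_ENTRIES];
    size_t used;
    pthread_mutex_t lock;
} SharedDict;

typedef struct {
    SharedDict *dict;
    const char *path;
    int ok; /* set to 1 once the file has been read completely */
} WorkerArg;

void dict_init(SharedDict *d) {
    d->used = 0;
    pthread_mutex_init(&d->lock, NULL);
}

/* Add one occurrence of q. The caller must already hold d->lock. */
static int dict_add_locked(SharedDict *d, const char *q) {
    for (size_t i = 0; i < d->used; i++) {
        if (strcmp(d->query[i], q) == 0) { d->count[i]++; return 0; }
    }
    if (d->used == MAX_ENTRIES) return -1; /* table full */
    snprintf(d->query[d->used], MAX_QUERY, "%s", q);
    d->count[d->used++] = 1;
    return 0;
}

/* Thread body: read one file line by line, locking around each update. */
static void *worker(void *p) {
    WorkerArg *arg = p;
    char line[MAX_QUERY];
    FILE *in = fopen(arg->path, "r");
    if (in == NULL) return NULL;
    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0'; /* strip the line ending */
        if (line[0] == '\0') continue;      /* skip empty lines */
        pthread_mutex_lock(&arg->dict->lock);
        dict_add_locked(arg->dict, line);
        pthread_mutex_unlock(&arg->dict->lock);
    }
    fclose(in);
    arg->ok = 1;
    return NULL;
}

/* Start one thread per path and wait for all of them.
   Returns 0 if every thread started and read its file, -1 otherwise. */
int run_workers(SharedDict *d, const char **paths, int n) {
    pthread_t tid[MAX_FILES];
    WorkerArg arg[MAX_FILES];
    int started = 0, all_ok = 1;
    assert(d != NULL && paths != NULL);
    if (n < 0 || n > MAX_FILES) return -1;
    for (int i = 0; i < n; i++) {
        arg[i].dict = d;
        arg[i].path = paths[i];
        arg[i].ok = 0;
        if (pthread_create(&tid[i], NULL, worker, &arg[i]) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        if (!arg[i].ok) all_ok = 0;
    }
    return (all_ok && started == n) ? 0 : -1;
}

/* Frequency of q across all files read so far; 0 if absent. */
unsigned long dict_count(SharedDict *d, const char *q) {
    unsigned long result = 0;
    pthread_mutex_lock(&d->lock);
    for (size_t i = 0; i < d->used; i++) {
        if (strcmp(d->query[i], q) == 0) { result = d->count[i]; break; }
    }
    pthread_mutex_unlock(&d->lock);
    return result;
}
```

        A caller would declare a static SharedDict, call dict_init() on it, and pass the
        list of data file paths to run_workers(). A single lock keeps the sketch simple but
        serializes every update, so the threads mostly overlap their file reading rather
        than their dictionary work.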

    Task 5: 

        task5.c contains the Task 5 implementation, which is the Threaded Execution -
        Multiple Tries. Compile and run it from the console:

        $ gcc -pthread -o task5 task5.c
        $ ./task5

    Task 6: 

        task6.c contains the Task 6 implementation, which is the Completely Memory-Based
        Dictionary Creation. Compile and run it from the console:

        $ gcc -o task6 task6.c
        $ ./task6

    Task 7: 

        Task 7 contains improvements to previous tasks. task7a.c is an improved version of
        task2.c that uses an Improved Trie (explained in the report). Compile and run it
        from the console:

        $ gcc -o task7a task7a.c
        $ ./task7a
        
        task7b.c is an improved version of task5.c (explained in the report). Compile and
        run it from the console:

        $ gcc -pthread -o task7b task7b.c
        $ ./task7b

Complete specification:

    Every task and every step taken while writing the code is explained in detail in the
    report, which is included in the zip in PDF format.

Known bugs or problems:

    We were given 10 data files with a very large number of queries, but I could not work
    with them as provided because my computer does not meet the performance requirements.
    I used a tablet laptop with 4 GB of RAM and 4 cores, and I ran into memory overflows
    and "Killed" errors. I therefore split the files so that each of the 10 files has
    50000 lines of queries to work on. As a result, I could not test my code on large
    files.

    Task 5 caused a memory overflow while working with this data. I therefore had to ask a
    friend to run my code on his computer in order to compare the results on the same
    dataset. The effect of running it on a different computer and the comparison of the
    results can be found in the report.
	
