This repository contains my data mining laboratory works from the fourth year of the Computer Science programme at V. N. Karazin Kharkiv National University (KhNU).

Home page: https://www.linkedin.com/in/kiprenko/

data-mining-purple

This repository stores my simple data-mining scripts, implemented as part of my fourth-year university studies. The repository contains only Python scripts. You may need additional (very popular and easy-to-install) Python libraries, such as:

  • matplotlib
  • networkx
  • numpy
  • nltk
  • pandas

I won't list all of them here, because I'm sure you will figure out which libraries you need and how to install them on your own.
The word "purple" was added to the repository name just to distinguish it from my other possible data-mining repositories.

Description

The repository consists of five parts:

  1. data-statistical-analysis
  2. naive-bayes-classifier
  3. k-means-method
  4. page-rank
  5. market-basket-analysis

Each of these parts has its own folder where you will find everything you need to run the scripts.

data-statistical-analysis

Open folder

In this part I performed a statistical analysis of the file sms-spam-corpus.csv, which contains spam and ham SMS messages.
The part consists of three sub-parts:

  1. Plot the distribution of word lengths for each category and the average word length (a minimal sketch of this sub-part is shown at the end of this section)
  2. Plot the distribution of message lengths for each category and the average message length
  3. Conduct a frequency analysis of word occurrences for the two categories. Display on the graphs the 20 most common words for each category separately

You must run data_processor.py before running the other scripts.

python data_processor.py

Then, choose the sub-part you need and run the corresponding script. For example, to run the first sub-part:

python task_1_word_len_plot.py

After this, the script will show two plots (one for ham words and one for spam words).

ham-words-plot

You can find the average word length in the console (terminal) output:

Average word length for spam is 5.339024390243902
Average word length for ham is 5.131272645653637

Also, all three sub-part scripts save their plots to a dedicated folder.
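
For reference, here is a minimal sketch of how sub-part 1 could look. The column names ("v1" for the label, "v2" for the text) and the encoding are assumptions about sms-spam-corpus.csv; the actual script may instead read the files produced by data_processor.py.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout of sms-spam-corpus.csv: label column "v1", message column "v2".
corpus = pd.read_csv("sms-spam-corpus.csv", encoding="latin-1")

for label in ("ham", "spam"):
    texts = corpus.loc[corpus["v1"] == label, "v2"]
    word_lengths = [len(word) for text in texts for word in str(text).split()]
    print(f"Average word length for {label} is {sum(word_lengths) / len(word_lengths)}")

    # Histogram of word lengths for this category.
    plt.hist(word_lengths, bins=range(1, 21))
    plt.title(f"{label} word length distribution")
    plt.xlabel("word length")
    plt.ylabel("count")
    plt.show()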

naive-bayes-classifier

Open folder

This part calculates the probability that a message is spam or ham, using the training data in the ham_words_count.csv and spam_words_count.csv files.
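
Conceptually, the classifier multiplies a class prior by per-word likelihoods taken from the training counts and then normalizes the two scores. The following is a minimal sketch under that assumption; the smoothing, priors, and sample counts here are illustrative and not necessarily what main.py does.

def class_score(words, word_counts, prior):
    # Prior times the product of per-word likelihoods (with add-one smoothing).
    total = sum(word_counts.values())
    vocab = len(word_counts)
    score = prior
    for word in words:
        score *= (word_counts.get(word, 0) + 1) / (total + vocab)
    return score

# Hypothetical training counts; the real counts live in spam_words_count.csv and ham_words_count.csv.
spam_counts = {"free": 40, "win": 25, "call": 30}
ham_counts = {"world": 10, "go": 8, "call": 12}

message = ["world", "go"]  # an already-prepared message, like in the output below
spam = class_score(message, spam_counts, prior=0.5)
ham = class_score(message, ham_counts, prior=0.5)
print("Spam probability:", spam / (spam + ham))  # normalization
print("Ham probability:", ham / (spam + ham))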

Run the script with the following command:

python main.py

Then you will see the following output:

Welcome to Saint-Tropez
Spam file name is "spam_words_count.csv"
Ham file name is "ham_words_count.csv"
Type your message:

You need to type your message here. For example:

Welcome to Saint-Tropez
Spam file name is "spam_words_count.csv"
Ham file name is "ham_words_count.csv"
Type your message: Hello, world! How it's going?

After this you will see the result:

Preparing message...
Prepared message: "['world', 'go']"
Loading training data from files to memory...
Training data was successfully loaded
Calculating probability for both spam and ham...
Spam probability is "2.0651496689859363e-08"
Ham probability is "3.906686353734016e-08"
Making normalization...
Normalized:
Spam probability is "0.3458148651652589"
Ham probability is "0.6541851348347411"
Classifying the message...
The message "Hello, world! How it's going?" is a HAM

So, in the example above, our classifier thinks that the message "Hello, world! How it's going?" is ham.
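
The "normalization" step in the output above is simply dividing each raw score by the sum of the two scores:

spam_raw = 2.0651496689859363e-08
ham_raw = 3.906686353734016e-08
print(spam_raw / (spam_raw + ham_raw))  # ~0.3458 (spam)
print(ham_raw / (spam_raw + ham_raw))   # ~0.6542 (ham)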

k-means-method

Open folder

In this part, I implemented the k-means method for a 2-D array of points. Example input files can be found in the points_datasets folder.
To run the script, use the following command:

python main.py 7 points_datasets/s1.txt

Where:

  • 7 - the number of clusters (you may choose your own value);
  • s1.txt - the file with point data (you may use another file if you want).

After the execution, you will see the first plot, in which centers have been randomly selected and the points have been assigned to them.

random-placed-centers.png

Close this plot and the program will start clustering. When it finishes, the program will show you the final plot.

result.png

The number of steps needed to finish can be found in the console (terminal) output:

Program used 16 step(-s) to finish.
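
For intuition, the k-means loop can be sketched with numpy as below. This is a simplified illustration of the standard method, not necessarily the repository's exact implementation.

import numpy as np

def kmeans(points, k, max_steps=100, seed=0):
    # Pick k random points as initial centers, then iterate until the centers stop moving.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for step in range(1, max_steps + 1):
        # Assign each point to its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            return labels, centers, step
        centers = new_centers
    return labels, centers, max_steps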

page-rank

Open folder

The script builds a page graph of a site and calculates the rank of the site's pages. You can run the script with the following command:

python main.py http://thedemosite.co.uk/

For this example, you will see the following graph:

page-rank-graph-example

The rank of each page can be found in the console (terminal) output:

{
  "/": 0.018750000000000003,
  "index.php": 0.17862779158724276,
  "thedatabase.php": 0.17862779158724276,
  "addauser.php": 0.17862779158724276,
  "login.php": 0.17862779158724276,
  "getyourowndbonline.php": 0.17862779158724273,
  "addausercode.php": 0.044055521031893,
  "logincode.php": 0.044055521031893
}
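
For reference, once the link graph is known, ranks like the ones above can be computed with networkx roughly like this. The edges below are invented for illustration; the real script discovers them by crawling the site.

import networkx as nx

graph = nx.DiGraph()
# Hypothetical links between pages; the actual script builds this graph from the crawled site.
graph.add_edges_from([
    ("index.php", "thedatabase.php"),
    ("index.php", "addauser.php"),
    ("thedatabase.php", "login.php"),
    ("addauser.php", "logincode.php"),
])

ranks = nx.pagerank(graph, alpha=0.85)  # 0.85 is the conventional damping factor
print(ranks)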

market-basket-analysis

Open folder

In this part I implemented market basket analysis. The analysis may be done with the Apriori algorithm or with a genetic algorithm.

CAUTION: the Apriori script takes a long time to run and puts a heavy load on the processor. I don't recommend running it on a large amount of data.

To run the Apriori algorithm script, use the following command:

python apriori_alg.py 2 test.XLSX

Where:

  • 2 - the support level (you may choose your own value);
  • test.XLSX - the file with purchase data (you may use another file if you want).

After the script execution, you will see the following in your console (terminal):

{
  "('84029G', '85123A', '84030E')": 2,
  "('84029G', '85123A', '84029E')": 2,
  "('84029G', '85123A', '85014B')": 2,
  "('84029G', '85123A', '84625A')": 2,
  "('84029G', '85123A', '85014A')": 2,
  "('84029G', '85123A', '85099F')": 2,
  "('84029G', '84029E', '85014B')": 2,
  "('84029G', '84029E', '84625A')": 2,
  "('84029G', '84029E', '85014A')": 2,
  "('84029G', '84029E', '85099F')": 2,
  "('15056BL', '15056N', '85099B')": 2,
  "('15056BL', '15056N', '85123A')": 2,
  "('15056BL', '15056N', '84030E')": 2,
  "('15056BL', '15056N', '84029E')": 2,
  "('15056BL', '15056N', '85014B')": 2,
  "('15056BL', '15056N', '84625A')": 2,
  "('15056BL', '15056N', '85014A')": 2,
  "('15056BL', '15056N', '85099F')": 2,
  "('85099C', '85099B', '85123A')": 3,
  "('85099C', '85099B', '84030E')": 3,
  "('85099C', '85099B', '84029E')": 3,
  "('85099C', '85099B', '85014B')": 3,
  "('85099C', '85099B', '84625A')": 3,
  "('85099C', '85099B', '85014A')": 3,
  "('85099C', '85099B', '85099F')": 3,
  "('85014B', '85014A', '85099F')": 2,
  "('85099B', '85123A', '84030E')": 2,
  "('85099B', '85123A', '84029E')": 2,
  "('85099B', '85123A', '85014B')": 2,
  "('85099B', '85123A', '84625A')": 2,
  "('85099B', '85123A', '85014A')": 2,
  "('85099B', '85123A', '85099F')": 2,
  "('85123A', '84030E', '84029E')": 2,
  "('85123A', '84030E', '85014B')": 2,
  "('85123A', '84030E', '84625A')": 2,
  "('85123A', '84030E', '85014A')": 2,
  "('85123A', '84030E', '85099F')": 2,
  "('84625C', '84625A', '85014A')": 2,
  "('84625C', '84625A', '85099F')": 2,
  "('84625C', '85169B', '15056N')": 2,
  "('84625C', '85169B', '85099B')": 2,
  "('84625C', '85169B', '85123A')": 2,
  "('84625C', '85169B', '84030E')": 2,
  "('84625C', '85169B', '84029E')": 2,
  "('84625C', '85169B', '85014B')": 2,
  "('84625C', '85169B', '84625A')": 2,
  "('84625C', '85169B', '85014A')": 2,
  "('84625C', '85169B', '85099F')": 2,
  "('85231G', '85231B', '84029G')": 2,
  "('85231G', '85231B', '85099C')": 2,
  "('85231G', '85231B', '85169B')": 2,
  "('85231G', '85231B', '15056N')": 2,
  "('85231G', '85231B', '85099B')": 2,
  "('85231G', '85231B', '85123A')": 2,
  "('85231G', '85231B', '84030E')": 2,
  "('85231G', '85231B', '84029E')": 2,
  "('85231G', '85231B', '85014B')": 2,
  "('85231G', '85231B', '84625A')": 2,
  "('85231G', '85231B', '85014A')": 2,
  "('85231G', '85231B', '85099F')": 2,
  "('85231G', '85099B', '85123A')": 2,
  "('85231G', '85099B', '84030E')": 2,
  "('85231G', '85099B', '84029E')": 2,
  "('85231G', '85099B', '85014B')": 2,
  "('85231G', '85099B', '84625A')": 2,
  "('85231G', '85099B', '85014A')": 2,
  "('85231G', '85099B', '85099F')": 2,
  "('85231B', '85099B', '85123A')": 2,
  "('85231B', '85099B', '84030E')": 2,
  "('85231B', '85099B', '84029E')": 2,
  "('85231B', '85099B', '85014B')": 2,
  "('85231B', '85099B', '84625A')": 2,
  "('85231B', '85099B', '85014A')": 2,
  "('85231B', '85099B', '85099F')": 2
}

Here the key is a triple of products, and the value is the number of baskets in which this triple appears.
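
Conceptually, these supports are just counts of how many baskets contain all three products of a triple. A brute-force illustration of that idea (not the candidate-pruning logic of the actual Apriori implementation) could look like this:

from collections import Counter
from itertools import combinations

def triple_supports(baskets, min_support=2):
    # Count, for every product triple, the number of baskets containing all three items.
    counts = Counter()
    for basket in baskets:
        for triple in combinations(sorted(set(basket)), 3):
            counts[triple] += 1
    return {triple: count for triple, count in counts.items() if count >= min_support}

# Toy baskets with product codes; the real data comes from the purchases file.
baskets = [
    ["84029G", "85123A", "84030E", "84029E"],
    ["84029G", "85123A", "84030E"],
    ["85099C", "85099B", "85123A"],
]
print(triple_supports(baskets, min_support=2))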

To run the genetic algorithm script, use the following command:

python genetic_alg.py "Filtered Online Retail.xlsx"

Where:

  • "Filtered Online Retail.xlsx" - the file with purchase data (you may use another file if you want).

After the script execution, you will see the top 10 product triples in your console (terminal):

 [
  [
    "20728",
    "20726",
    "22382",
    180
  ],
  [
    "21928",
    "85099B",
    "22386",
    130
  ],
  [
    "22423",
    "22697",
    "47566",
    126
  ],
  [
    "85099C",
    "23199",
    "85099B",
    123
  ],
  [
    "22423",
    "22697",
    "85123A",
    104
  ],
  [
    "23199",
    "23344",
    "22382",
    86
  ],
  [
    "20728",
    "20726",
    "21212",
    84
  ],
  [
    "21928",
    "85099B",
    "20713",
    84
  ],
  [
    "22423",
    "22697",
    "21212",
    78
  ],
  [
    "20728",
    "20726",
    "22086",
    78
  ]
]
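
One way a genetic search over product triples can work is to use basket support as the fitness function, mix the parents' items as crossover, and swap in a random product as mutation. The following is a simplified sketch of that idea, not the repository's genetic_alg.py:

import random

def support(triple, baskets):
    # Fitness: the number of baskets that contain all three products.
    items = set(triple)
    return sum(1 for basket in baskets if items <= set(basket))

def genetic_triples(baskets, pop_size=50, generations=100, top=10, seed=0):
    rng = random.Random(seed)
    products = sorted({product for basket in baskets for product in basket})
    population = [tuple(rng.sample(products, 3)) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the half of the population with the highest support.
        population.sort(key=lambda t: support(t, baskets), reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            pool = list(dict.fromkeys(a + b))   # crossover: pool both parents' items
            rng.shuffle(pool)
            child = pool[:3]
            if rng.random() < 0.1:              # mutation: replace one random item
                child[rng.randrange(3)] = rng.choice(products)
            if len(set(child)) < 3:             # repair accidental duplicates
                child = list(rng.sample(products, 3))
            children.append(tuple(child))
        population = survivors + children
    best = sorted(set(population), key=lambda t: support(t, baskets), reverse=True)
    return [(triple, support(triple, baskets)) for triple in best[:top]]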

Author 🦝