PintCoders / hadoop-apriori

Apriori algorithm implemented in Hadoop


In this assignment, you will implement the Apriori algorithm in Hadoop.

***********
1. Overview
***********

The Apriori algorithm finds all itemsets whose support is larger than or equal to minsup.
It makes multiple passes over the transaction data.
In the first pass, the support of each individual item is counted and the frequent items are determined (based on minsup).
In each subsequent pass, a seed set of itemsets found to be frequent in the previous pass is used to generate new potentially frequent itemsets, called candidate itemsets;
their actual support is then counted while making a pass over the transaction data.
At the end of the pass, the itemsets that satisfy the minimum support constraint are collected, that is, the frequent itemsets are determined, and they become the seed for the next pass.
This process is repeated until no new frequent itemsets are found. (In this assignment, the number of passes is given as a command-line argument.)

In this assignment, you implement the Apriori algorithm in the MapReduce (Hadoop) framework. 
We briefly describe the algorithm in the following; in the description we use this notation:
    L_k : frequent itemsets of length k.
    C_k : candidate frequent itemsets of length k.

To find the 1-itemsets (frequent itemsets of size 1, L_1) in MapReduce, we simply emit (item, 1) in the Map phase, then add the counts up in the Reduce phase.
To generate C_(k+1) from L_k in MapReduce, we do the following:

 In Pre-Map phase: (in the setup() function, which is invoked once for each map instance)
  1) Self-join L_k to create itemsets of size k+1.

      In the self-joining step, you generate a (k+1)-itemset only from two k-itemsets whose first (k-1) items are identical.
  	For example,
      Case 1
      Suppose we have two itemsets of length 3 (list[0..2]):
          [1, 2, 3], [1, 2, 4]

      In this case, the itemsets are the same from index 0 to index 1, so you generate the new 4-itemset and then prune it against L_3.

      Case 2
      Suppose we have two itemsets of length 3 (list[0..2]):
          [1, 2, 3], [2, 3, 4]

      In this case, checking index 0 is enough: the prefixes differ, so you break out of the loop
      and do not need to generate a (k+1)-itemset.

      Think about why it is sufficient to generate a new (k+1)-itemset only from pairs of k-itemsets that agree on the first (k-1) consecutive items. (A sketch of this join step is given right after this Pre-Map list.)

  2) For each k-subset of each of the above itemsets, we check whether it is in L_k; we do this check approximately by keeping the hash codes of the itemsets in L_k in a set and checking against this set.
     If all of the subsets are in L_k, then we add the itemset to C_(k+1).
  3) Add the itemsets in C_(k+1) into the Trie (implemented in the previous assignment).
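
 To make step 1) concrete, here is a minimal sketch of the self-join in Java.
 It assumes itemsets are kept as sorted lists of Integers and that L_k itself is
 sorted lexicographically; the class and method names are illustrative only and
 are not part of the template API.

  import java.util.ArrayList;
  import java.util.List;

  public class SelfJoinSketch {
      // lk: L_k as a lexicographically sorted list of sorted k-itemsets.
      public static List<List<Integer>> selfJoin(List<List<Integer>> lk) {
          List<List<Integer>> candidates = new ArrayList<>();
          if (lk.isEmpty()) return candidates;
          int k = lk.get(0).size();
          for (int i = 0; i < lk.size(); i++) {
              for (int j = i + 1; j < lk.size(); j++) {
                  List<Integer> a = lk.get(i), b = lk.get(j);
                  boolean samePrefix = true;
                  for (int p = 0; p < k - 1; p++) {     // compare the first (k-1) items
                      if (!a.get(p).equals(b.get(p))) { samePrefix = false; break; }
                  }
                  // Case 2: lk is sorted, so once the prefix differs no later b can match.
                  if (!samePrefix) break;
                  // Case 1: same prefix, so join by appending b's last item to a.
                  List<Integer> candidate = new ArrayList<>(a);
                  candidate.add(b.get(k - 1));
                  candidates.add(candidate);            // still to be pruned in step 2)
              }
          }
          return candidates;
      }
  }

 For [1, 2, 3] and [1, 2, 4] this produces [1, 2, 3, 4]; for [1, 2, 3] and
 [2, 3, 4] the prefix check fails at index 0 and the inner loop breaks, as in
 Case 2 above.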

 In n-Map phase: (n > 1)
  4) Read part of the transaction file and, for each transaction, call the findItemSets() function to get the candidate itemsets appearing in that transaction.
  5) Emit the above itemsets as (itemset, 1).

 In Reduce phase:
  6) Add the counts up and, if the itemset satisfies minsup, add it into L_(k+1).

The following example illustrates how the above algorithm runs.

 * number of transactions = 6.
 * min_sup = 40%.
 * minimum support count = 2 (min_sup × number of transactions, rounded down).

 Iteration 1 (find 1-itemsets):

    Database                                (A, 1)                               Output
    +-----+----------------+                (B, 1)                               +---------------+
    | TID | Items          |    +--------+  (C, 1) ...       +------------+      | <Item, Count> |
    | t1  | A, B, C        |--->| Map - 1| ----------------> | Reduce - 1 |      +---------------+
    | t2  | A, C           |    +--------+               +-> +------------+      | <A, 3>        | 
    | t3  | B, C, D, E     |                 ------------|                       | <B, 2>        |
    | t4  | A, D           |    +--------+  /                +------------+      | <C, 4>        |
    | t5  | E              |--->| Map - 2| ----------------> | Reduce - 2 |      | <D, 3>        |
    | t6  | C, D, E, F, G  |    +--------+                   +------------+      | <E, 5>        | 
    +-----+----------------+                                                     +---------------+
                                                                                        |
 Iteration 2 (find 2-itemsets)       +--------------------------------------------------+
                                     | Use the previous output
    Database                         | (in setup() build Trie of C_(k+1) )        Output
    +-----+----------------+         v                                           +---------------+
    | TID | Items          |    +--------+                   +------------+      | <Item, Count> |
    | t1  | A, B, C        |--->| Map - 1| ----------------> | Reduce - 1 |      +---------------+
    | t2  | A, C           |    +--------+               +-> +------------+      | <A C, 2>      | 
    | t3  | B, C, D, E     |                 ------------|                       | <B C, 2>      |
    | t4  | A, D           |    +--------+  /                +------------+      | <C D, 2>      |
    | t5  | E              |--->| Map - 2| ----------------> | Reduce - 2 |      | <C E, 2>      |
    | t6  | C, D, E, F, G  |    +--------+  ((A,C), 1)       +------------+      | <D E, 2>      | 
    +-----+----------------+                ((B,C), 1)                           +---------------+
                                            ...   (assuming (A,C), (B,C) are in the Trie)
    ...
    
 Iteration k
    
    (Repeated until L_k is empty)
    
[1] Apriori-based Frequent Itemset Mining Algorithms on MapReduce.
    https://dl.acm.org/citation.cfm?id=2184842

[2] https://en.wikipedia.org/wiki/Apriori_algorithm
[3] BigdataSystem Lecture 9.

************
2. Template Code 
************
In the template code, we provide the following classes:

 AprioriDriver class
 AprioriPass1Mapper class
 AprioriPassKMapper class
 AprioriReducer class
 ItemSet class
 Transaction class
 Trie class
 TrieNode class
 AprioriUtils class

AprioriDriver class : sets up the execution information for each MapReduce job. (A sketch of one pass is given below.)
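
  As a rough illustration, setting up one pass might look like the sketch below.
  The mapper/reducer class names are the template classes listed above; the
  configuration key names ("minSup", "numTransactions", "pass") are assumptions
  made for this sketch, not necessarily the template's actual keys.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class DriverSketch {
      public static void runPass(int pass, String input, String outputPrefix,
                                 double minSup, long numTxn) throws Exception {
          Configuration conf = new Configuration();
          conf.set("minSup", Double.toString(minSup));          // read back in the reducer
          conf.set("numTransactions", Long.toString(numTxn));
          conf.set("pass", Integer.toString(pass));             // pass-k mapper loads L_(k-1) in setup()
          Job job = Job.getInstance(conf, "apriori-pass-" + pass);
          job.setJarByClass(DriverSketch.class);
          if (pass == 1) job.setMapperClass(AprioriPass1Mapper.class);
          else           job.setMapperClass(AprioriPassKMapper.class);
          job.setReducerClass(AprioriReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(input));
          FileOutputFormat.setOutputPath(job, new Path(outputPrefix + pass));  // e.g. output1, output2, ...
          job.waitForCompletion(true);
      }
  }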

AprioriPass1Mapper class :
  It reads the dataset line by line and emits <individual item, 1>.
  Each line of the dataset is composed of [line number] [itemset].

  For example, 
  sample.txt
  +---------------+
  |  1 1 3 4 2 5  |  -> line number : 0 Items : 1 3 4 2 5
  |  2 2 3 5      |  -> line number : 1 Items : 2 3 5
  |  3 1 2 3 5    |  -> line number : 2 Items : 1 2 3 5
  |  4 2 5        |  -> line number : 3 Items : 2 5
  + --------------+
  
  In this class, the Key is the line number and txnRecord holds the items.
  You should emit in this form: [individual item] 1
  e.g. [1] 1
       [3] 1
       [4] 1
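
  A minimal sketch of what map() could look like, assuming the usual
  <LongWritable, Text, Text, IntWritable> mapper signature (the template's
  actual generic types and key formatting may differ):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class Pass1MapperSketch extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String[] tokens = value.toString().trim().split("\\s+");
          // tokens[0] is the line number in the dataset file; the rest are items.
          for (int i = 1; i < tokens.length; i++) {
              context.write(new Text("[" + tokens[i] + "]"), ONE);  // emit <[item], 1>
          }
      }
  }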

AprioriPassKMapper class:
  Pre-Map phase
  n-Map phase

  You should emit in this form: [itemset] 1
  e.g. if the itemset length is 2,
      [1, 2] 1
      [2, 3] 1
      [3, 4] 1
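
  The n-Map phase might look roughly like the sketch below. The Trie,
  Transaction, and ItemSet types are the template classes; the Transaction
  constructor and the ItemSet.toString() format are assumptions made for
  this sketch.

  import java.io.IOException;
  import java.util.ArrayList;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class PassKMapperSketch extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private Trie candidateTrie;   // built from C_(k+1) in setup(), as described in section 1

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String[] tokens = value.toString().trim().split("\\s+");
          ArrayList<Integer> items = new ArrayList<>();
          for (int i = 1; i < tokens.length; i++)       // tokens[0] is the line number
              items.add(Integer.parseInt(tokens[i]));
          Transaction txn = new Transaction(items);     // constructor assumed
          ArrayList<ItemSet> matched = new ArrayList<>();
          candidateTrie.findItemSets(matched, txn);     // candidate itemsets in this transaction
          for (ItemSet itemset : matched)
              context.write(new Text(itemset.toString()), ONE);  // emit <itemset, 1>
      }
  }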

AprioriReducer class
  Reduce phase
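
  A hedged sketch of reduce(): sum the counts and keep the itemset only if it
  meets minsup. Reading minSup and the transaction count from the job
  configuration is an assumption about the template (the key names here match
  the driver sketch above).

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class ReducerSketch extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text itemset, Iterable<IntWritable> counts, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          double minSup = context.getConfiguration().getDouble("minSup", 0.5);     // key name assumed
          long numTxn = context.getConfiguration().getLong("numTransactions", 1);  // key name assumed
          if (sum >= minSup * numTxn) {
              context.write(itemset, new IntWritable(sum));  // itemset enters L_(k+1)
          }
      }
  }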

Transaction class:
  It represents a transaction as a list of Integers.

AprioriUtils class:
  It contains utility methods for the Apriori algorithm.

TrieNode class
Trie class
  Refer to assignment3.

  You should use the classes that you implemented there and change some method parameters:

  findItemSets(ArrayList<ItemSet> matchedItemSet, ArrayList<Integer> transaction)
  -> findItemSets(ArrayList<ItemSet> matchedItemSet, Transaction transaction)

ItemSet class
  Refer to assignment3,
  but do not use the previous file, because the previous ItemSet class has some errors.

********
3. What you need to implement
********

In the template code, you will find comments with COMPLETE (all in capital).
You are required to implement the necessary code in those locations.

Here we list all the functions you need to implement:

 1) AprioriPass1Mapper class
   - map()

 2) AprioriPassKMapper class
   - setup()
   - map()

 3) AprioriReducer class
   - reduce()

 4) Trie class
 - Use your assignment3 files.

 5) AprioriUtils class
 - getCandidateItemSets(..)
 - generateItemSetMap(..)
 - getSubSets(..)
   (a sketch of the subset generation and pruning follows below)
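
 As one possible reading of getSubSets(..) and the pruning it supports (step 2
 of the Pre-Map phase), here is a hedged sketch; the template's real signatures
 may differ. Subsets of size k are enumerated by omitting each item of a
 (k+1)-candidate once.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Set;

  public class PruneSketch {
      // All subsets one item shorter than the candidate, by omitting each item once.
      public static List<List<Integer>> getSubSets(List<Integer> candidate) {
          List<List<Integer>> subsets = new ArrayList<>();
          for (int skip = 0; skip < candidate.size(); skip++) {
              List<Integer> subset = new ArrayList<>(candidate);
              subset.remove(skip);          // remove by index
              subsets.add(subset);
          }
          return subsets;
      }

      // Keep a candidate only if every k-subset is frequent; as described in
      // section 1, the membership test uses a set of hash codes of L_k's itemsets.
      public static boolean survivesPruning(List<Integer> candidate, Set<Integer> lkHashes) {
          for (List<Integer> subset : getSubSets(candidate)) {
              if (!lkHashes.contains(subset.hashCode())) return false;
          }
          return true;
      }
  }

 Here lkHashes would be built once (e.g. in setup()) by hashing each itemset in
 L_k; the hash-code test is approximate, exactly as noted in section 1.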

********
4. Building the Project
********
In ferrari,
First, you should set the environment variables.
       Open the file with vi ~/.bashrc.

       At the bottom of the .bashrc, type exactly as shown below:
+-------------------------------------------------------------------+
  .bashrc file
  
  export JAVA_HOME=/usr/lib/jvm/java-8-oracle
  export HADOOP_INSTALL=/scratch/hadoop-2.7.3
  export HADOOP_HOME=$HADOOP_INSTALL
  export PATH=$PATH:$HADOOP_INSTALL/bin
  export PATH=$PATH:$HADOOP_INSTALL/sbin
  export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
  export HADOOP_COMMON_HOME=$HADOOP_INSTALL
  export HADOOP_HDFS_HOME=$HADOOP_INSTALL
  export HADOOP_YARN_HOME=$HADOOP_INSTALL
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
  export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
  export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop/
+-------------------------------------------------------------------+

Second, after completing the implementation, execute run.sh (./run.sh).
    You can ignore messages like these:
      Note: ./src/apriori/AprioriDriver.java uses or overrides a deprecated API.
      Note: Recompile with -Xlint:deprecation for details.

      adding: apriori/(in = 0) (out= 0)(stored 0%)
      adding: apriori/AprioriPass1Mapper.class(in = 2060) (out= 914)(deflated 55%)
      adding: apriori/AprioriReducer.class(in = 2320) (out= 1013)(deflated 56%)
      adding: apriori/AprioriPassKMapper.class(in = 4613) (out= 2128)(deflated 53%)
      adding: apriori/AprioriDriver.class(in = 3351) (out= 1734)(deflated 48%)
      adding: utils/(in = 0) (out= 0)(stored 0%)
      adding: utils/AprioriUtils.class(in = 3058) (out= 1666)(deflated 45%)
      adding: trie/(in = 0) (out= 0)(stored 0%)
      adding: trie/TrieNode.class(in = 1179) (out= 670)(deflated 43%)
      adding: trie/Trie.class(in = 1795) (out= 961)(deflated 46%)
      adding: list/(in = 0) (out= 0)(stored 0%)
      adding: list/ItemSet.class(in = 1300) (out= 748)(deflated 42%)
      adding: list/Transaction.class(in = 957) (out= 565)(deflated 40%)

Third, enter the following command: 
          hdfs dfs -put dataset/[dataset-file]
          (you can check if the file is stored in hdfs with the following command: hdfs dfs -ls)

Fourth, enter the following command:
          hadoop jar HadoopBasedApriori.jar apriori.AprioriDriver [dataset file] [output file] [numPass] [minSup] [number of transactions]
          (Here, [output file] is the prefix of the output directory names)

        After running the above command, you can see the results (hdfs://[output file][passnum]/part-r-00000):
          hdfs dfs -cat output[passnum]/part-r-00000

        You can also copy files to the local file system
          hdfs dfs -get output[passnum]

Note: If you reuse the same output filename, you must remove the old output directories before starting the MapReduce job:
		  hdfs dfs -rmr [output directory]
        
HDFS shell reference : https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html#get

********
5. DataSet
********
1. sample 
   number of transactions : 6
2. mushroom.dat
   number of transactions : 8124
3. retail.dat
   number of transactions : 88162
4. connect.dat
   number of transactions : 67557

********
6. Usage Examples
********
hdfs dfs -put dataSet/sample
hadoop jar HadoopBasedApriori.jar apriori.AprioriDriver sample output 3 0.5 6

minsup = 50%

output1/part-r-00000

[1]	4
[2]	6
[3]	4
[5]	5

output2/part-r-00000

[1, 2]	4
[1, 3]	3
[1, 5]	3
[2, 3]	4
[2, 5]	5
[3, 5]	3

output3/part-r-00000

[1, 2, 3]	3
[1, 2, 5]	3
[2, 3, 5]	3

hdfs dfs -put dataSet/mushroom.dat
hadoop jar HadoopBasedApriori.jar apriori.AprioriDriver mushroom.dat output 5 0.5 8124 

.
.

output5/part-r-00000

[24, 34, 85, 86, 90]	4208
[34, 36, 39, 85, 86]	4346
[34, 36, 59, 85, 86]	4232
[34, 36, 85, 86, 90]	6272
[34, 39, 85, 86, 90]	4784
[34, 53, 85, 86, 90]	4608
[34, 59, 85, 86, 90]	4544
[34, 63, 85, 86, 90]	4304

********
7. Requirements
********

- You need to fill in the blanks in the template code.
  (Search the string 'COMPLETE' which marks the places where you need to implement your code) 

- You must write a report describing your implementation. The report needs to be in Word or PDF format.

- You must name your submission files as follows: 

   hw4_studentNumber.tar, hw4_studentNumber.pdf (or .docx)

   and submit both files (hw4_studentNumber.tar, hw4_studentNumber.pdf)

- You must fully describe the code you have implemented.

- You must test how much the solution space is reduced by pruning.

- You must think about why, in the self-joining step, a new (k+1)-itemset is generated only from k-itemsets that agree on the first (k-1) consecutive items.

*********
8. Submission
*********
You must submit the report for this assignment as well as your source code.

The report must be in PDF/Word format and include the following :

- Describe the code you have implemented.

- Test how much the solution space is reduced by pruning.

- Describe why, in the self-joining step, we generate a new (k+1)-itemset only from k-itemsets that agree on the first (k-1) consecutive items.
