Random-Forest-Algorithm-using-Hadoop

Action rules, a modification of classification rules, are one of the modern approaches for discovering knowledge in databases. They allow us to discover actionable knowledge from large datasets. Whereas classification rules are tailored to predict an object's class, action rules extracted from an information system produce knowledge in the form of suggestions for how an object can move from one class to another, more desirable class. Over the years, computer storage has grown and the internet has become faster. As a result, digital data is spread widely around the world and is growing to a size where collecting and analyzing it requires more time and space than a single computer can provide. Producing action rules from massive distributed data requires a distributed action rule processing algorithm that can process the datasets on all systems in one or more clusters simultaneously and combine the results efficiently to induce a single set of action rules. There has been little research on action rule discovery in distributed environments, which presents a challenge. In this paper, we propose a new algorithm called MR-Random Forest Algorithm to extract action rules in a distributed processing environment.

In our proposed implementation using the Hadoop MapReduce framework, the LERS, AR, and AAR algorithms run in parallel in distinct threads as two separate jobs: LERS and AR in Job1, and AAR in Job2. Each job has its own Map and Reduce parts. All three algorithms are implemented in the Map part. Hadoop splits the data and hands the splits to several Map parts (Mappers). The resulting action rules from all the Mappers are combined so that each action rule acts as a key, and the supports and confidences reported for it by the Mappers form an iterable list of values. The combined action rules are given to the Reduce part, where we propose using a Random Forest type of algorithm to combine the output from all the Mappers. The Random Forest algorithm works by analogy to voting, where a vote is accepted if more than 50% of the parties agree. In our proposed implementation, the Random Forest algorithm checks the output from all the Mappers and keeps an action rule as a candidate only if more than 50% of the Mappers generated it. For each such rule, it averages the supports and confidences reported by those Mappers, then checks the averaged support and confidence against the minimum support and confidence thresholds specified by the user. If both thresholds are met, the action rule is retained, included in the final list of action rules produced as output from this system, and presented to the user. The proposed MR-Random Forest Algorithm is thus implemented in the Reduce part of MapReduce.
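The voting and averaging step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the repository's Hadoop code: the function name `combine_action_rules`, the rule labels, and the per-Mapper dictionary layout are all hypothetical, and "thresholds are met" is assumed to mean greater than or equal to.

```python
from collections import defaultdict


def combine_action_rules(mapper_outputs, min_support, min_confidence):
    """Sketch of the reduce-side vote (hypothetical helper, not Hadoop API).

    mapper_outputs: one dict per Mapper, mapping an action rule (the key)
    to the (support, confidence) pair that Mapper computed for it.
    """
    num_mappers = len(mapper_outputs)

    # Emulate the shuffle: group the (support, confidence) values by rule key.
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for rule, (support, confidence) in output.items():
            grouped[rule].append((support, confidence))

    final_rules = {}
    for rule, values in grouped.items():
        # Voting: skip rules not generated by more than 50% of the Mappers.
        if len(values) * 2 <= num_mappers:
            continue

        # Average support and confidence over the Mappers that found the rule.
        avg_support = sum(s for s, _ in values) / len(values)
        avg_confidence = sum(c for _, c in values) / len(values)

        # Assumption: a threshold is "met" when the average is >= the minimum.
        if avg_support >= min_support and avg_confidence >= min_confidence:
            final_rules[rule] = (avg_support, avg_confidence)

    return final_rules


if __name__ == "__main__":
    # Three hypothetical Mappers; rule labels are placeholders.
    outputs = [
        {"rule_A": (4, 0.5), "rule_B": (10, 0.9)},
        {"rule_A": (6, 0.75), "rule_C": (1, 0.25)},
        {"rule_C": (1, 0.25)},
    ]
    print(combine_action_rules(outputs, min_support=2, min_confidence=0.5))
```

In this example `rule_B` is dropped because only one of three Mappers produced it, and `rule_C` wins the vote but fails the thresholds after averaging, so only `rule_A` survives. In an actual Hadoop job the grouping loop would be done by the framework's shuffle phase, and the body of the second loop would sit inside the Reducer.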
