themistoklik / sparesort

Java library for sequential pattern mining

Sparesort

Sparesort is a Java library for sequential pattern mining, a data mining problem concerned with analyzing sequential or time-related processes such as customer purchases or web access patterns. The library implements BIDE, an efficient algorithm for mining frequent closed sequences. This is my own fork, with I/O tuned for the needs of my thesis.

System Requirements

  • Java Runtime Environment (JRE) 1.5 or higher

Usage

From console

The distribution contains an executable JAR (sparesort-X.Y.Z.jar).

java -jar sparesort-X.Y.Z.jar <directory_holding_sequence_files> <minsup>

When mining finishes, all frequent closed sequences that meet the minsup threshold are written, together with their frequencies, to a file rather than to stdout, because console output becomes cumbersome for large tasks.

Within your code

  1. Create an instance of jp.ac.titech.cs.se.sparesort.SequenceDatabase<T> which supports addSequence(List<T>) and addSequence(T...) methods to load item sequences.

     // Create and load sequence database
     SequenceDatabase<String> sdb = new SequenceDatabase<String>()
     	.addSequence("C", "A", "A", "B", "C")
     	.addSequence("A", "B", "C", "B")
     	.addSequence("C", "A", "B", "C")
     	.addSequence("A", "B", "B", "C", "A");
    
  2. (Optional) Set an alternative mining strategy. When no strategy is given, jp.ac.titech.cs.se.sparesort.bide.RecursiveBIDE<T> is used as a default strategy to perform depth-first mining based on recursive calls.

     sdb.setMiningStrategy(new RecursiveBIDE<String>());
    
  3. Call the mineFrequentClosedSequences(int) method with an absolute minimum support threshold as its argument. The return value is a map from each frequent closed sequence found in the sequence database to its frequency.

     Map<List<String>, Integer> result = sdb.mineFrequentClosedSequences(2);
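
To make the meaning of the returned map concrete, the following is a minimal brute-force sketch of what "frequent closed sequence" means: a sequence contained in at least minsup database sequences that has no proper super-sequence with the same support. It is not Sparesort's BIDE implementation and does not use the library at all; the class ClosedSequenceSketch and all of its method names are hypothetical, chosen only for illustration. It assumes Java 7 or newer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: a naive reference, NOT the BIDE algorithm used by Sparesort.
final class ClosedSequenceSketch {

    /** True if pattern occurs in seq as a (not necessarily contiguous) subsequence. */
    static <T> boolean isSubsequence(List<T> pattern, List<T> seq) {
        int matched = 0;
        for (T item : seq) {
            if (matched < pattern.size() && pattern.get(matched).equals(item)) {
                matched++;
            }
        }
        return matched == pattern.size();
    }

    /** Number of database sequences that contain pattern. */
    static <T> int support(List<T> pattern, List<List<T>> db) {
        int count = 0;
        for (List<T> seq : db) {
            if (isSubsequence(pattern, seq)) {
                count++;
            }
        }
        return count;
    }

    /** Depth-first growth of frequent patterns, appending one item at a time. */
    static <T> void grow(List<T> prefix, Set<T> alphabet, List<List<T>> db,
                         int minSup, Map<List<T>, Integer> out) {
        for (T item : alphabet) {
            List<T> candidate = new ArrayList<T>(prefix);
            candidate.add(item);
            int sup = support(candidate, db);
            if (sup >= minSup) { // an infrequent pattern has no frequent extension
                out.put(candidate, sup);
                grow(candidate, alphabet, db, minSup, out);
            }
        }
    }

    /** Frequent sequences that have no proper super-sequence with equal support. */
    static <T> Map<List<T>, Integer> mineClosed(List<List<T>> db, int minSup) {
        if (minSup < 1) {
            throw new IllegalArgumentException("minSup must be at least 1");
        }
        Set<T> alphabet = new LinkedHashSet<T>();
        for (List<T> seq : db) {
            alphabet.addAll(seq);
        }
        Map<List<T>, Integer> frequent = new LinkedHashMap<List<T>, Integer>();
        grow(new ArrayList<T>(), alphabet, db, minSup, frequent);

        Map<List<T>, Integer> closed = new LinkedHashMap<List<T>, Integer>();
        for (Map.Entry<List<T>, Integer> p : frequent.entrySet()) {
            boolean absorbed = false;
            for (Map.Entry<List<T>, Integer> q : frequent.entrySet()) {
                if (q.getKey().size() > p.getKey().size()
                        && q.getValue().equals(p.getValue())
                        && isSubsequence(p.getKey(), q.getKey())) {
                    absorbed = true; // a longer pattern carries the same support
                    break;
                }
            }
            if (!absorbed) {
                closed.put(p.getKey(), p.getValue());
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        // Same four sequences as in the SequenceDatabase example.
        List<List<String>> db = Arrays.asList(
            Arrays.asList("C", "A", "A", "B", "C"),
            Arrays.asList("A", "B", "C", "B"),
            Arrays.asList("C", "A", "B", "C"),
            Arrays.asList("A", "B", "B", "C", "A"));
        for (Map.Entry<List<String>, Integer> e : mineClosed(db, 2).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

In this sketch, [A, B, C] is contained in all four example sequences and is therefore reported with support 4, while its sub-patterns such as [A, B] have the same support and are dropped as non-closed. The enumeration re-scans the whole database for every candidate, so it is only suitable for toy inputs; avoiding that cost is the purpose of BIDE.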
    

Compile and build

Sparesort is built with Apache Maven. Before you proceed, download Maven and configure the environment variable M2_HOME properly.

  1. Fork (or download) the source code.
  2. Open your terminal or command prompt, move to the directory containing pom.xml, and run mvn package there.
  3. After the Maven build succeeds, you will find sparesort-X.Y.Z.jar in the target/ subdirectory.

References

The BIDE algorithm was presented by Wang and Han in the following paper:

  • Jianyong Wang, Jiawei Han: “BIDE: Efficient mining of frequent closed sequences”. In Proceedings of the 20th International Conference on Data Engineering (ICDE 2004), 30 March - 2 April 2004, Boston, USA, pp. 79–90. doi:10.1109/ICDE.2004.1319986

Sparesort is intended to support researchers in software and data engineering. If you use this library in your research and publications, please cite the following article as its source:

  • Hiroshi Kazato, Shinpei Hayashi, Tsuyoshi Oshima, Shunsuke Miyata, Takashi Hoshino, Motoshi Saeki: “Extracting and Visualizing Implementation Structure of Features”. In Proceedings of the 20th Asia-Pacific Software Engineering Conference (APSEC 2013), December 2013, Bangkok, Thailand.

Copyright and license

Copyright (c) 2010-2012 Saeki Lab. at Tokyo Institute of Technology.
Sparesort is open source software, licensed under the Apache License, Version 2.0.
