NoNeil / discomll

Disco Machine Learning Library

discomll

Disco Machine Learning Library (discomll) is a Python package for machine learning with the MapReduce paradigm. It works with the Disco framework for distributed computing. discomll is suited to the analysis of large datasets and offers classification, regression, and clustering algorithms.

Algorithms

Classification algorithms

  • naive Bayes - discrete and continuous features,
  • linear SVM - continuous features, binary target,
  • logistic regression - continuous features, binary target,
  • forest of distributed decision trees - discrete and continuous features,
  • distributed random forest - discrete and continuous features,
  • distributed weighted forest (experimental) - discrete and continuous features,
  • distributed weighted forest rand (experimental) - discrete and continuous features,
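As a rough illustration of how a classifier such as naive Bayes can fit the MapReduce paradigm, the sketch below counts class and feature-value occurrences per data chunk (map) and then sums the partial counts (reduce). It is plain Python for discrete features only and does not use discomll or Disco; the names `map_counts`, `reduce_counts`, and `predict` are hypothetical and not part of the library.

```python
from collections import Counter
from functools import reduce
import math

def map_counts(chunk):
    """Map step: count classes and (feature index, value, class) triples in one chunk."""
    counts = Counter()
    for features, label in chunk:
        counts[("class", label)] += 1
        for i, value in enumerate(features):
            counts[("feat", i, value, label)] += 1
    return counts

def reduce_counts(a, b):
    """Reduce step: merge the partial counts of two chunks."""
    return a + b

def predict(counts, features, alpha=1.0):
    """Return the class with the highest Laplace-smoothed log-probability."""
    labels = [k[1] for k in counts if k[0] == "class"]
    total = sum(counts[("class", c)] for c in labels)
    best, best_score = None, -math.inf
    for c in labels:
        score = math.log(counts[("class", c)] / total)
        for i, value in enumerate(features):
            # number of distinct values seen for feature i, used for smoothing
            n_values = len({k[2] for k in counts if k[0] == "feat" and k[1] == i})
            score += math.log((counts[("feat", i, value, c)] + alpha)
                              / (counts[("class", c)] + alpha * n_values))
        if score > best_score:
            best, best_score = c, score
    return best

# Two chunks stand in for data stored on different worker nodes (toy data).
chunks = [
    [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")],
    [(("sunny", "mild"), "no"), (("rainy", "hot"), "yes")],
]
model = reduce(reduce_counts, map(map_counts, chunks))
```

Because the counts are additive, each chunk can be processed independently and the partial results merged in any order.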

Clustering algorithms

  • k-means - continuous features,
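One k-means iteration can be expressed as a map step that assigns points to their nearest center and a reduce step that averages the assigned points. The following is a minimal sketch in plain Python under that assumption; it is not discomll's implementation, and `nearest`, `map_assign`, and `reduce_update` are hypothetical names.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def map_assign(chunk, centers):
    """Map step: per chunk, build cluster index -> (coordinate sums, point count)."""
    partial = {}
    for point in chunk:
        j = nearest(point, centers)
        sums, n = partial.get(j, ([0.0] * len(point), 0))
        partial[j] = ([s + p for s, p in zip(sums, point)], n + 1)
    return partial

def reduce_update(partials, centers):
    """Reduce step: merge the partial sums and compute the new centers."""
    merged = {}
    for partial in partials:
        for j, (sums, n) in partial.items():
            acc, m = merged.get(j, ([0.0] * len(sums), 0))
            merged[j] = ([a + s for a, s in zip(acc, sums)], m + n)
    # a center with no assigned points keeps its previous position
    return [
        [s / merged[j][1] for s in merged[j][0]] if j in merged else list(centers[j])
        for j in range(len(centers))
    ]

# Toy data: two chunks and two initial centers.
chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [(0.0, 0.0), (10.0, 10.0)]
centers = reduce_update([map_assign(c, centers) for c in chunks], centers)
```

Repeating the map and reduce steps until the centers stop moving gives the usual k-means loop.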

Regression algorithms

  • linear regression - continuous features, continuous target,
  • locally weighted linear regression - continuous features, continuous target,

Utilities

  • evaluation of the accuracy,
  • distribution views,
  • model views.
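Accuracy can be evaluated in the same distributed style, since it only needs the number of correct predictions and the number of samples. A minimal sketch in plain Python follows; `map_correct` and `reduce_accuracy` are hypothetical names, not discomll's utilities.

```python
def map_correct(chunk):
    """Map step: (correct predictions, samples) for one chunk of (predicted, actual) pairs."""
    correct = sum(1 for predicted, actual in chunk if predicted == actual)
    return correct, len(chunk)

def reduce_accuracy(partials):
    """Reduce step: combine per-chunk counts into an overall accuracy."""
    correct = sum(c for c, _ in partials)
    total = sum(n for _, n in partials)
    return correct / total if total else 0.0

# Toy predictions split into two chunks.
chunks = [[("a", "a"), ("b", "a")], [("b", "b"), ("a", "a")]]
accuracy = reduce_accuracy([map_correct(c) for c in chunks])
```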

Features of discomll

discomll works with the following data sources:

  • datasets on the Disco Distributed File System,
  • text or gzipped datasets accessible via a file server.
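To illustrate how plain-text and gzip-compressed datasets can be read through one code path, the sketch below picks the opener from the file extension. It reads a local path with the Python standard library only; fetching from a file server and the helper name `read_lines` are assumptions for illustration and do not reflect discomll's own loader.

```python
import gzip

def read_lines(path):
    """Yield text lines from a plain or gzip-compressed file, chosen by extension."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Because the function is a generator, a large dataset is streamed line by line instead of being loaded into memory at once.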

discomll supports several settings for a dataset:

  • multiple data sources,
  • feature selection,
  • feature type specification,
  • parsing of data,
  • handling of missing values.
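The settings above can be pictured with a small parsing routine that selects features by index, distinguishes continuous from discrete values, and marks missing values. This is a sketch in plain Python under assumed conventions: the function name `parse_line`, its parameters, and the missing-value markers are hypothetical and are not discomll's API.

```python
def parse_line(line, feature_indices, target_index, delimiter=",",
               missing=("", "?", "NA")):
    """Split one text line into (features, target); missing values become None."""
    fields = [f.strip() for f in line.rstrip("\n").split(delimiter)]

    def convert(value):
        if value in missing:
            return None              # missing value marker
        try:
            return float(value)      # continuous feature
        except ValueError:
            return value             # discrete feature kept as a string

    # feature selection: only the requested columns are kept
    features = [convert(fields[i]) for i in feature_indices]
    return features, fields[target_index]

features, target = parse_line("5.1,?,red,yes\n", feature_indices=[0, 1, 2], target_index=3)
```

Here `features` holds one continuous value, one missing value, and one discrete value, and `target` holds the class label.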

Installing

Prerequisites

  • Disco 0.5.4,
  • numpy should be installed on all worker nodes,
  • orange and scikit-learn are used in unit tests.

pip install discomll

Performance analysis

In the performance analysis, we compare the speed and accuracy of discomll algorithms with scikit-learn and KNIME. We measure the speedups of discomll algorithms with 1, 3, 6 and 9 Disco workers.

Performance analysis 2

In the second performance analysis, we compare the accuracy of distributed ensemble algorithms with scikit-learn algorithms. We train the distributed algorithms on the whole dataset and the single-core algorithms on a subset. We show that the distributed ensembles achieve accuracy similar to that of the single-core algorithms.

Try it now

You can try discomll algorithms on the ClowdFlows platform. ClowdFlows is an open-source, cloud-based platform for composition, execution, and sharing of interactive machine learning and data mining workflows. For instructions, see the User Guide.

Release notes

version 0.1.4.2 (Released 18/oct/2015)

  • model view bug fixes for ensembles,
  • missing-value support for ensembles.

version 0.1.4.1 (Released 17/oct/2015)

  • model view fixed for ensembles,
  • bug fixes in examples and tests.

version 0.1.4 (Released 11/oct/2015)

  • added distributed weighted forest rand, which is similar to distributed weighted forest but uses randomly selected medoids,
  • improvements of algorithms, especially ensembles,
  • performance analysis 2.

License

Apache License 2.0