Latest Release | Build Status | Coverage | Code Quality | License | Chat |
---|---|---|---|---|---|
doddle-model is an in-memory machine learning library that can be summed up with three main characteristics:
- it is built on top of Breeze
- it provides immutable estimators that are a doddle to use in parallel code
- it exposes its functionality through a scikit-learn-like API [2] in idiomatic Scala using typeclasses
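To make the second and third points concrete, here is a minimal sketch of an immutable estimator exposed through a typeclass. All names in it (`MeanRegressor`, `Predictor`, `PredictorOps`, `TypeclassSketch`) are hypothetical and chosen for illustration; this is not doddle-model's actual API, which works on Breeze vectors and matrices rather than plain `Seq`s.

```scala
// Hypothetical names throughout: an illustration of the pattern, not doddle-model's API.
object TypeclassSketch {

  // An immutable model: fitting returns a new value instead of mutating state.
  final case class MeanRegressor(mean: Option[Double] = None)

  // The typeclass states what a model type A must support to act as a predictor.
  trait Predictor[A] {
    def fit(model: A, x: Seq[Seq[Double]], y: Seq[Double]): A
    def predict(model: A, x: Seq[Seq[Double]]): Seq[Double]
  }

  object Predictor {
    // Typeclass instance: a baseline that always predicts the mean training target.
    implicit val meanRegressorPredictor: Predictor[MeanRegressor] =
      new Predictor[MeanRegressor] {
        def fit(model: MeanRegressor, x: Seq[Seq[Double]], y: Seq[Double]): MeanRegressor = {
          require(y.nonEmpty, "at least one training target is required")
          model.copy(mean = Some(y.sum / y.length))
        }

        def predict(model: MeanRegressor, x: Seq[Seq[Double]]): Seq[Double] = {
          require(model.mean.isDefined, "the model must be fitted before predicting")
          x.map(_ => model.mean.get)
        }
      }
  }

  // Syntax enrichment so that calls read like scikit-learn: model.fit(x, y).predict(x).
  implicit class PredictorOps[A](model: A) {
    def fit(x: Seq[Seq[Double]], y: Seq[Double])(implicit ev: Predictor[A]): A =
      ev.fit(model, x, y)

    def predict(x: Seq[Seq[Double]])(implicit ev: Predictor[A]): Seq[Double] =
      ev.predict(model, x)
  }
}
```

With `import TypeclassSketch._` in scope, `MeanRegressor().fit(x, y)` returns a new fitted value and leaves the original untouched. Because nothing is mutated, the same estimator value can be shared between threads without locking.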
doddle-model is in an early stage of development. Contributions of any kind are much appreciated.
Installation
Add the dependency to your SBT project definition:
libraryDependencies ++= Seq(
  "io.github.picnicml" %% "doddle-model" % "<latest_version>",
  // optional: add to use native libraries for a significant performance boost
  "org.scalanlp" %% "breeze-natives" % "1.0"
)
Note that the latest version is displayed in the Latest Release badge above and that the v prefix should be removed from the SBT definition.
Getting Started
For a complete list of code examples, see doddle-model-examples. For an example of how to serve a trained doddle-model
in a pipeline implemented with Apache Beam, see doddle-beam-example.
Performance
doddle-model is developed with performance in mind. For benchmarks, see the doddle-benchmark repository.
1. Native Linear Algebra Libraries
Breeze uses netlib-java to access hardware-optimised linear algebra libraries (note that the breeze-natives
dependency must be added to the SBT project definition). TL;DR: seeing a log line like
INFO: successfully loaded /var/folders/9h/w52f2svd3jb750h890q1x4j80000gn/T/jniloader3358656786070405996netlib-native_system-osx-x86_64.jnilib
means that native BLAS/LAPACK/ARPACK implementations are in use. For more information, see the Breeze documentation.
2. Memory
If you encounter java.lang.OutOfMemoryError: Java heap space, increase the heap size with the -Xms (initial heap size) and -Xmx (maximum heap size) JVM options. For example, use -Xms8192m -Xmx8192m for an initial and maximum heap size of 8 GB. Note that the maximum heap limit for a 32-bit JVM is 4 GB (at least in theory), so make sure to use a 64-bit JVM if more memory is needed. If the error still occurs and you are using hyperparameter search or cross validation, see the next section.
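To confirm that the heap options were picked up, the limits can be read back from inside the running JVM. The sketch below uses only the standard `java.lang.Runtime` API; the object name `HeapInfo` and its methods are made up for this example and are not part of doddle-model.

```scala
// HeapInfo is a hypothetical helper, not part of doddle-model.
object HeapInfo {
  private val BytesPerMb = 1024L * 1024L

  // Maximum heap the JVM will attempt to use, in megabytes (governed by -Xmx).
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / BytesPerMb

  // Heap currently reserved by the JVM, in megabytes (starts near -Xms, can grow up to -Xmx).
  def committedHeapMb: Long = Runtime.getRuntime.totalMemory / BytesPerMb

  def main(args: Array[String]): Unit =
    println(s"max heap: $maxHeapMb MB, committed heap: $committedHeapMb MB")
}
```

The reported maximum is often slightly below the value passed to -Xmx, since some collectors exclude part of the heap from this figure.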
3. Parallelism
To limit the number of threads running at one time (and thus memory consumption) during cross validation and hyperparameter search, a FixedThreadPool executor is used. By default, the maximum number of threads is set to the number of cores on the system. Set the -DmaxNumThreads JVM property to change that, e.g. use -DmaxNumThreads=16 to allow 16 threads.
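The general pattern of bounding concurrency with a fixed-size pool whose size comes from a JVM property can be sketched as follows. Only the property name `maxNumThreads` is taken from this README; the object `BoundedParallelism`, its methods, the fallback rules, and the one-minute timeout are assumptions made for illustration and do not necessarily match how doddle-model reads the property internally.

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.Try

// BoundedParallelism is a hypothetical helper, not doddle-model's implementation.
object BoundedParallelism {

  // Reads -D<propertyName>=<n>; falls back to the number of available cores
  // when the property is unset or is not a positive integer.
  def maxNumThreads(propertyName: String = "maxNumThreads"): Int =
    sys.props.get(propertyName)
      .flatMap(value => Try(value.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(Runtime.getRuntime.availableProcessors)

  // Runs all tasks on a fixed-size pool, so at most numThreads of them execute concurrently.
  // Results are returned in the same order as the tasks.
  def runAll[A](tasks: Seq[() => A], numThreads: Int = maxNumThreads()): Seq[A] = {
    val pool: ExecutorService = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try Await.result(Future.sequence(tasks.map(task => Future(task()))), 1.minute)
    finally pool.shutdown()
  }
}
```

Because each concurrently running fold or hyperparameter candidate holds its own working data in memory, lowering the pool size trades wall-clock time for a smaller peak heap footprint.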
Development
Run the tests with sbt test. The code style roughly follows the PayPal Scala Style and the Databricks Scala Guide. Note that a maximum line length of 120 characters is used.
For a list of typeclasses that together define the estimator API see the typeclasses directory.
Resources
- [1] Pattern Recognition and Machine Learning, Christopher Bishop
- [2] API design for machine learning software: experiences from the scikit-learn project, L. Buitinck et al.
- [3] UCI Machine Learning Repository, D. Dua and E. Karra Taniskidou, University of California, Irvine, School of Information and Computer Science