markostam / ann4s

Approximate Nearest Neighbors for Scala and Apache Spark

Ann4s

A Scala implementation of Annoy, which searches for the nearest neighbors of a given query point.

Ann4s also provides a DataFrame-based API for Apache Spark.

Scala code example

import ann4s._

object AnnoyExample {

  def main(args: Array[String]): Unit = {
    val f = 40 // length of each item vector that will be indexed
    val metric: Metric = Angular // or Euclidean
    val t = new AnnoyIndex(f, metric)
    (0 until 1000) foreach { i =>
      val v = Array.fill(f)(scala.util.Random.nextGaussian().toFloat)
      t.addItem(i, v)
    }
    t.build(10)

    // before `save`, t.getNnsByItem(0, 1000) runs on a HeapByteBuffer (in memory)

    t.save("test.ann") // `test.ann` is compatible with the native Annoy

    // after `save`, t.getNnsByItem(0, 1000) runs on a MappedFile (file-based)

    println(t.getNnsByItem(0, 1000).mkString(",")) // finds the 1000 nearest neighbors of item 0
  }

}
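
Because the index is approximate, it can help to compare its answers against an exact search on a small sample. The following is a minimal sketch of such a baseline and is not part of ann4s: `ExactNeighbors`, `exactNns`, and `recall` are hypothetical names introduced here for illustration. The angular distance follows Annoy's definition, sqrt(2 * (1 - cos(u, v))), and the sketch assumes you keep the original vectors in memory.

```scala
object ExactNeighbors {

  // Angular distance as Annoy defines it: sqrt(2 * (1 - cos(u, v))).
  def angular(u: Array[Float], v: Array[Float]): Double = {
    var dot = 0.0
    var uu = 0.0
    var vv = 0.0
    var i = 0
    while (i < u.length) {
      dot += u(i).toDouble * v(i)
      uu += u(i).toDouble * u(i)
      vv += v(i).toDouble * v(i)
      i += 1
    }
    val norms = math.sqrt(uu * vv)
    // A zero vector has no direction; treat its cosine as 0.
    if (norms == 0.0) math.sqrt(2.0)
    else math.sqrt(math.max(0.0, 2.0 - 2.0 * dot / norms))
  }

  // Plain Euclidean (L2) distance.
  def euclidean(u: Array[Float], v: Array[Float]): Double = {
    var sum = 0.0
    var i = 0
    while (i < u.length) {
      val d = u(i).toDouble - v(i)
      sum += d * d
      i += 1
    }
    math.sqrt(sum)
  }

  // Exact k nearest neighbors of items(query) by brute force, O(n log n) per query.
  // The query item itself is included, as in the example above.
  def exactNns(
      items: IndexedSeq[Array[Float]],
      query: Int,
      k: Int,
      dist: (Array[Float], Array[Float]) => Double): Seq[Int] =
    items.indices.sortBy(i => dist(items(query), items(i))).take(k)

  // Fraction of the exact neighbors that the approximate result also found.
  def recall(approx: Seq[Int], exact: Seq[Int]): Double =
    if (exact.isEmpty) 1.0
    else approx.toSet.intersect(exact.toSet).size.toDouble / exact.size
}
```

To use it with the example above, keep the generated vectors in a collection, compute `ExactNeighbors.exactNns(vectors, 0, 10, ExactNeighbors.angular)`, and pass the ids returned by `t.getNnsByItem(0, 10)` to `recall`. Building with more trees generally raises recall, at the cost of a larger index.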

Spark code example (with DataFrame-based API)

Item similarity computation

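import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.DataFrame
// Annoy and AnnoyModel are provided by ann4s

val dataset: DataFrame = ??? // your dataset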

val alsModel: ALSModel = new ALS()
  .fit(dataset)

val annoyModel: AnnoyModel = new Annoy()
  .setDimension(alsModel.rank)
  .fit(alsModel.itemFactors)

val result: DataFrame = annoyModel
  .setK(10) // find 10 neighbors
  .transform(alsModel.itemFactors)

result.show()

result.show() prints:

+---+--------+-----------+
| id|neighbor|   distance|
+---+--------+-----------+
|  0|       0|        0.0|
|  0|      50|0.014339785|
...
|  1|       1|        0.0|
|  1|      36|0.011467933|
...
+---+--------+-----------+
  • For more information on ALS, see this link
  • A working example is at 'src/test/scala/ann4s/spark/AnnoySparkSpec.scala'
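
In the output above, each item appears as its own nearest neighbor at distance 0.0. If you only want the other items, the self-matches can be filtered out with the standard DataFrame API. This is a minimal sketch, not part of ann4s: `NeighborPostProcessing` and `dropSelfMatches` are hypothetical names, and the sketch assumes the result keeps the `id` and `neighbor` column names shown above.

```scala
import org.apache.spark.sql.DataFrame

object NeighborPostProcessing {

  // Keeps only rows where the neighbor is a different item than the query item.
  // Assumes the columns are named `id` and `neighbor`, as in the output above.
  def dropSelfMatches(result: DataFrame): DataFrame =
    result.filter("id <> neighbor")
}
```

Since the self-match takes one of the K slots, calling setK(11) instead of setK(10) is one way to end up with 10 other items per id after filtering.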

Installation

resolvers += Resolver.bintrayRepo("mskimm", "maven")

libraryDependencies += "com.github.mskimm" %% "ann4s" % "0.0.6"
  • 0.0.6 is built with Apache Spark 1.6.2

License: Apache License 2.0