alexklibisz / elastiknn

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.

Home Page:https://alexklibisz.github.io/elastiknn

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Elastiknn plugin size (possibly change JSON library to reduce dependency sizes)

fabriziofortino opened this issue · comments

Starting from version 7.12.0.0, the plugin size went beyond 20MB (the previous version was 19.1MB).

Most Elastic cloud subscription levels have a hard limit of 20MB for custom plugins (see here).

It would be great if the plugin could stay within this limit.

I had a quick look at the project and I think some heavy dependencies can be removed with little changes. An example would be guava, currently used for base64 encoding and simple caching logic. Replacing this with Java 8 code would remove ~3MB.

What do you think? If this sounds reasonable I can work on this.

Hey @fabriziofortino . I wasn't aware of that limitation. It seems somewhat low given the tendency of JVM projects to bloat.

Anyways, perhaps there is some dependencies that could be removed, but I don't think Guava is one. The Guava base64 implementation was actually a lot faster than the standard JVM implementation the last time that I benchmarked it (late 2019 or early 2020, probably on Java 11). The caching logic is not complicated but does have a max-size eviction rule, and I'm not particularly excited to re-invent a cache. If there are smaller libraries that can serve those purposes well, I'll consider them.

Do you know of a way to list out all of the dependencies in the plugin and their sizes? Perhaps there are some that are still sizeable and would be easier to remove. I'm happy to give feedback if I can see a ranked list.

Or maybe there's some Gradle class-pruning step we could run before release? Also happy to give feedback there.

hi @alexklibisz ,

I see your points about Guava. What about these options:

  • Copy/paste the base64 relevant classes from guava or find an alternative lib that is at least as fast as guava but lighter
  • Use caffeine cache (https://github.com/ben-manes/caffeine). This would save ~2MB

Here is the list of libraries in the package sorted by size desc:

% ls -laSh
total 48016
-rw-r--r--@   1 fortino  staff   5.5M Sep 27 13:11 cats-core_2.12-2.4.2.jar
-rw-r--r--@   1 fortino  staff   5.2M Sep 27 13:11 scala-library-2.12.13.jar
-rw-r--r--@   1 fortino  staff   4.8M Sep 27 13:11 cats-kernel_2.12-2.4.2.jar
-rw-r--r--@   1 fortino  staff   3.1M Sep 27 13:11 shapeless_2.12-2.3.3.jar
-rw-r--r--@   1 fortino  staff   2.6M Sep 27 13:12 guava-28.1-jre.jar
-rw-r--r--@   1 fortino  staff   1.1M Sep 27 13:11 circe-core_2.12-0.14.0-M4.jar
-rw-r--r--@   1 fortino  staff   261K Sep 27 13:14 api4s-7.14.1.2.jar
-rw-r--r--@   1 fortino  staff   196K Sep 27 13:12 checker-qual-2.8.1.jar
-rw-r--r--@   1 fortino  staff   171K Sep 27 13:11 circe-generic_2.12-0.13.0.jar
-rw-r--r--@   1 fortino  staff   164K Sep 27 13:11 circe-generic-extras_2.12-0.13.1-M4.jar
-rw-r--r--@   1 fortino  staff   143K Sep 27 13:15 elastiknn-7.14.1.2.jar
-rw-r--r--@   1 fortino  staff    82K Sep 27 13:11 jawn-parser_2.12-1.1.0.jar
-rw-r--r--@   1 fortino  staff    29K Sep 27 13:15 models-7.14.1.2.jar
-rw-r--r--@   1 fortino  staff    28K Sep 27 13:11 circe-jawn_2.12-0.14.0-M4.jar
-rw-r--r--@   1 fortino  staff    19K Sep 27 13:12 jsr305-3.0.2.jar
-rw-r--r--@   1 fortino  staff    13K Sep 27 13:11 circe-numbers_2.12-0.14.0-M4.jar
-rw-r--r--@   1 fortino  staff    13K Sep 27 13:12 error_prone_annotations-2.3.2.jar
-rw-r--r--@   1 fortino  staff    12K Sep 27 13:14 lucene-7.14.1.2.jar
-rw-r--r--@   1 fortino  staff   8.6K Sep 27 13:12 j2objc-annotations-1.3.jar
-rw-r--r--@   1 fortino  staff   7.0K Sep 27 13:11 simulacrum-scalafix-annotations_2.12-0.5.4.jar
-rw-r--r--@   1 fortino  staff   4.5K Sep 27 13:12 failureaccess-1.0.1.jar
-rw-r--r--@   1 fortino  staff   3.4K Sep 27 13:12 animal-sniffer-annotations-1.18.jar
-rw-r--r--@   1 fortino  staff   3.1K Sep 27 13:11 macro-compat_2.12-1.1.1.jar
-rw-r--r--@   1 fortino  staff   2.8K Sep 27 13:11 circe-parser_2.12-0.14.0-M4.jar
-rw-r--r--@   1 fortino  staff   2.1K Sep 27 13:12 listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
-rw-r--r--@   1 fortino  staff   1.7K Sep 27 13:15 plugin-descriptor.properties
-rw-r--r--@   1 fortino  staff   309B Sep 27 13:10 plugin-security.policy

Correct. Getting rid of guava won't bring the package under 20MB. I was looking at it only because elastiknn uses a very limited set of functions provided by guava.

This is a nice up-to-date benchmark of various json libraries https://plokhotnyuk.github.io/jsoniter-scala/
borer seems to have good performance and zero dependencies (total size ~800Kb).

In general, I would try to shrink the package as much as possible since this is a plugin that gets loaded into every ES node. So the smaller the better.

If you are okay, I could try to replace guava as described in the comment above.

I have replaced guava caches with caffeine in the past for performance reasons. Here is a post that compares the 2 libraries:
https://medium.com/outbrain-engineering/oh-my-guava-we-are-moving-to-caffeine-99387819fdbb

Even if we are able to replace guava, I don't see any other packages that can obviously be removed to get under 20mb. I think replacing circe would be a bigger win.

Agree. I could help replace circe if that's okay.
There are a lot of json parsers for scala. borer and jsoniter look promising. Any preference?

That would be fantastic. Borer looks promising from reading the docs. I see some parallels w/ Circe, e.g., the deriveCodec[T].

In general, I want to minimize or at least centralize the JSON munging. Right now, it's mostly centralized in ElasticsearchCodec.

It will be a big lift, but a huge win to go from ~10mb to ~800kb.

Cool. I had a quick look at the ElasticsearchCodec already. I'll try to migrate the decoding logic with borer.

The reduction will be more than 10MB since also shapeless and other stuff will go away. I was able to build the dependency tree with gradle -q plugin:dependencies. Here are all the dependencies circe and guava bring in:

+--- project :api4s
|    +--- io.circe:circe-parser_2.12:0.14.0-M4
|    |    +--- org.scala-lang:scala-library:2.12.12 -> 2.12.13
|    |    +--- io.circe:circe-jawn_2.12:0.14.0-M4
|    |    |    +--- org.scala-lang:scala-library:2.12.12 -> 2.12.13
|    |    |    +--- io.circe:circe-core_2.12:0.14.0-M4
|    |    |    |    +--- org.scala-lang:scala-library:2.12.12 -> 2.12.13
|    |    |    |    +--- io.circe:circe-numbers_2.12:0.14.0-M4
|    |    |    |    |    \--- org.scala-lang:scala-library:2.12.12 -> 2.12.13
|    |    |    |    \--- org.typelevel:cats-core_2.12:2.4.2
|    |    |    |         +--- org.scala-lang:scala-library:2.12.13
|    |    |    |         +--- org.typelevel:cats-kernel_2.12:2.4.2
|    |    |    |         |    \--- org.scala-lang:scala-library:2.12.13
|    |    |    |         \--- org.typelevel:simulacrum-scalafix-annotations_2.12:0.5.4
|    |    |    |              \--- org.scala-lang:scala-library:2.12.13
|    |    |    \--- org.typelevel:jawn-parser_2.12:1.1.0
|    |    |         \--- org.scala-lang:scala-library:2.12.12 -> 2.12.13
|    |    \--- io.circe:circe-core_2.12:0.14.0-M4 (*)
|    \--- io.circe:circe-generic-extras_2.12:0.13.1-M4
|         +--- org.scala-lang:scala-library:2.12.13
|         \--- io.circe:circe-generic_2.12:0.13.0
|              +--- org.scala-lang:scala-library:2.12.10 -> 2.12.13
|              +--- io.circe:circe-core_2.12:0.13.0 -> 0.14.0-M4 (*)
|              \--- com.chuusai:shapeless_2.12:2.3.3
|                   +--- org.scala-lang:scala-library:2.12.4 -> 2.12.13
|                   \--- org.typelevel:macro-compat_2.12:1.1.1
|                        \--- org.scala-lang:scala-library:2.12.0 -> 2.12.13
\--- com.google.guava:guava:28.1-jre
     +--- com.google.guava:failureaccess:1.0.1
     +--- com.google.guava:listenablefuture:9999.0-empty-to-avoid-conflict-with-guava
     +--- com.google.code.findbugs:jsr305:3.0.2
     +--- org.checkerframework:checker-qual:2.8.1
     +--- com.google.errorprone:error_prone_annotations:2.3.2
     +--- com.google.j2objc:j2objc-annotations:1.3
     \--- org.codehaus.mojo:animal-sniffer-annotations:1.18

Maybe we can also see if the scala 2.13 versions of Circe and its transitive deps are smaller? I doubt it, but I'm also not sure how to check this. I've been meaning to migrate up to 2.13 for a few months now.

I think swapping out the JSON library or doing some sort of exotic JAR pruning is the way to go.

The jars for 2.13 are the same:

$ curl -sLO https://oss.sonatype.org/service/local/repositories/releases/content/org/typelevel/cats-kernel_2.13/2.6.1/cats-kernel_2.13-2.6.1.jar
$ curl -sLO https://oss.sonatype.org/service/local/repositories/releases/content/org/typelevel/cats-core_2.13/2.6.1/cats-core_2.13-2.6.1.jar
$ du -hs cats-*jar
5.7M	cats-core_2.13-2.6.1.jar
5.1M	cats-kernel_2.13-2.6.1.jar

That is like an order of magnitude more than I would have guessed.

Here are the class counts

$ jar -tvf cats-core_2.13-2.6.1.jar | wc -l
2348
$ jar -tvf cats-kernel_2.13-2.6.1.jar | wc -l
1667

I guess in 99% of cases this doesn't matter. JVM code is usually running on a server where the only real penalty for a larger artifact is more time to pull the image or artifact.

Agree. Replacing the json library with borer seems to be the best thing to do to drastically reduce the plugin size.

I have anyway given a try updating the entire project to scala 2.13 and bump circe as well. Tests look okay except for ExactSimilarityFunctionSuite. I tried to dig in a bit but I am not sure why this does not work. Maybe you might want to have a look since it could be good to upgrade scala before changing the json library (see this branch https://github.com/fabriziofortino/elastiknn/tree/scala_2_13).

Hi @fabriziofortino . That looks like excellent work. I appreciate you taking the time to do it. I'll try my best to look into the test issue this week, or on the weekend at the latest. I agree it's a good idea to go ahead with the scala update first.

#317 LGTM. Unfortunately Github is having a hard time on the JVM tests. They all pass locally for me, so I'm not sure what is happening. Looking deeper ATM.

For the JSON library: I started looking deeper into Borer. I don't see much of an advantage of using Borer over just using the elasticsearch-x-content library, which contains XContentBuilder and XContentParser. This library is only ~150kb, so it's definitely small enough. Some reasons I'm thinking elasticsearch-x-content will be preferable:

  1. It will reduce some serde overhead in the plugin itself. Currently there are several places where the plugin code takes an XContentParser, materializes it to a Map, turns that Map into a circe Json ADT, and then decodes the Json to a Scala type. We can avoid a lot of allocations my just going from XContentParser to the Scala type.
  2. The current codec is already ~80% manual (i.e., uses the HCursor), and the API is stable. If the codec were largely derived or the API were rapidly evolving, I would see some advantage to learning another library's derivation quirks. However, in this case, we might as well just do the remaining 20% manually and avoid the overhead described in my previous point.

I took a first pass at a couple codecs here: https://github.com/alexklibisz/elastiknn/compare/elastiknn-316-xcontentcodec?expand=1

The main point is the XContentCodec[T] typeclass:

trait XContentCodec[T] {
  def buildUnsafe(t: T, x: XContentBuilder): Unit
  def parseUnsafe(p: XContentParser): T
}

My thinking right now is that this should replace the ElasticsearchCodec.

That looks great @alexklibisz !

Thanks. I changed the interfaces a bit (see the PR) since posting the comment, but still same idea. I'd like to add a couple abstractions for common patterns on the parsing side because that code is getting very tedious. Feel free to take a pass at it too if you're interested.

Hey @fabriziofortino. I made some progress on this. When you get a chance, have a look at: https://github.com/alexklibisz/elastiknn/releases/tag/7.14.1.2-PR321-SNAPSHOT

The release artifact is down to 8MB.

image

I'd like to add a handful more tests for error scenarios over in #321.

Thanks @alexklibisz . That looks great! I will review the PR asap.

Here's the new release: https://github.com/alexklibisz/elastiknn/releases/tag/7.14.1.3

Closing this now. Thanks for your help @fabriziofortino. Feel free to re-open if you find issues.