jpmml / jpmml-evaluator-spark

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)

Memory issues

sidfeiner opened this issue · comments

Hello,
I've been using the Spark evaluator with several PMML files and datasets, but now I'm using a 250 MB PMML file and a dataset with 129 columns and about 138k records, and I very quickly get an OutOfMemoryError.
I reduced the dataset to 500 records (divided into 5 partitions), with about 2 GB on every executor and the driver.
I had about 5-6 executors and still hit an OOM.

During the process I dumped the heap to a file, and from analyzing it, it seems that org.jpmml.evaluator.mining.MiningModelEvaluator occupied 400 MB, and there were 5 threads that occupied about 250 MB each.
Does that make sense?

I didn't attach the dump because it's around 2 GB, but I'll provide whatever you need.

Thanks in advance :)

Some quick comments:

  1. A >100 MB PMML model is typically a decision tree ensemble model, such as random forest or gradient boosted trees (GBM, XGBoost). Decision tree ensembles compact well - a 250 MB PMML file, when parsed into an org.dmg.pmml.mining.MiningModel object and then interned, should not occupy more than ~100-150 MB of memory.
  2. The org.dmg.pmml.mining.MiningModel object is effectively immutable. You should try to find a way to share the same object between different workers/threads. At the moment, I suspect that each worker has its own copy of it, which leads to the OOM.
  3. The memory consumption does not depend on the dimensions of the dataframe. Each dataframe row is wrapped into a java.util.(Linked)HashMap instance, scored, and then unwrapped.

By "interning" I mean applying a list of Interner-type visitors to the org.dmg.pmml.PMML class model object right after it is loaded from the PMML file.

Basically, you should tweak the org.jpmml.evaluator.spark.EvaluatorUtil#createEvaluator(InputStream) utility method, and add the same behaviour as triggered by the --optimize and --intern command-line switches here:
https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator-example/src/main/java/org/jpmml/evaluator/EvaluationExample.java#L144-L156 and https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator-example/src/main/java/org/jpmml/evaluator/EvaluationExample.java#L211-L233
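A minimal sketch of such a tweak, mirroring the --intern behaviour of EvaluationExample (class and method names should be verified against your jpmml-evaluator/jpmml-model dependency versions):

```java
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import org.dmg.pmml.PMML;
import org.dmg.pmml.Visitor;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.evaluator.visitors.PredicateInterner;
import org.jpmml.evaluator.visitors.ScoreDistributionInterner;
import org.jpmml.model.PMMLUtil;

public class OptimizedEvaluatorUtil {

	// Parse the PMML, apply interner visitors, and only then build the evaluator
	static public Evaluator createEvaluator(InputStream is) throws Exception {
		PMML pmml = PMMLUtil.unmarshal(is);

		List<? extends Visitor> interners = Arrays.asList(
			new ScoreDistributionInterner(),
			new PredicateInterner()
		);

		for(Visitor interner : interners){
			interner.applyTo(pmml);
		}

		ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();

		return modelEvaluatorFactory.newModelEvaluator(pmml);
	}
}
```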

As for 2, I've noticed in the heap dump that I have 5 instances of MiningModelEvaluator - but isn't that something that should be handled by Spark? I only create my Evaluator using this repo, and afterwards use the Transformer's transform method. I don't seem to have any control over the object's sharing.

Please share if you ever figure out a way to achieve 2 - I would like to incorporate it into the codebase/documentation.

It would be desirable to have a "5-to-1 setup" where five org.jpmml.evaluator.mining.MiningModelEvaluator instances (cheap, maybe 10 kB of memory each) share the same org.dmg.pmml.mining.MiningModel object instance (very expensive, 400 MB of memory). At the moment you have a "5-to-5 setup", because each evaluator instance has its own copy of the org.dmg.pmml.mining.MiningModel object. Most probably, these five copies of the same thing are created automatically by Apache Spark via object serialization/deserialization when splitting the job between five workers.

This setup is easy to fix in a regular Java application, but I don't know how to do it in Apache Spark applications - it could involve toggling some configuration options (e.g. "do not clone data when workers are running inside the same JVM").

As a starting point - do you know if your application invokes org.jpmml.evaluator.spark.EvaluatorUtil#createEvaluator(..) one time, or five times?

The solution to 2 might involve introducing some new "evaluator instance" sharing layer, which utilizes Apache Spark object sharing logic/mechanisms.

If they are being used in the same JVM (same executor), a broadcast variable can be used, and that way every object will be copied once per executor. I don't mind it being copied once per executor, but I don't want it to be recreated in each thread.

Now I've rerun the job (this time in cluster mode), and from analyzing the new dumps I can see that I have 12 million SimpleLocator objects, 7 million ScoreDistribution objects, 2 million tree.Node objects and 2 million SimplePredicate objects. This is only for 1 executor with 2 cores, btw. Does this make sense, or is there a memory leak somewhere?

A broadcast variable can be used and that way, every object will be copied once per executor.

Very interesting. How do you do that, Java code change?

from analyzing the new dumps, I can see that I have 12 million SimpleLocator objects ..

The SimpleLocator object keeps information about the physical location of XML elements in the original PMML file. They are only used for generating more informative exception messages ("the offending element ABC is located at line XYZ"), and can be safely deleted.

Replace org.jpmml.model.visitors.LocatorTransformer with org.jpmml.model.visitors.LocatorNullifier here, and these 12 million objects will be gone:
https://github.com/jpmml/jpmml-evaluator-spark/blob/master/src/main/java/org/jpmml/evaluator/spark/EvaluatorUtil.java#L58-L59
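For example (a fragment, assuming `pmml` is the freshly unmarshalled class model object):

```java
import org.jpmml.model.visitors.LocatorNullifier;

// Deletes SAX Locator information instead of merely transforming it;
// frees the ~12 million SimpleLocator objects seen in the heap dump
LocatorNullifier locatorNullifier = new LocatorNullifier();
locatorNullifier.applyTo(pmml);
```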

.. 7 million ScoreDistribution objects ..

So your model is a classification-type decision tree ensemble. You can intern repeated ScoreDistribution elements by applying the org.jpmml.evaluator.visitors.ScoreDistributionInterner visitor class (part of the org.jpmml:pmml-evaluator-extension module).

```java
Visitor scoreDistributionInterner = new ScoreDistributionInterner();
scoreDistributionInterner.applyTo(pmml);
```

.. 2 million tree.Node objects ..

They are non-internable.

.. 2 million SimplePredicate objects

You can intern SimplePredicate elements by applying the org.jpmml.evaluator.visitors.PredicateInterner visitor class. Depending on your model's feature configuration (continuous, categorical etc.), you may see different compaction rates. For example, for each boolean feature there need to be only two unique org.dmg.pmml.SimplePredicate class model objects (one for the true case, and the other for the false case).
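Analogous to the ScoreDistributionInterner snippet above (again assuming `pmml` is the loaded class model object):

```java
import org.jpmml.evaluator.visitors.PredicateInterner;

// Deduplicates equal SimplePredicate elements across the tree ensemble
PredicateInterner predicateInterner = new PredicateInterner();
predicateInterner.applyTo(pmml);
```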

But you should apply the different class model optimizers before doing any interning. That should increase the throughput of individual workers considerably, thereby reducing the total number of workers (and memory consumption).

The EvaluatorUtil.createEvaluator(InputStream) utility method should take an extra enum-type argument, which indicates the environment mode:

  • Mode#DEVELOPMENT (and possibly Mode#TESTING). Keep SAX Locator information, do not apply any class model optimizers/interners.
  • Mode#PRODUCTION. Drop SAX Locator information, apply all known class model optimizers/interners.
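One possible shape for this (the Mode enum itself is hypothetical - it does not exist in the codebase yet; the visitor classes are the ones discussed in this thread):

```java
import java.io.InputStream;

import org.dmg.pmml.PMML;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.evaluator.visitors.PredicateInterner;
import org.jpmml.evaluator.visitors.ScoreDistributionInterner;
import org.jpmml.model.PMMLUtil;
import org.jpmml.model.visitors.LocatorNullifier;

public class EvaluatorUtil {

	// Hypothetical environment mode, as proposed above
	public enum Mode {
		DEVELOPMENT,
		TESTING,
		PRODUCTION
	}

	static public Evaluator createEvaluator(InputStream is, Mode mode) throws Exception {
		PMML pmml = PMMLUtil.unmarshal(is);

		if(mode == Mode.PRODUCTION){
			// Drop SAX Locator information, then intern repeated elements
			(new LocatorNullifier()).applyTo(pmml);
			(new ScoreDistributionInterner()).applyTo(pmml);
			(new PredicateInterner()).applyTo(pmml);
		}

		ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();

		return modelEvaluatorFactory.newModelEvaluator(pmml);
	}
}
```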

A broadcast variable can be used and that way, every object will be copied once per executor.
Very interesting. How do you do that, Java code change?

It's something inside Spark. You can define a broadcast variable in the driver, and it will be sent once to each executor - but in my opinion that should already be handled by the Dataframe/Transformer APIs inside Spark.
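A rough sketch of the broadcast approach (assuming the evaluator instance is java.io.Serializable, which would need to be verified):

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.jpmml.evaluator.Evaluator;

public class BroadcastExample {

	// Broadcast the evaluator once from the driver; each executor JVM then
	// deserializes a single copy, which all of its task threads can share
	// by calling Broadcast#value()
	static public Broadcast<Evaluator> broadcastEvaluator(JavaSparkContext sparkContext, Evaluator evaluator){
		return sparkContext.broadcast(evaluator);
	}
}
```

Worker-side code would then obtain the shared instance via Broadcast#value() instead of holding (and deserializing) its own copy.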

I'll try all of those right now - I've already added separate flags for the SAX Locator, interners and optimizers :)

I've tried everything you said, and now my job is working as it should :)
After reducing each executor to 2 GB with 3 cores, it seems I get an OutOfMemoryError: GC overhead limit exceeded. When I analyzed that dump, it seemed like 33% of the heap was occupied by Spark's MutableURLClassLoader, but I guess that's not related to you anymore. If you think it is, tell me :) I still get 1.7 million RichSimplePredicate objects and 3 million DoubletonLists. Any way to reduce that amount?

OutOfMemoryError: GC overhead limit exceeded

The optimization/interning of PMML class model objects does produce great amounts of garbage - these visitor classes are effectively reconstructing the object graph.

I can think of two remedies/workarounds:

  • Customize the JAXB runtime, so that during PMML unmarshalling there will be no SAX Locator information generated and attached to the class model. It's basically a matter of taking one dependency declaration out of the pom.xml build file.
  • Apply visitors in stages, and invoke System.gc() between them.
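The second workaround could look like this (a fragment; `pmml` is the loaded org.dmg.pmml.PMML object, and System.gc() is only a hint to the JVM, not a guarantee):

```java
import java.util.Arrays;
import java.util.List;

import org.dmg.pmml.Visitor;
import org.jpmml.evaluator.visitors.PredicateInterner;
import org.jpmml.evaluator.visitors.ScoreDistributionInterner;
import org.jpmml.model.visitors.LocatorNullifier;

List<? extends Visitor> visitors = Arrays.asList(
	new LocatorNullifier(),
	new ScoreDistributionInterner(),
	new PredicateInterner()
);

// Apply visitors in stages, and suggest a garbage collection between them,
// because each visitor effectively reconstructs the object graph
for(Visitor visitor : visitors){
	visitor.applyTo(pmml);

	System.gc();
}
```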

Have you talked to Boris and/or Iddo about your task? They should know a few tricks about visitors.

I still get 1.7 million RichSimplePredicate objects and 3 million DoubletonList. Any way to reduce that amount?

Probably not.

Each of those RichSimplePredicate instances holds a unique threshold value. And a DoubletonList should be the most memory-efficient representation of a two-element java.util.List.

What's the final size of the org.jpmml.evaluator.mining.MiningModelEvaluator evaluator instances? It should be down to 150-200 MB at this point already?

Let's keep this issue open, because it contains two new ideas that I'd like to implement:

  1. Specifying the "environment mode" - one of DEVELOPMENT, TESTING or PRODUCTION.
  2. Sharing immutable (and expensive) content between workers using Apache Spark broadcast variables.