OryxProject / oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Home Page: http://oryx.io

ALS app: java.lang.ClassCastException: java.lang.Object cannot be cast to java.lang.String

srowen opened this issue

Reports of a strange ClassCastException in ALS in master / 2.3:

2016-08-18 17:17:18,768 INFO  ALSServingModelManager:97 ALSServingModel[features:30, implicit:true, X:(7640666 users), Y:(1613282 items, partitions: [0:104911, 1:10022, 2:44695, 3:26323, 4:50937, 5:36393, 6:99643, 7:54777, 8:17366, 9:28681, 10:131557, 11:31438, 12:33617, 13:24153, 14:111447, 15:43643, ...]...), fractionLoaded:0.99938]
2016-08-18 17:17:18,867 INFO  ALSServingModelManager:104 Loading new model
2016-08-18 17:17:24,141 INFO  AbstractOryxResource:86 Model loaded fraction: 0.9996865
2016-08-18 17:17:24,460 INFO  ALSServingModelManager:115 Updating model
2016-08-18 17:17:49,084 ERROR ModelManagerListener:144 Error while consuming updates
java.lang.ClassCastException: java.lang.Object cannot be cast to java.lang.String
        at net.openhft.koloboke.collect.impl.hash.MutableSeparateKVObjLHashGO.removeIf(MutableSeparateKVObjLHashGO.java:275)
        at com.cloudera.oryx.app.serving.als.model.ALSServingModel.lambda$retainRecentAndKnownItems$7(ALSServingModel.java:437)
        at net.openhft.koloboke.collect.impl.hash.MutableLHashParallelKVObjObjMapGO$ValueView.forEach(MutableLHashParallelKVObjObjMapGO.java:2228)
        at com.cloudera.oryx.app.serving.als.model.ALSServingModel.retainRecentAndKnownItems(ALSServingModel.java:435)
        at com.cloudera.oryx.app.serving.als.model.ALSServingModelManager.consume(ALSServingModelManager.java:119)
        at com.cloudera.oryx.lambda.serving.ModelManagerListener.lambda$contextInitialized$1(ModelManagerListener.java:142)
        at com.cloudera.oryx.common.lang.LoggingCallable.lambda$log$0(LoggingCallable.java:48)
        at com.cloudera.oryx.common.lang.LoggingCallable.lambda$asRunnable$1(LoggingCallable.java:66)
        at java.lang.Thread.run(Thread.java:745)
2016-08-18 17:17:49,086 INFO  ModelManagerListener:177 ModelManagerListener closing
2016-08-18 17:17:49,086 INFO  ModelManagerListener:179 Shutting down model manager
2016-08-18 17:17:49,086 INFO  ModelManagerListener:184 Shutting down input producer
2016-08-18 17:17:49,086 INFO  Producer:68 Shutting down producer

@cimox and @flyingandrunning -- you say you have this same error? @cimox you commented that it only happens if data is in a wrong format, could you elaborate? Nicholas also has this error but I can't figure out how to reproduce it otherwise. We've looked at loads of theories.

Sure @srowen, I will try to reproduce it on our dev environment. I'll keep in touch.

Hi @srowen, I've talked with my colleague and he told me that we can share the created model from HDFS with you, if that would help. Basically, we can share the whole HDFS dir from the project where this issue occurred. If you still need me to try to reproduce it, I will give it a try.

This may be related to #312 in that I believe Nicolas is no longer seeing the problem after this change. If you're able to try a build from branch, you can check it out. It'll be in the next release. Let's reopen if anyone still sees it though.

Hi @srowen, any news related to this issue? Can I help you somehow to fix this?

We've had a workaround, at least, for a long while. I think that's the resolution for the foreseeable future.

This is still an issue as seen in #353
@stiv-yakovenko has found a Koloboke issue which is probably related: leventov/Koloboke#66

The two issues occur in different places, but have some clear similarities:

            // First fragment: a filter applied to keys
            if ((key = (E) keys[i]) != FREE) {
                if (filter.test(key)) {

            // Second fragment: an action applied to values
            if (tab[i] != FREE) {
                action.accept((V) tab[i + 1]);
            }

A value is checked against a marker object, FREE, and if it's not the marker, it is passed to a user function. FREE is an Object, not an E or V here, but that doesn't matter after erasure. It does matter when the value is passed to a function that expects a specific type.

The issue is that neither of these checks for the other marker, REMOVED.
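To make the failure mode concrete, here is a minimal sketch in plain Java. It is not Koloboke's code: the class, the method names, and the two marker objects are hypothetical stand-ins. It assumes only what is described above, namely a slot array in which a second marker can appear and a loop that skips FREE alone.

```java
import java.util.function.Predicate;

final class SentinelCastDemo {
    // Hypothetical stand-ins for the library's marker objects.
    static final Object FREE = new Object();
    static final Object REMOVED = new Object();

    /** Mirrors the quoted pattern: only FREE is skipped before calling user code. */
    @SuppressWarnings("unchecked")
    static <E> int countMatchesFreeOnly(Object[] keys, Predicate<E> filter) {
        int matches = 0;
        for (Object slot : keys) {
            E key = (E) slot; // unchecked cast: erased, so it never fails here
            if (key != FREE && filter.test(key)) {
                matches++;
            }
        }
        return matches;
    }

    /** Same loop, but both markers are skipped before calling user code. */
    @SuppressWarnings("unchecked")
    static <E> int countMatchesBothMarkers(Object[] keys, Predicate<E> filter) {
        int matches = 0;
        for (Object slot : keys) {
            if (slot != FREE && slot != REMOVED && filter.test((E) slot)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Object[] keys = {"a1", FREE, REMOVED, "b2"};
        Predicate<String> startsWithA = s -> s.startsWith("a");
        System.out.println(countMatchesBothMarkers(keys, startsWithA)); // prints 1
        try {
            countMatchesFreeOnly(keys, startsWithA);
        } catch (ClassCastException e) {
            // The cast to String happens where the lambda is invoked, not in the loop.
            System.out.println("ClassCastException when REMOVED reached the lambda");
        }
    }
}
```

The unchecked cast to E is a no-op at runtime, so the marker only fails once it reaches the String-typed lambda. That would be consistent with the exception surfacing at the removeIf frame in the stack trace.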

I don't see an obvious workaround. We can remove Koloboke for now or see if it can be fixed upstream.

Small update: I see that the code intends to never leave REMOVED in place after methods like removeIf are called. closeDelayedRemoved cleans those out. Either there is some bug there, or else still some concurrency issue in the caller here. I can't find any accesses that modify the state and aren't protected by a write lock; it's pretty straightforward code on this end.
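For readers unfamiliar with the locking discipline referred to here, this is a rough sketch of the general shape: every mutation takes the write lock, and reads take the read lock. It is not the actual ALSServingModel code. The class and method names are hypothetical, and plain java.util collections stand in for the Koloboke ones.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Predicate;

final class GuardedKnownItems {
    // Hypothetical user -> known-items map; all access goes through the lock.
    private final Map<String, Set<String>> knownItems = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    void addKnownItem(String user, String item) {
        lock.writeLock().lock();
        try {
            knownItems.computeIfAbsent(user, u -> new HashSet<>()).add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Drops every item that the predicate rejects, under the write lock. */
    void retainItems(Predicate<String> keep) {
        lock.writeLock().lock();
        try {
            knownItems.values().forEach(items -> items.removeIf(keep.negate()));
        } finally {
            lock.writeLock().unlock();
        }
    }

    int knownItemCount(String user) {
        lock.readLock().lock();
        try {
            Set<String> items = knownItems.get(user);
            return items == null ? 0 : items.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

If every access follows this pattern, a concurrent modification during removeIf should not be possible, which is why attention shifts to the collection implementation itself.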

Yes, the intention was not to have REMOVED elements, but something went wrong :)
I don't think this is a concurrency problem, because another person observed this bug in a concurrency-free example.
If you want to rescue Koloboke, you will have to create some sort of randomized (fuzz) load test that will a) find the crashing pattern and b) give some guarantee that you have fixed the problem without introducing a new one. I'd remove this Koloboke collection altogether instead. A 10% performance boost is not worth a ClassCastException.
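A test of that kind could look roughly like the sketch below: apply the same random sequence of add, remove and removeIf calls to the set under test and to a java.util.HashSet reference, and report the first step at which they diverge or an exception escapes. The harness and its names are hypothetical. A TreeSet stands in for the collection under test here, and a Koloboke set factory would be plugged in through the Supplier instead.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class SetFuzzCheck {
    /** Returns the first failing step, or -1 if all steps agree with the reference. */
    static int firstFailingStep(Supplier<Set<String>> factory, long seed, int steps) {
        Random random = new Random(seed); // fixed seed keeps failures reproducible
        Set<String> actual = factory.get();
        Set<String> expected = new HashSet<>();
        for (int step = 0; step < steps; step++) {
            String key = "k" + random.nextInt(64);
            int bucket = random.nextInt(4);
            Predicate<String> inBucket = s -> Math.floorMod(s.hashCode(), 4) == bucket;
            try {
                switch (random.nextInt(3)) {
                    case 0:
                        actual.add(key);
                        expected.add(key);
                        break;
                    case 1:
                        actual.remove(key);
                        expected.remove(key);
                        break;
                    default:
                        actual.removeIf(inBucket);
                        expected.removeIf(inBucket);
                        break;
                }
            } catch (RuntimeException e) {
                return step; // e.g. a ClassCastException thrown inside removeIf
            }
            if (!actual.equals(expected)) {
                return step;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // TreeSet is only a stand-in for the set implementation under test.
        System.out.println(firstFailingStep(TreeSet::new, 42L, 100_000)); // prints -1
    }
}
```

Because the seed is fixed, a failing run can be replayed up to the reported step, which helps with point a). Point b) still depends on how many seeds and operation mixes are tried.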

Yes, that's for sure. That's unfortunate. I was hoping to find there's another way around it.

While we can hack the lambda functions we pass to methods like forEach to cope with these unexpected Objects, I don't think that helps calls to retainAll and so on.
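As an illustration of the lambda hack, a wrapper typed on Object can check the runtime type before delegating, so a stray marker is skipped instead of triggering the cast. This is only a sketch. `onlyInstancesOf` is a hypothetical helper, and `forEachSlot` is a made-up stand-in for a library forEach that leaks a non-String object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class TypeGuards {
    /** Wraps an action so that anything that is not a T is silently skipped. */
    static <T> Consumer<Object> onlyInstancesOf(Class<T> type, Consumer<? super T> action) {
        return candidate -> {
            if (type.isInstance(candidate)) {
                action.accept(type.cast(candidate));
            }
        };
    }

    /** Hypothetical stand-in for a library forEach that leaks a marker object. */
    @SuppressWarnings("unchecked")
    static <V> void forEachSlot(Object[] slots, Consumer<? super V> action) {
        for (Object slot : slots) {
            action.accept((V) slot); // unchecked: the marker passes through unchanged
        }
    }

    public static void main(String[] args) {
        Object[] slots = {"a", new Object(), "b"};
        List<String> seen = new ArrayList<>();
        TypeGuards.<String>forEachSlot(slots, onlyInstancesOf(String.class, seen::add));
        System.out.println(seen); // prints [a, b]
    }
}
```

This only protects code we pass in ourselves. A bulk method that does its own comparisons internally is out of reach of such a wrapper.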

It might be possible to do things like copy collections at key points instead of updating them. That doesn't sound great, but might still be better than forgoing Koloboke entirely. The memory impact of using regular collections is, IIRC, quite significant. There are other primitive collection libraries but none more maintained or better than Koloboke.
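The copy-instead-of-update idea might be sketched as follows: rather than calling removeIf on the live set, build a fresh set of the elements to keep and swap the reference. The class is hypothetical, it uses java.util.HashSet rather than a Koloboke set, and it assumes a single writer. The working assumption is that a set which never has elements removed in place never gets a removal marker.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

final class CopyOnRetain {
    // Readers always see a complete set; the writer swaps in a new one.
    private volatile Set<String> items = new HashSet<>();

    void replaceAll(Collection<String> newItems) {
        items = new HashSet<>(newItems);
    }

    /** Keeps only matching items by building a new set; nothing is removed in place. */
    void retain(Predicate<? super String> keep) {
        Set<String> current = items;
        Set<String> copy = new HashSet<>();
        for (String item : current) {
            if (keep.test(item)) {
                copy.add(item);
            }
        }
        items = copy;
    }

    Set<String> snapshot() {
        return Collections.unmodifiableSet(items);
    }
}
```

The cost is a temporary second copy of the retained elements on each update, which is the memory trade-off to weigh against dropping the library.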

We can fork some code from Koloboke if needed, temporarily, to get a fix in. Do we know what the fix even is? It's easy to check for REMOVED in the loop but I don't think it's even supposed to be there, and may cause other issues. This is the most likely way to address this, and I'll have to find time later to look into it.

Well, Koloboke is dead; the author has ignored this critical bug since March. You can use Eclipse Collections, which is based on the Goldman Sachs collections implementation; its memory/performance benefit seems to be comparable: https://github.com/eclipse/eclipse-collections