livefront / bridge

An Android library for avoiding TransactionTooLargeException during state saving and restoration

NPE in FileDiskHandler

shashachu opened this issue

Hi,

We just upgraded to the latest version of Bridge and we're seeing this crash in production. No repro steps yet, but here's the callstack:

java.lang.NullPointerException: Attempt to read from field 'java.util.concurrent.Future f.n.a.j.b.b' on a null object reference
        at com.livefront.bridge.disk.FileDiskHandler.cancelFileLoading(FileDiskHandler:120)
        at com.livefront.bridge.disk.FileDiskHandler.clearAll(FileDiskHandler:70)
        at com.livefront.bridge.BridgeDelegate$1.run(BridgeDelegate:98)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
        at java.lang.Thread.run(Thread.java:919)

I could submit a PR for a simple null check, but maybe you have some deeper idea of what's going on.

I think it's due to us calling Bridge.clearAll without knowing there's any data to clear, so a null check is probably fine?

@shashachu Thanks for reporting this. I'll have to think about this one a bit, because I can't see precisely how this could result in an NPE and I don't want to just add a null check until I understand it better. Any other information you might be able to provide would be great.

By the way, you shouldn't call clearAll manually unless you've decided to no longer use Bridge and just want to clear any possible user data moving forward. (I'm not sure whether you were implying that you call it manually, but I just wanted to mention that.)

Thanks @byencho. This was just from a few minutes of investigation, so I could be wrong about the root cause. Looking a bit further into the code, I agree it doesn't really seem possible for mPendingLoadFuture to be null. I'll check our ProGuard mapping to make sure I know which field it's complaining about.

BTW I updated the original bug with the actual exception message in case it's helpful.

@shashachu Sounds good, thanks!

And we do call clearAll in a couple of cases: one is when the user logs out, and the other is basically a disaster-recovery scenario where we detect that the user is repeatedly crashing and we want to try to get them to a clean state.

Here's the relevant part of our ProGuard mapping:

com.livefront.bridge.disk.FileDiskHandler -> f.n.a.j.b:
    java.util.concurrent.Future mPendingLoadFuture -> b

I wonder if it's some kind of race condition.

@shashachu Yeah, my initial thought was that it could be a race condition, but mPendingLoadFuture itself is final and set in the constructor, so I don't see any scenario where you can have a FileDiskHandler with a null mPendingLoadFuture, unless executorService.submit returned a null value for some reason. But that would be a strange bug in ExecutorService that I would expect to be noted somewhere.
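To make that reasoning concrete, here is a minimal sketch of the pattern being described: a final Future field assigned exactly once in the constructor from ExecutorService.submit. DiskHandlerSketch and its methods are hypothetical names, and this is not Bridge's actual FileDiskHandler; it only illustrates why a fully constructed instance cannot hold a null future as long as submit behaves according to its documented contract.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical sketch of the pattern discussed above, NOT Bridge's actual FileDiskHandler.
final class DiskHandlerSketch {
    // final and assigned exactly once in the constructor, so a fully constructed
    // instance never exposes a null mPendingLoadFuture.
    private final Future<?> mPendingLoadFuture;

    DiskHandlerSketch(ExecutorService executorService) {
        // ExecutorService.submit reports failure by throwing (for example
        // RejectedExecutionException), not by returning null.
        mPendingLoadFuture = executorService.submit(new Runnable() {
            @Override
            public void run() {
                // Placeholder for the background "load files from disk" work.
            }
        });
    }

    void cancelFileLoading() {
        // Cancels the load if it is still pending or running; has no effect if it already finished.
        mPendingLoadFuture.cancel(true);
    }

    boolean isLoadingDone() {
        // True after normal completion, an exception, or cancellation.
        return mPendingLoadFuture.isDone();
    }
}
```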

@shashachu Just out of curiosity, are there cases where you call Bridge.clearAll before having called Bridge.initialize?

@shashachu OK so I've got a little more information on this issue. Looking at the stacktrace again, it's not actually saying that mPendingLoadFuture is null, but that mFileDiskHandler is. It's very strange, but I can mostly reproduce it by adding the following code when clearing:

class BridgeDelegate {
    ...
    void clearAll() {
        mUuidBundleMap.clear();
        mObjectUuidMap.clear();
        
        // This is the new code that causes the crash
        doInBackground(new Runnable() {
            @Override
            public void run() {
                mDiskHandler = null;
                ((FileDiskHandler) mDiskHandler).mPendingLoadFuture.cancel(true);
            }
        });
        
        doInBackground(new Runnable() {
            @Override
            public void run() {
                mDiskHandler.clearAll();
            }
        });
    }
    ...
}    

With that code in place, I can trigger the following error when calling Bridge.clearAll:

    java.lang.NullPointerException: Attempt to read from field 'java.util.concurrent.Future com.livefront.bridge.disk.FileDiskHandler.mPendingLoadFuture' on a null object reference
        at com.livefront.bridge.BridgeDelegate$1.run(BridgeDelegate.java:100)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
        at java.lang.Thread.run(Thread.java:923)

It's not exactly the same, but I think it's roughly the same idea. So I think the issue here is that doInBackground accesses mDiskHandler on a background thread, and because mDiskHandler is neither final nor volatile, some unusual combination of circumstances allows it to be seen as null on that background thread. I'm pretty sure the fix here would be to simply make mDiskHandler (and a bunch of other member variables) final. I can't conclusively prove that would solve the issue at the moment, but I think it's a possibility.
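The proposed fix can be sketched as follows, assuming a delegate that clears its in-memory maps on the calling thread and hands disk work to an ExecutorService. DelegateSketch, its nested DiskHandler interface, and the field types are hypothetical stand-ins loosely named after the fields in this thread, not Bridge's actual BridgeDelegate.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;

// Hypothetical sketch of the "make the fields final" fix, NOT Bridge's actual BridgeDelegate.
final class DelegateSketch {
    /** Hypothetical stand-in for the disk handler abstraction. */
    interface DiskHandler {
        void clearAll();
    }

    // Every field is final: a thread that obtains a reference to a fully constructed
    // DelegateSketch is guaranteed to see these fields initialized (JLS 17.5).
    private final Map<String, Object> mUuidBundleMap = new HashMap<>();
    private final Map<Object, String> mObjectUuidMap = new HashMap<>();
    private final DiskHandler mDiskHandler;
    private final ExecutorService mExecutorService;

    DelegateSketch(DiskHandler diskHandler, ExecutorService executorService) {
        mDiskHandler = diskHandler;
        mExecutorService = executorService;
    }

    private void doInBackground(Runnable runnable) {
        mExecutorService.execute(runnable);
    }

    void clearAll() {
        // In-memory state is cleared on the calling thread.
        mUuidBundleMap.clear();
        mObjectUuidMap.clear();
        // Disk work runs in the background. Because mDiskHandler is final,
        // the repro's "mDiskHandler = null;" line would no longer compile.
        doInBackground(new Runnable() {
            @Override
            public void run() {
                mDiskHandler.clearAll();
            }
        });
    }
}
```

With every field final, the repro above cannot even be written, and the final-field guarantee applies to the read of mDiskHandler inside the background task.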

@byencho interesting find! I thought final simply does compile-time checks, in which case I don't think marking it final is enough. But I'm not enough of a Java expert to know if final does anything at runtime.

Re your question about whether it is possible for us to call clearAll prior to calling initialize, I do think it's possible, and looking at the code now, I'll bet that's actually what's happening:

If you call Bridge.clearAll prior to initialization, it initializes a new static BridgeDelegate since there is none. Then immediately after, it spawns the background task and in some cases calls mDiskHandler.clearAll() before the constructor has finished executing.
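The clearAll-before-initialize path described above involves a static delegate that may be created lazily on first use. As a minimal sketch under that assumption, this is one standard way to create such a delegate so it is safely published to other threads: double-checked locking on a volatile static field. LazyDelegateHolder, Delegate, and storageDirName are hypothetical names; this is not how Bridge is implemented.

```java
// Hypothetical sketch, NOT Bridge's actual code. Shows safe lazy creation and
// publication of a static delegate that might first be created by clearAll.
final class LazyDelegateHolder {
    static final class Delegate {
        // final: initialization-safe for any thread that sees this Delegate.
        final String storageDirName;

        Delegate(String storageDirName) {
            this.storageDirName = storageDirName;
        }
    }

    // volatile: the write publishing the delegate happens-before any read that sees it.
    private static volatile Delegate sDelegate;

    private LazyDelegateHolder() {}

    static Delegate getOrCreateDelegate() {
        Delegate delegate = sDelegate;      // single volatile read on the fast path
        if (delegate == null) {
            synchronized (LazyDelegateHolder.class) {
                delegate = sDelegate;       // re-check under the lock
                if (delegate == null) {
                    // Hypothetical directory name, for illustration only.
                    delegate = new Delegate("bridge-sketch");
                    sDelegate = delegate;   // publish only after construction completes
                }
            }
        }
        return delegate;
    }
}
```

Both parts matter: dropping volatile can let another thread see a delegate whose non-final fields are not yet visible, and dropping the re-check under the lock can create two delegates.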

@shashachu So using final does more than just compile-time checks. Under the Java memory model, any thread that sees a reference to a fully constructed object is guaranteed to see the initialized values of that object's final fields, which is what we need in this case. So I'm going to make all member variables final that I can (which I should have done in the first place) and then make a release at some point which hopefully fixes your issue. If not, we'll keep investigating. I have a couple more questions for you:

  1. What is the urgency on this fix? Does this happen very often?
  2. Are you calling Bridge.clearAll and / or Bridge.initialize on background threads? I'd still like to reproduce the actual crash if possible so any more information you could give on the sequence of these calls would be great.
  3. Is it possible for you to reverse the order of the calls and see if that fixes your issue? Calling Bridge.clearAll before calling Bridge.initialize was really only intended for cases where initialize was never going to be called so I'm not surprised weird issues are cropping up here.

@byencho

What is the urgency on this fix? Does this happen very often?

For now we've just rolled back to an older version of Bridge. It's hard to gauge overall volume because we rolled it back after seeing crashes in our 1% rollout, so adoption of that app version was relatively low. We'd like to be on 2.x because it's one of the few remaining dependencies using the support library, but it's not an emergency.

Are you calling Bridge.clearAll and / or Bridge.initialize on background threads? I'd still like to reproduce the actual crash if possible so any more information you could give on the sequence of these calls would be great.

I can't see anywhere that we're calling them off the main thread; Bridge.initialize is called from onCreate and the couple calls to clearAll should all happen on the main thread.

Is it possible for you to reverse the order of the calls and see if that fixes your issue? Calling Bridge.clearAll before calling Bridge.initialize was really only intended for cases where initialize was never going to be called so I'm not surprised weird issues are cropping up here.

We probably could add another call to initialize; we were simply going off the documentation, which stated it was safe to call clearAll without initializing first, so we called it unconditionally. In the case of calling initialize right before clearAll, I might just as soon add a try/catch around it, since we don't really intend to initialize Bridge if it's not already initialized.

@shashachu OK, well, I would not recommend calling initialize more than once and using a try / catch. I think that is only going to cause more problems. If your clearAll calls are really decoupled, then I suppose you should keep going with what you are doing. I have a hard time picturing how both initialize and clearAll can be called on the main thread and yet you do not know their ordering, though. Bridge.initialize should be called as close to the very first thing in Application.onCreate as possible. It's very difficult to have any code that actually runs before that such that a call to clearAll can happen before it. I believe that sort of thing can only happen in a ContentProvider.
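If the calls really are decoupled, one way to follow this advice without calling initialize twice or wrapping it in a try / catch is to route both calls through a single app-side object that remembers a clearAll request made before initialization and runs it right after initialize. This is a sketch under those assumptions: BridgeCallOrdering is a hypothetical helper, not part of Bridge, and the two Runnables stand in for the real Bridge.initialize and Bridge.clearAll calls, whose arguments are omitted.

```java
// Hypothetical app-side helper, NOT part of Bridge. The two Runnables stand in for the
// real Bridge.initialize(...) and Bridge.clearAll(...) calls (arguments omitted here).
final class BridgeCallOrdering {
    private final Runnable initializeCall;
    private final Runnable clearAllCall;
    private boolean initialized;
    private boolean clearRequestedBeforeInit;

    BridgeCallOrdering(Runnable initializeCall, Runnable clearAllCall) {
        this.initializeCall = initializeCall;
        this.clearAllCall = clearAllCall;
    }

    /** Call as early as possible in Application.onCreate; later calls are ignored. */
    synchronized void initialize() {
        if (initialized) {
            return; // never initialize more than once
        }
        initializeCall.run();
        initialized = true;
        if (clearRequestedBeforeInit) {
            // Honor a clear (e.g. logout or crash recovery) requested before initialization.
            clearRequestedBeforeInit = false;
            clearAllCall.run();
        }
    }

    /** Safe to call at any time; a call made before initialize() is deferred until after it. */
    synchronized void clearAll() {
        if (initialized) {
            clearAllCall.run();
        } else {
            clearRequestedBeforeInit = true;
        }
    }
}
```

Deferring the early clear, rather than skipping it, preserves the intent of a logout or crash-recovery clear while guaranteeing initialize runs first. The trade-off is that a deferred clear only runs if initialize is eventually called in that process.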

In any case, I'll look into the changes I've suggested and see if there is anything else I can do to try to prevent this issue. I would really think this would be a pretty rare occurrence, though.

@shashachu OK, I've released v2.0.1 with some changes that I hope will help. I'm guessing you won't really know if it's working until your next release after you update again, but whenever that happens to be, please let me know if it solved your problem. I'm going to close this issue for now, but please re-open it if you see the issue again.

Thank you! We'll give it a shot and report back.