tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone

Home Page: https://tensorflow.org


Running own TensorFlow model on Android gives native inference error: “Session was not created with a graph before Run()!”

abdoelali opened this issue

I was able to run the Inception-v3 model on Android just fine, and I now want to run my own trained TensorFlow model on Android. I'm following the approach from TensorFlow's image recognition tutorial and the Android TensorFlow demo, adapting as necessary. My changes include: (a) integrating Android OpenCV into the bazel build, (b) using my own model and label file, and (c) adjusting parameters (img_size, input_mean, input_std, etc.) accordingly.

From Android logcat, running my model with the tensorflow android demo app gives:

E/native: tensorflow_inference_jni.cc:202 Error during inference: Invalid argument: Session was not created with a graph before Run()!
...
E/native: tensorflow_inference_jni.cc:159 Output [output/Softmax:0] not found, aborting!

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

My own (duplicate) SO thread: http://stackoverflow.com/questions/40555749/running-own-tensorflow-model-on-android-gives-native-inference-error-session-w

Environment info

OS X Yosemite (10.10.5), LGE Nexus 5 (Android 6.0.1), Android SDK 23, Android OpenCV SDK 23, Bazel 0.4.0.

Steps taken

  1. Saved my own model's checkpoint (.ckpt) and graph definition (.pb) files separately using tf.train.Saver() then tf.train.write_graph() (see the sketch after this list)
  2. Froze the graph using freeze_graph.py (run via bazel), which gives a 227.5 MB file
  3. Optimized the graph using optimize_for_inference.py (additionally tried strip_unused.py)
  4. Copied frozen, optimized, or stripped graph to android/assets
  5. Doubled the total byte limit using coded_stream.SetTotalBytesLimit() in jni_utils.cc to handle my large model size
  6. Built the tensorflow android app using bazel
  7. Installed on android device using adb and bazel
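
For completeness, step 1 looks roughly like this (a minimal sketch with a toy graph; the variable names, paths, and as_text choice are illustrative, not my exact code):

```python
import tensorflow as tf

# Toy graph standing in for the real model (illustrative only).
x = tf.placeholder(tf.float32, [None, 4], name='input')
w = tf.Variable(tf.zeros([4, 2]))
y = tf.nn.softmax(tf.matmul(x, w), name='output')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Save the variable values to a checkpoint (.ckpt)...
    tf.train.Saver().save(sess, './my_model.ckpt')
    # ...then write the graph definition (.pb) separately; freeze_graph.py
    # later merges the checkpoint values into the graph as constants.
    tf.train.write_graph(sess.graph_def, '.', 'my_model.pb', as_text=False)
```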

As a sanity check, I have tested my model in a C++ build with bazel, following the label_image tutorial, and my model correctly outputs a prediction. I have also tried changing the order in which I save my graph def and checkpoint files before freezing, but saw no change.

Any help would be great.
cc @drpngx @andrewharp

@abdoelali Are you certain that output/Softmax:0 is the correct node for your graph? Possibly it's capitalized differently?

@andrewharp Yes, pretty sure that output/Softmax is the correct node name, and output/Softmax:0 the correct tensor name, as seen through the logs from graph.get_operations(). I've also tested both just in case, but it seems the error is occurring before (at Session was not created with a graph before Run()!).
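
For reference, I list the op names roughly like this (a minimal sketch; the file name is a placeholder for my frozen graph):

```python
import tensorflow as tf

# Parse the frozen GraphDef and print every operation name, to
# confirm the exact (case-sensitive) input/output node names.
graph_def = tf.GraphDef()
with open('cnn_frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

tf.import_graph_def(graph_def, name='')
for op in tf.get_default_graph().get_operations():
    print(op.name)
# The tensor name used at inference time is the op name plus an
# output index, e.g. 'output/Softmax:0'.
```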

The funky naming is due to using TFlearn to train my network, where the final FC layer is called in Python as: fully_connected(self.network, len(CLASSES), activation = 'softmax', name='output'). An export of my graph is here.

Also, this is the latest tensorflow commit I'm on: Mon Oct 31 09:15:24 2016 -0700 3ccdb2b695586201cde65e079806c5941ae542b6

You may be running into #5111. The simple workaround is to not compress the pb in your apk by adding nocompress_extensions = [".pb",], to the android_binary target. Updating the pb loading logic for a real fix is on my TODO list.
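
In the demo's BUILD file that looks something like this (a sketch; only nocompress_extensions is the new part, and the other attributes follow the existing tensorflow_demo target):

```python
# aapt compresses assets by default; storing .pb files uncompressed
# lets them be read/mmapped directly out of the APK.
android_binary(
    name = "tensorflow_demo",
    assets = glob(["assets/**"]),
    assets_dir = "assets",
    nocompress_extensions = [".pb"],
    # ... srcs, manifest, deps, etc. unchanged ...
)
```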

If that doesn't fix it, can you paste the adb logcat of a full run? It could also be that it's either not finding the graph or TF isn't being built with all the required operators.

adb logcat *:E native:V tensorflow:V should catch everything relevant.

Adding no compression for .pb extensions did not seem to help. Full run (using optimized_cnn_graph.pb) of adb logcat here, and the android tf BUILD here. Also tested on stripped_inception_graph.pb.

Further sanity check: trained and ran a test 32-layer ResNet model (10 epochs, batch size 128), with a resultant stripped model size of 8.3 MB. The same error occurs, so it doesn't seem to depend on .pb size. ResNet graph def here and accompanying adb logcat here.

@abdoelali Check for the line tensorflow_inference_jni.cc:131 Creating session in the Android logcat messages. There is a possibility that the graph has not been created correctly.

@shastakr I can't find tensorflow_inference_jni.cc:131, but I noticed that the ResNet graph was failing because of tensorflow_inference_jni.cc:133 Could not create Tensorflow Graph (which has to do with batch normalization), though not the earlier CNN model. I'm currently testing whether models implemented in Keras also result in the same error.

If the graph contains tf.cond (used as a train/validation switch in the same code), it cannot be converted correctly. So I changed tf.cond to a Python if-else (like the smart_cond in tf.contrib.layers.batch_norm) for my own batchnorm layers, and it works. Batchnorm is difficult to deal with.
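
Roughly the following (a minimal sketch; the helper and its arguments are illustrative, and is_training must be a plain Python bool so the branch is resolved while the graph is being built):

```python
import tensorflow as tf

def batch_norm(x, moving_mean, moving_var, offset, scale, is_training):
    # tf.cond(is_training, ...) bakes both branches plus Switch ops
    # into the exported graph, which the mobile runtime rejects.
    # A Python if-else leaves only one branch in the frozen .pb.
    if is_training:
        mean, var = tf.nn.moments(x, axes=[0])
        # ... also update moving_mean / moving_var with these batch stats ...
    else:
        mean, var = moving_mean, moving_var
    return tf.nn.batch_normalization(x, mean, var, offset, scale, 1e-5)
```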

I had the same issue. It cost me several hours to track down, but I finally solved it. Maybe you made the same mistake.

I had changed the model file to:

private static final String MODEL_FILE = "my_frozen_graph.pb";

The assetManager.open call said it could open and read the file, and tensorflow reported success (0), with no exception and no debug message, when calling inferenceInterface.initializeTensorflow(assetManager, modelFilename).

So I wrongly assumed that loading worked. There was no error with the input and output naming or with the pb file (frozen with the freeze python script); it was simply that tensorflow was not finding the pb file but gave no error message.

THE SOLUTION was to change the model file to a path that the assetManager.open call does not find (on my phone) but tensorflow does.

private static final String MODEL_FILE = "file:///android_asset/my_frozen_graph.pb";

A suggestion for tensorflow would be to improve the API to correctly report whether loading worked or whether, e.g., the file could not be found.

@shastakr I'm not really working with ResNets, so I'm leaving all matters of batch normalization alone for now.

@penguinmenac3 This can't be it, as the .pb file is certainly being read. I know this because: (a) the out-of-the-box inception graph file works correctly (and is placed in the assets dir), and (b) changing my frozen graph file name to one that doesn't exist causes a fatal error.

So, still stuck on:

tensorflow_inference_jni.cc:202 Error during inference: Invalid argument: Session was not created with a graph before Run()!

ivancruzbht commented

I also have the same issue. I posted it on stackoverflow.

What I have noticed in tensorflow_inference_jni.cc is that the GraphDef is instantiated and then the model is read with the ReadFileToProtoOrDie function. I printed the node count before and after it, and it is 0, so in my case there is a problem with this function.

Maybe there is a problem with how the pb parsing is done. To create my model I used Anaconda with tensorflow 0.12.0rc and protobuf 3.1.0 installed with pip, but I built the Android TF lib from source.
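
The parsing step is easy to reproduce on the desktop (a minimal sketch; the file name is a placeholder for my model):

```python
import tensorflow as tf

# Mirror the check in tensorflow_inference_jni.cc: a valid binary
# GraphDef should parse and report more than 0 nodes. A text-format
# .pb (written with as_text=True) will generally fail to parse here,
# which would match the node count of 0 I'm seeing.
graph_def = tf.GraphDef()
with open('my_frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
print('node count:', len(graph_def.node))
```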

@abdoelali I noticed this line in your logcat dump which seems to explain the error you're seeing:
11-15 11:50:15.011 17053 17053 E native : tensorflow_inference_jni.cc:133 Could not create Tensorflow Graph: Invalid argument: Input 0 of node BatchNormalization/cond/AssignMovingAvg_1/Switch was passed float from BatchNormalization/moving_variance:0 incompatible with expected float_ref.

@ivancruzbht Still not sure what's going on in your case, though 7ms seems very fast to initialize (it takes me 70ms to initialize with a 0.5 MB graph), so it probably is just failing to load the graphdef properly. What is the size of your file? Is it binary or text?

@penguinmenac3 Yes, we should do a better job of failing fast and reporting initialization errors clearly -- will look into improving this.

ivancruzbht commented

@andrewharp I just figured out the issue. It seems that ReadFileToProtoOrDie in jni_utils.cc needs the prefix "file:///android_asset/" in the file name in order to use the AssetManager to open the GraphDef file; otherwise it will try to open the file directly, and for some reason opening the pb file without the AssetManager does not work correctly. Then I had an issue in my Gradle script where the non-compression of the pb file was not being applied. Once I fixed that, I could load the graph.

Maybe a couple of "File Not Found" or "AssetManager cannot find the file" logs would improve the initialization :)

@ivancruzbht Right -- if you leave off the prefix it just treats the location as a regular filepath on device (e.g. you could put your file in /sdcard/ somewhere -- there's no reason that PBs need to be bundled with the app in assets).

I suppose the assumption was that proto->ParseFromCodedStream would return false if the file wasn't found, and then the CHECK on ReadFileToProtoOrDie would catch that. That's obviously not the case, so I'll look into making it more robust.

@andrewharp the tensorflow_inference_jni.cc:133 error might explain my ResNet experiment (which indeed gives a batch normalization error), but not the CNN experiment I initially linked to (this logcat).

@ivancruzbht Also have a StackOverflow question open.

I'll try printing the node count and inspecting ReadFileToProtoOrDie in jni_utils.cc.

I've updated ReadFileToProtoOrDie internally to help it live up to the latter part of its name in the case of pb files not being found. Also added a check that the resulting GraphDef has > 0 nodes to catch anything else slipping through. Should be getting pushed to GitHub soon, so hopefully that will help discriminate between loading errors and actual TF issues.

With c8abaee to check for missing files and 5220461 to check for 0 node counts in the GraphDef, most initialization errors should now be caught much earlier.

@abdoelali If you sync again does it still give you the same error? You're also getting a suspiciously fast 7ms initialization time so I'm assuming it was a problem loading the graph.

Additionally, a non-zero return code will throw a RuntimeException with c972217, so I think it's unlikely that any initialization problems can show up as inference errors at this point.

@abdoelali Did you resolve your issue?

@andrewharp Ok, some progress!

If I sync with 5220461 and c972217, but not c8abaee, my model crashes with following logcat:

01-05 15:48:28.062 11639-11639/? I/native: tensorflow_inference_jni.cc:82 Creating new session variables for 8dc6c085059a97b8
01-05 15:48:28.062 11639-11639/? I/native: tensorflow_inference_jni.cc:105 Loading Tensorflow.
01-05 15:48:28.062 11639-11639/? I/native: tensorflow_inference_jni.cc:107 Making new SessionOptions.
01-05 15:48:28.062 11639-11639/? I/native: tensorflow_inference_jni.cc:110 Got config, 0 devices
01-05 15:48:28.064 11639-11639/? I/native: tensorflow_inference_jni.cc:114 Session created.
01-05 15:48:28.064 11639-11639/? I/native: tensorflow_inference_jni.cc:117 Graph created.
01-05 15:48:28.067 11639-11639/? I/native: tensorflow_inference_jni.cc:121 Acquired AssetManager.
01-05 15:48:28.067 11639-11639/? I/native: tensorflow_inference_jni.cc:123 Reading file to proto: file:///android_asset/cnn_frozen_graph.pb
01-05 15:48:28.067 11639-11639/? A/native: tensorflow_inference_jni.cc:126 Check failed: tensorflow_graph.node_size() > 0 Problem loading GraphDef!
01-05 15:48:28.067 11639-11639/? A/libc: Fatal signal 6 (SIGABRT), code -6 in tid 11639 (tensorflow.demo)

However, if I update jni_utils.cc with c8abaee, then (I guess) my graph loads, though it classifies incorrectly (which may be a separate issue). To be sure, the latest working logcat is here. Also, looking at the logcat, shouldn't it print "TF init status != 0"?

Note: I'm now compiling using newer bazel: Build label: 0.4.3-homebrew

@abdoelali It's confusing why c8abaee by itself would cause your graph to load, as it strictly adds more checks to the initialization. It was the first of that series of commits, though, so my guess would be that you're regressing the entire tree to that point rather than cherry-picking commits to add?

Also it appears the code you have on the Java side of things is somewhat outdated, as you still have TF initialized messages coming from CameraConnectionFragment (which did not do any checking of return status). If at all possible I'd suggest syncing the entire tree to the latest codebase to make diagnosing the problem easier.

@andrewharp Pleased to say that all is now working! Steps taken: a full sync with the latest tf build (last commit: 798ae42), built using bazel 0.4.3-homebrew, with Android SDK v23 and NDK v23. I stripped out all preprocessing (i.e., removed OpenCV from the build) and simply swapped my model and label file into the out-of-the-box tf inception demo. Logcat now shows:

01-06 10:27:39.048 26344-26344/? I/TensorFlowImageClassifier: Reading labels from: mylabels.txt
01-06 10:27:39.049 26344-26344/? I/TensorFlowImageClassifier: Read 7, 7 specified
01-06 10:27:39.049 26344-26344/? I/native: tensorflow_inference_jni.cc:97 Native TF methods loaded.
01-06 10:27:39.049 26344-26344/? I/TensorFlowInferenceInterface: Native methods already loaded.
01-06 10:27:39.049 26344-26344/? I/native: tensorflow_inference_jni.cc:85 Creating new session variables for 25a68072eeb0d05
01-06 10:27:39.049 26344-26344/? I/native: tensorflow_inference_jni.cc:113 Loading Tensorflow.
01-06 10:27:39.053 26344-26344/? I/native: tensorflow_inference_jni.cc:120 Session created.
01-06 10:27:39.053 26344-26344/? I/native: tensorflow_inference_jni.cc:126 Acquired AssetManager.
01-06 10:27:39.053 26344-26344/? I/native: tensorflow_inference_jni.cc:128 Reading file to proto: file:///android_asset/cnn_frozen_graph.pb
01-06 10:27:39.680 26344-26344/? I/native: tensorflow_inference_jni.cc:132 GraphDef loaded from file:///android_asset/cnn_frozen_graph.pb with 40 nodes.
01-06 10:27:39.680 26344-26344/? I/native: stat_summarizer.cc:38 StatSummarizer found 40 nodes
01-06 10:27:39.680 26344-26344/? I/native: tensorflow_inference_jni.cc:139 Creating TensorFlow graph from GraphDef.
01-06 10:27:39.713 26344-26344/org.tensorflow.demo I/native: tensorflow_inference_jni.cc:151 Initialization done in 664.038ms
01-06 10:27:39.714 26344-26344/org.tensorflow.demo I/tensorflow: ClassifierActivity: Sensor orientation: 90, Screen orientation: 0
01-06 10:27:39.714 26344-26344/org.tensorflow.demo I/tensorflow: ClassifierActivity: Initializing at size 640x480

My predictions appear to be incorrect (since currently no preprocessing), but that's a different issue. Thanks again!

Glad to hear!