google / ml-compiler-opt

Infrastructure for Machine Learning Guided Optimization (MLGO) in LLVM.

Adding new features under InlineModelFeatureMaps.h results in the TF model pruner removing them at deployment

amirjamez opened this issue · comments

It might be a TensorFlow bug or an incompatibility among the installed libraries, but here is the issue:

When we declare new features under llvm/include/llvm/Analysis/InlineModelFeatureMaps.h and define them in llvm/lib/Analysis/MLInlineAdvisor.cpp, they are added to the frozen model at each iteration of the trainer.

Also, when I load the frozen graph under model/policy/$ITERATION_NO/saved_policy/* using tf.saved_model.load("saved_model.pb"), the signature shows all the tensor names, including the newly added ones. But when .local/lib/python3.6/site-packages/tensorflow/python/tools/saved_model_aot_compile.py runs _prune_removed_feed_nodes(signature_def, graph_def):

def _prune_removed_feed_nodes(signature_def, graph_def):
  """Identify the inputs in the signature no longer in graph_def, prune them.

  Args:
    signature_def: A `SignatureDef` instance.
    graph_def: A `GraphDef` instance.

  Returns:
    A new pruned `SignatureDef`.
  """
  node_names = set([n.name for n in graph_def.node])
  new_signature_def = meta_graph_pb2.SignatureDef()
  new_signature_def.CopyFrom(signature_def)
  for (k, v) in signature_def.inputs.items():
    tensor_name, _ = _parse_tensor_name(v.name)
    if tensor_name not in node_names:
      logging.warn(
          'Signature input key \'{}\', tensor name \'{}\', has been pruned '
          'while freezing the graph.  Removing it from the compiled signatures.'
          .format(k, tensor_name))
      del new_signature_def.inputs[k]
  return new_signature_def

it prunes the newly added inputs that it does not find in graph_def.node while building LLVM for deployment, and as a result my final deployed model is incorrect:

1644209582.45: clang: /home/llvm-project/llvm/lib/Analysis/ReleaseModeModelRunner.cpp:62: {anonymous}::ReleaseModeModelRunner::ReleaseModeModelRunner(llvm::LLVMContext&): Assertion `Index >= 0 && "Cannot find Feature in inlining model"' failed.
1644209582.45: PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
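
As an aside, a minimal way to confirm that the new inputs are still present in the saved policy's signature before AOT compilation is sketched below; the directory path and the 'action' signature key are assumptions based on the warnings later in this report, not something confirmed by the repo.

import tensorflow as tf

# Load the saved policy directory (not the .pb file) and list the signature's
# input keys; newly added features should appear here before pruning.
policy = tf.saved_model.load("model/policy/0/saved_policy")  # substitute your $ITERATION_NO
action_fn = policy.signatures["action"]  # assumed signature key
_, input_specs = action_fn.structured_input_signature
print(sorted(input_specs.keys()))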

Here are my other TF-related packages:

MLGO commit: dac1b149a523b3271341ae72431484df215d8dd3

commit dac1b149a523b3271341ae72431484df215d8dd3 (origin/master, origin/HEAD, master)
Author: Mircea Trofin <mtrofin@google.com>
Date:   Thu Jan 21 11:04:00 2021 -0800

    Fix to demo OUTPUT_DIR

    (Thanks to @liyuqian for the fix)
tensorboard             2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.7.0
tensorflow              2.4.1
tensorflow-addons       0.11.2
tensorflow-estimator    2.4.0
tensorflow-probability  0.12.2
tf-agents               0.7.1
tf-estimator-nightly    2.4.0.dev2020102201

Also, 2.4.1 was used to create the model:

>>> imported
<tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x7f6bcc4f7a58>
>>> imported.tensorflow_version
'2.4.1'

Here is the debug output:

coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2022-02-06 21:56:34.286796: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-06 21:56:34.290840: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-02-06 21:56:34.290891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-02-06 21:56:34.293434: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-02-06 21:56:34.293855: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-02-06 21:56:34.296799: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-02-06 21:56:34.297732: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-02-06 21:56:34.297943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-02-06 21:56:34.299759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-02-06 21:56:34.299802: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-06 21:56:35.075324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-06 21:56:35.075387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2022-02-06 21:56:35.075400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2022-02-06 21:56:35.078385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11119 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0, compute capability: 6.0)
2022-02-06 21:56:35.096700: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2194810000 Hz
2022-02-06 21:56:35.236522: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] model_pruner failed: Invalid argument: Graph does not contain terminal node StatefulPartitionedCall_2.
2022-02-06 21:56:35.247615: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:928] Optimization results for grappler item: graph_to_optimize
  model_pruner: Graph size after: 38 nodes (-2), 48 edges (0), time = 0.987ms.
  implementation_selector: Graph size after: 38 nodes (0), 48 edges (0), time = 0.562ms.
  function_optimizer: Graph size after: 343 nodes (305), 581 edges (533), time = 22.39ms.
  common_subgraph_elimination: Graph size after: 303 nodes (-40), 541 edges (-40), time = 3.508ms.
  constant_folding: Graph size after: 227 nodes (-76), 387 edges (-154), time = 55.232ms.
  shape_optimizer: shape_optimizer did nothing. time = 0.41ms.
  arithmetic_optimizer: Graph size after: 238 nodes (11), 398 edges (11), time = 4.174ms.
  layout: Graph size after: 238 nodes (0), 398 edges (0), time = 5.838ms.
  remapper: Graph size after: 238 nodes (0), 398 edges (0), time = 1.459ms.
  loop_optimizer: Graph size after: 238 nodes (0), 397 edges (-1), time = 1.714ms.
  dependency_optimizer: Graph size after: 156 nodes (-82), 221 edges (-176), time = 3.391ms.
  memory_optimizer: Graph size after: 156 nodes (0), 221 edges (0), time = 6.835ms.
  model_pruner: Invalid argument: Graph does not contain terminal node StatefulPartitionedCall_2.
  implementation_selector: Graph size after: 156 nodes (0), 221 edges (0), time = 0.468ms.
  function_optimizer: function_optimizer did nothing. time = 0.127ms.
  common_subgraph_elimination: Graph size after: 146 nodes (-10), 211 edges (-10), time = 1.021ms.
  constant_folding: Graph size after: 146 nodes (0), 211 edges (0), time = 3.151ms.
  shape_optimizer: shape_optimizer did nothing. time = 0.133ms.
  arithmetic_optimizer: Graph size after: 146 nodes (0), 211 edges (0), time = 2.64ms.
  remapper: Graph size after: 146 nodes (0), 211 edges (0), time = 0.8ms.
  dependency_optimizer: Graph size after: 146 nodes (0), 211 edges (0), time = 1.752ms.

2022-02-06 21:56:35.281080: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-02-06 21:56:35.282047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:82:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2022-02-06 21:56:35.282083: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-06 21:56:35.282137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-02-06 21:56:35.282155: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-02-06 21:56:35.282172: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-02-06 21:56:35.282190: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-02-06 21:56:35.282208: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-02-06 21:56:35.282226: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-02-06 21:56:35.282243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-02-06 21:56:35.283971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-02-06 21:56:35.284299: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-02-06 21:56:35.285214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:82:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2022-02-06 21:56:35.285237: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-06 21:56:35.285258: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-02-06 21:56:35.285277: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-02-06 21:56:35.285295: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-02-06 21:56:35.285311: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-02-06 21:56:35.285328: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-02-06 21:56:35.285345: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-02-06 21:56:35.285363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-02-06 21:56:35.287113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-02-06 21:56:35.287143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-06 21:56:35.287153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2022-02-06 21:56:35.287161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2022-02-06 21:56:35.288959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11119 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0, compute capability: 6.0)
INFO:tensorflow:Restoring parameters from /home/llvm-project/llvm/lib/Analysis/models/inliner/variables/variables
2022-02-06 21:56:35.354948: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
WARNING:tensorflow:From /home/.local/lib/python3.6/site-packages/tensorflow/python/tools/saved_model_aot_compile.py:332: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.convert_variables_to_constants`
WARNING:tensorflow:From /home/.local/lib/python3.6/site-packages/tensorflow/python/framework/convert_to_constants.py:856: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
WARNING:tensorflow:Signature input key 'XXX', tensor name 'action_XXX', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'discount', tensor name 'action_discount', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'XXX', tensor name 'action_XXX', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'reward', tensor name 'action_reward', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'step_type', tensor name 'action_step_type', has been pruned while freezing the graph.  Removing it from the compiled signatures.
WARNING:tensorflow:Signature input key 'inlining_default', tensor name 'action_inlining_default', has been pruned while freezing the graph.  Removing it from the compiled signatures.
INFO:tensorflow:Writing graph def to: /tmp/saved_model_clilxj7nh6h/frozen_graph.pb
INFO:tensorflow:Writing config_pbtxt to: /tmp/saved_model_clilxj7nh6h/config.pbtxt
INFO:tensorflow:Generating XLA AOT artifacts in: /home/llvm-project/build/lib/Analysis

The original question was also posted at tensorflow/tensorflow#54296, but I thought this repo would have the better audience. It would be nice to have some sort of compatibility table for all these required libraries, as I suspect there might be a mismatch between two of them. I also checked https://github.com/google/ml-compiler-opt/blob/main/requirements.txt and its history, but that info isn't available there.

Thanks,
-Amir

It might be a TensorFlow bug or an incompatibility among the installed libraries, but here is the issue:

When we declare new features under llvm/include/llvm/Analysis/InlineModelFeatureMaps.h and define them in llvm/lib/Analysis/MLInlineAdvisor.cpp, they are added to the frozen model at each iteration of the trainer.

@yundiqian would know more about this, but this might help: adding features there just says "we expect these features in the model". There should be an accompanying change in compiler_opt/rl/inlining/config.py.

(well... we should add a guide for how to extend the feature set)
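
For illustration, a change of that kind might look roughly like this in the dac1b14-era config.py (the feature names below are placeholders, not the repo's exact contents):

# Illustrative only: in the older config.py, features were listed by name.
# "my_new_feature" is a hypothetical example; it must match the name declared
# in InlineModelFeatureMaps.h.
feature_keys = (
    "callee_basic_block_count",
    # ... existing features ...
    "my_new_feature",
)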

In addition to what @mtrofin mentioned, you will have to add a corresponding vocabulary file to compiler_opt/rl/inlining/vocab. If I remember correctly, the file is a 1000-bucket histogram of the feature values, which is used to normalize the inputs.

Oh, and we should have automation for that (like really soon) @kshiteejm

It might be a TensorFlow bug or an incompatibility among the installed libraries, but here is the issue:
When we declare new features under llvm/include/llvm/Analysis/InlineModelFeatureMaps.h and define them in llvm/lib/Analysis/MLInlineAdvisor.cpp, they are added to the frozen model at each iteration of the trainer.

@yundiqian would know more about this, but this might help: adding features there just says "we expect these features in the model". There should be an accompanying change in compiler_opt/rl/inlining/config.py.

(well... we should add a guide for how to extend the feature set)

Thanks. Yes, I already applied the necessary changes there.

In addition to what @mtrofin mentioned, you will have to add a corresponding vocabulary file to compiler_opt/rl/inlining/vocab. If I remember correctly, the file is a 1000-bucket histogram of the feature values, which is used to normalize the inputs.

Thanks. Maybe that could be the missing piece. Does that serve the purpose of quantizing/bucketizing the space in which the features are meant to move (rather than a continuous space)?

I have difficulty drawing a connection between this and why the added features (new tensors) are not in the graph def after it has been run through the graph optimizer and pruner.

The vocabulary files are used to create a preprocessing layer for each feature in the model that normalizes the feature between 0 and 1.

If you do not provide vocabulary files for the new features, then those features are disconnected from the component of the graph that contains the observed output node (because the preprocessing layers do not exist), so the graph pruner may safely delete those nodes without changing observable computations. This causes the crash (the assertion failure) in clang, because the compiled model does not contain the features you requested.

For some visibility into how this happens on the Python side, check out the following functions/lines:

Fair warning: I'm not a TensorFlow expert, so I might have described the graph pruning process slightly incorrectly, but this is my mental model. @yundiqian will be able to correct anything I said here.
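
As a rough sketch of that mental model only (not the repo's actual preprocessing code, and assuming the .buckets files are plain text with one sorted boundary per line; the file name below is just an example from the existing vocab directory):

import numpy as np

def load_vocab(path):
    # Read one numeric bucket boundary per line.
    with open(path) as f:
        return np.array([float(line) for line in f if line.strip()])

def normalize(raw_value, boundaries):
    # searchsorted counts how many boundaries the raw value exceeds, so the
    # result is roughly in [0, 1] regardless of the feature's raw scale.
    return np.searchsorted(boundaries, raw_value) / len(boundaries)

boundaries = load_vocab("compiler_opt/rl/inlining/vocab/callee_basic_block_count.buckets")
print(normalize(42.0, boundaries))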

Hi Amir,

There are multiple reasons this can happen; two questions to identify the root cause:

  1. Is your training going well? (Is it interrupted very quickly after starting, or does it run for a while? How many iteration numbers do you see under model/policy/$ITERATION_NO? Do you see many, or only 0 there?) If it is not going well, can you paste the log it prints?
  2. What's the change you made to config.py?

Hi Amir,

There are multiple reasons this can happen; two questions to identify the root cause:

  1. Is your training going well? (Is it interrupted very quickly after starting, or does it run for a while? How many iteration numbers do you see under model/policy/$ITERATION_NO? Do you see many, or only 0 there?) If it is not going well, can you paste the log it prints?

I didn't see interruptions; however, the $ITERATION_NO values were different in the two cases (the original features vs. the revised features). For the original features it went up to around ~1700, while with the revised features it stopped saving new $ITERATION_NO around ~600, even though training itself ran for almost 500k iterations. The loss kept going up and down a bit, and I saw no reason to carry on the training.

  2. What's the change you made to config.py?

Adding the new features under feature_keys and then implementing the necessary code in LLVM to collect them.

The vocabulary files are used to create a preprocessing layer for each feature in the model that normalizes the feature between 0 and 1.

If you do not provide vocabulary files for the new features, then those features are disconnected from the component of the graph that contains the observed output node (because the preprocessing layers do not exist), so the graph pruner may safely delete those nodes without changing observable computations. This causes the crash (the assertion failure) in clang, because the compiled model does not contain the features you requested.

For some visibility into how this happens on the Python side, check out the following functions/lines:

Fair warning: I'm not a TensorFlow expert, so I might have described the graph pruning process slightly incorrectly, but this is my mental model. @yundiqian will be able to correct anything I said here.

So, what method do you suggest for generating the buckets a priori? Looking at the value distribution of the added features over a large dataset and generating 1000 buckets? Obviously, this has a chicken-and-egg problem for new features.

So, what method do you suggest for generating the buckets a priori? Looking at the value distribution of the added features over a large dataset and generating 1000 buckets? Obviously, this has a chicken-and-egg problem for new features.

I have a tool to generate buckets for all features (including any new features). It is very close to release, hopefully in the next couple of days.

For generating buckets:

  1. You will have to execute the trace generator (https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/tools/generate_default_trace.py) on your repository of IR files with your version of LLVM to generate tfrecord files containing raw feature values.
  2. The bucket generator tool that I will release takes these tfrecord files as input and generates vocabs for all features.

You can then point to these vocabulary files and hopefully that will resolve the above issues you are facing.

In the interim, I would recommend trying to add mock vocab files with some values (maybe 500 zeros followed by 500 ones, or counting 1...1000) to see if this resolves the issue with the compiled model in clang.
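
For example, a mock vocab file along those lines could be created like this (the feature name and the vocab path are assumptions that must match your setup):

# Write 1000 monotonically increasing values, one per line, as a stand-in
# vocab for a newly added feature.
feature_name = "my_new_feature"  # hypothetical feature name
with open(f"compiler_opt/rl/inlining/vocab/{feature_name}.buckets", "w") as f:
    for value in range(1, 1001):
        f.write(f"{value}\n")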

Hi Amir,
There are multiple reasons this can happen; two questions to identify the root cause:

  1. Is your training going well? (Is it interrupted very quickly after starting, or does it run for a while? How many iteration numbers do you see under model/policy/$ITERATION_NO? Do you see many, or only 0 there?) If it is not going well, can you paste the log it prints?

I didn't see interruptions; however, the $ITERATION_NO values were different in the two cases (the original features vs. the revised features). For the original features it went up to around ~1700, while with the revised features it stopped saving new $ITERATION_NO around ~600, even though training itself ran for almost 500k iterations. The loss kept going up and down a bit, and I saw no reason to carry on the training.

  2. What's the change you made to config.py?

Adding the new features under feature_keys and then implementing the necessary code in LLVM to collect them.

I think I understand where your issues come from. Your code repo hasn't been updated for a while (we no longer have 'feature_keys'). In this old version, if it does not see the relevant vocab file for a certain feature (you added the feature to config.py but did not generate a vocab for it), it voids this feature (it will still exist in saved_model.pb, as you see, but it will disappear after you convert the model to AOT). This explains your observation perfectly.

For solutions,

  1. pull the repo, add the feature to observation_spec in rl/inlining/config.py (a rough sketch follows this list)
  2. generate the vocab (we will soon release a tool and also update the demo for instructions)
  3. train model
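
A rough sketch of what step 1 might look like (illustrative only; "my_new_feature" is a hypothetical name and the exact spec construction may differ from the repo's code):

import tensorflow as tf

# The inlining features are scalar integer observations keyed by the same
# names declared in InlineModelFeatureMaps.h.
observation_spec = [
    tf.TensorSpec(dtype=tf.int64, shape=(), name=key)
    for key in (
        "callee_basic_block_count",  # existing feature
        "my_new_feature",            # newly added feature (hypothetical)
    )
]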

A few explanations about vocab:

The vocab files are here. They are used during training. What's currently in the repo is what we pre-produced for the current features in LLVM, so of course it does not include what you newly added in LLVM. With the tool we will release soon, you will be able to generate something similar, but including the new features. (Don't worry about how; we will update the demo with instructions.)

Almost every feature expects its corresponding vocab file during training; if the file is not found, 1) in the old repo, the feature is voided; 2) in the latest repo, it breaks.

Thank you all for the info. I'll give it a try.

I can confirm that the issue is resolved after adding the buckets for the newly added features. The AOT model pruner doesn't touch them anymore. The only cost is retraining the RL agent again (still ongoing!).

@kshiteejm @jacob-hegna While we wait for the bucket-generation tool to become available, I can suggest two methods for those in need of one:

  1. If you know the distribution of the data generated by the added features by heart, you can simply design a distribution (or leverage an existing one), fit it to your data manually, and then bucketize it.
  2. If you don't, this is simply a density estimation problem: you need to look into the default trace and come up with one (see the quantile sketch after this list). I am guessing this is what is being worked on anyway.
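
A minimal sketch of option 2 (this is not the released tool; it just derives quantile-style boundaries from raw values you would extract from the trace yourself):

import numpy as np

def make_buckets(raw_values, num_buckets=1000):
    # Evenly spaced quantiles of the observed values serve as bucket
    # boundaries; monotonically non-decreasing by construction.
    quantiles = np.linspace(0.0, 1.0, num_buckets)
    return np.quantile(np.asarray(raw_values, dtype=np.float64), quantiles)

# Toy example values; in practice these come from the collected trace.
buckets = make_buckets([0, 1, 1, 2, 3, 5, 8, 13, 21])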

@amirjamez We released a tool today to generate your own bucket files. You can find the instructions at https://github.com/google/ml-compiler-opt/blob/main/docs/demo/demo.md#collect-trace-and-generate-vocab. Please let us know in case you have any further questions.

Thanks @kshiteejm. I'll give it a try.
Edited: I did try it and it worked fine. It generated buckets for all features, including ones that I did not already have and that were getting removed by the graph pruner later at deployment (reward.buckets, inlining_decision.buckets, and inlining_default.buckets).

Looking into the history (https://github.com/google/ml-compiler-opt/tree/15ff9bfcfe5093f7e325a17a8d33b9db6b9e20f0/compiler_opt/rl/inlining/vocab) up to the latest commit (8826749), the repo (https://github.com/google/ml-compiler-opt/tree/main/compiler_opt/rl/inlining/vocab) didn't have these three buckets in vocab. Maybe I am missing something, or maybe these shouldn't be added by sparse_bucket_generator.py?

@kshiteejm I have another issue: the loss is converging to almost zero (literally zero) in the warmstart, and that causes the loss when training the optimized model to come back as nan now. Any suggestions?

Edited: removing those three newly added buckets mentioned above resolved the nan loss issue.

@amirjamez glad you got it working. Those three additional features are not picked up during training in the current version of the code, and that is intended. More details follow.

The sparse_bucket_generator.py tool generates a superset of the features that are actually used during training. The reward.buckets, inlining_decision.buckets, and inlining_default.buckets files are not used during training (even though they are generated by the tool). Only buckets for features in observation_spec (https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/rl/inlining/config.py#L30; inlining_default is ignored) are picked up during training (https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/rl/agent_creators.py#L113). The reward_spec feature (reward.buckets) and the action_spec feature (inlining_decision.buckets) are not picked up during training. I hope this further clarifies things!

Thanks @kshiteejm for clarifying the newly added buckets.

So what was the reason that including these three made my warm_start loss drop close to 0 and eventually led to a nan in training? Can you reproduce this issue on your end?

Not sure what you mean by including the newly added buckets. Did you make edits to the code to do something more with these three buckets, or did you leave the code untouched and just modify the set of bucket files in the compiler_opt/rl/inlining/vocab folder? This will help me reproduce the issue on my end.

So, after running sparse_bucket_generator.py, which added three new buckets (reward.buckets, inlining_decision.buckets, and inlining_default.buckets) to my vocab, I reran the warm_start script. The reported loss (also visible in TensorBoard) dropped dramatically after a few iterations; by iteration ~1000 (out of the 100k default) it was pretty close to zero, and eventually it became exactly zero. In previous warm_start training runs I had seen the loss fluctuate between 0.18 and 0.02 by the end of the 100k stretch. Then, at fine-tuning, the warm_start model is picked up as the starting point via --gin_bindings=train_eval.warmstart_policy_dir=\"$WARMSTART_OUTPUT_DIR/saved_policy\", and it immediately reported the loss as nan as training carried on. I had to stop and see what had changed, and when I removed those three bucket files from my vocab and redid the process, the issue was resolved. Hope this was clear.

I also tried skipping the warm_start and directly training an optimized model, but as you can see, the loss was very low from the start, and I don't think backprop learned anything at all (it stayed around 0.00000X for about 200k iterations). Not sure what causes the vanishing gradients here.

Thanks,
-Amir

Hi Amir,

Did you pull the latest version of the whole repo? If so, what is the change you made to compiler_opt/rl/inlining/config.py?

I have a fairly direct hypothesis, so I'd like to confirm these two points with you :)

No, unfortunately I am still on commit dac1b14 and have only cherry-picked sparse_bucket_generator.py.

Regarding config.py (https://github.com/google/ml-compiler-opt/blob/dac1b149a523b3271341ae72431484df215d8dd3/compiler_opt/rl/config.py), I only added the names of my custom features to feature_keys. I know there have been changes to the repo, but I don't understand how that could be the cause of this issue.

Looking forward to hearing your hypothesis :)
-Amir

OK, I think my hypothesis is correct then :) The solution is to pull the latest version; it will solve your problem.

The reason you see this phenomenon is that, with the old version of the code and your newly generated vocab folder, the fake inlining_default feature is used as a real feature. It is exactly the same as the label, which causes the loss to be 0 and then NaN due to numerical issues. With the new code, this problem no longer exists because we prune out the inlining_default feature explicitly here: https://github.com/google/ml-compiler-opt/blob/main/compiler_opt/rl/inlining/config.py#L92

Great. Thanks @yundiqian