RuntimeError: Unable to create link (name already exists)

Question

innat opened this issue 2 years ago · comments

@sayakpaul
I tried to run ConvNeXt from TF-Hub but ran into this error. Could you please take a look? The error can be reproduced with this script, with the following addition:

mdckpt = tf.keras.callbacks.ModelCheckpoint(
    "model.h5", 
    monitor='val_accuracy', 
    verbose=1, 
    save_best_only=True,
    save_weights_only=True, 
    mode='max', 
    save_freq='epoch'
)

Also note that the error can perhaps be addressed by renaming the layers and their weights. For example:

import uuid

# ref web
def handle_name_exist_issue(model):
    def unique_name():
        return uuid.uuid4().hex.upper()[0:10]

    def postprocess_weight_name(name):
        # Insert a unique token into the top-level group of the weight name.
        parts = name.split('/')
        if len(parts) == 1:
            return f'{unique_name()}/{name}'
        # Handles any depth; the earlier version returned None beyond three parts.
        group, rest = parts[0], '/'.join(parts[1:])
        return f'{group}{unique_name()}/{rest}'
   
    model._name = model._name + unique_name()
    for layer in model.layers:
        layer._name = layer._name + unique_name()
    for i in range(len(model.weights)):
        model.weights[i]._handle_name = postprocess_weight_name(model.weights[i].name)
    return model

with strategy.scope(): 
  model = get_model(MODEL_PATH)
  model = handle_name_exist_issue(model)
  model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])

history = model.fit(train_dataset, validation_data=val_dataset, 
                   epochs=EPOCHS, callbacks=[mdckpt])

Using handle_name_exist_issue solves the problem on Colab (GPU, TF 2.8) but not on Kaggle (TPU, TF 2.4.1). I haven't tested on Colab TPU or with other TF versions.
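
To check whether the renaming actually removed the collisions, one can count repeated weight names before saving. This is a minimal sketch, assuming the error is raised because two weights end up with the same name in the HDF5 file; find_duplicate_names is a hypothetical helper, not part of TensorFlow or Keras.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return {name: count} for every name that occurs more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Toy input standing in for [w.name for w in model.weights].
example = ["stem/kernel:0", "head/kernel:0", "stem/kernel:0"]
duplicates = find_duplicate_names(example)
print(duplicates)
```

With a real model, pass [w.name for w in model.weights]; an empty result would suggest the collision comes from somewhere else.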

Have you faced such issues with these models? I also tried other TF-Hub models, and they work as expected. Do you have any suggestions for a general solution for using the ConvNeXt TF-Hub models on TPU?

Sayak Paul · Answer 1 · Sun Aug 28 2022 18:41:11 GMT+0800 (China Standard Time)

Better raised with the TF Hub team.

Mohammed Innat · Answer 2 · Sun Aug 28 2022 19:12:29 GMT+0800 (China Standard Time)

Thanks for the suggestion.
I was reaching out because you are the main contributor of the TF-Hub ConvNeXt models, and the scripts you shared in this repo can also be used to reproduce the error. It doesn't seem to be a general TF-Hub issue, because other TF-Hub models work fine, for example efficientnet-v2. Anyway, I will re-post the ConvNeXt model issue to the TF Hub team.

Sayak Paul · Answer 3 · Sun Aug 28 2022 19:56:29 GMT+0800 (China Standard Time)

Usually, if you are using TPUs the cache directory for TF Hub needs to be a GCS bucket. But you mentioned EfficientNet is working, so I am not very sure.
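
A minimal sketch of pointing the TF Hub cache at a bucket, assuming the TFHUB_CACHE_DIR environment variable is used for this; the bucket path is a placeholder, not a real location.

```python
import os

# Hypothetical bucket path; replace with a GCS location you can write to.
CACHE_DIR = "gs://my-bucket/tfhub-cache"

# Set this before the first hub.load / hub.KerasLayer call so that
# downloaded models are cached in the bucket rather than on local disk.
os.environ["TFHUB_CACHE_DIR"] = CACHE_DIR
```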

Here's an example where I used TPUs and TF-Hub models: https://github.com/sayakpaul/FunMatch-Distillation/blob/main/train_bit.ipynb.

Mohammed Innat · Answer 4 · Mon Sep 05 2022 14:04:03 GMT+0800 (China Standard Time)

@sayakpaul
I think I didn't explain the exact issue well, so I prepared reproducible code. Could you please check?
https://colab.research.google.com/drive/1t4_V03l9nv0cx_Kxm0gEHXiZoVftXwtO?usp=sharing

Sayak Paul · Answer 5 · Mon Sep 05 2022 14:05:22 GMT+0800 (China Standard Time)

I am sorry, but I won't be able to look into it as I am currently a bit overwhelmed at work.

Mohammed Innat · Answer 6 · Mon Sep 05 2022 14:06:59 GMT+0800 (China Standard Time)

Of course, please take your time. :)