negrinho / deep_architect

A general, modular, and programmable architecture search framework

Home Page: https://deep-architect.readthedocs.io/en/latest/


Some issues when combining `deep_architect` and `ray.tune`

iacolippo opened this issue

Hi, first of all, I'd like to thank you for building and releasing deep_architect.

I am opening this issue because I'd like to use deep_architect together with ray.tune to get the best of both worlds, but I encountered some issues. Feel free to close this if you think it is out of the scope of the project.

My goal is to combine the sampling capabilities of deep_architect with the multiprocessing and logging tools of ray and ray.tune. Therefore I'm using tune.run and tune.Trainable with the searchers, helpers, and modules of deep_architect.

If I write my code with the call to the sampling function inside the `_setup` method of a `tune.Trainable`

https://gist.github.com/iacolippo/1262c8afbfd9f5e491add5fbae105afa (line 124)

then I have an issue with ray's (tensorboard) logging. I'd say this is not an issue with deep_architect, and it shouldn't be too hard to fix in the source code of ray if need be.

If I write my code the way ray wants it (`config["model"]` is the model object, in this case a `PytorchModel` from deep_architect), then I get a different error:

`RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment`

https://gist.github.com/iacolippo/3f815fa90c254f7a065bdc446406233a (note that the () disappeared at line 124)

This might be an issue with deep_architect and multiprocessing, or with Pytorch itself; I don't know, as I didn't dig into it much for lack of time. Here is the traceback.

traceback.log
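
For what it's worth, the error message itself can likely be reproduced in isolation: Pytorch currently refuses to deepcopy non-leaf tensors, and ray.tune deep-copies the trial config, so a config entry holding a model with graph state would trigger it. A minimal sketch of what I believe the mechanism is (my guess, not verified against the traceback):

import copy
import torch

w = torch.randn(3, 3, requires_grad=True)  # a graph leaf: created by the user
y = w * 2                                  # a non-leaf: produced by an op

copy.deepcopy(w)  # fine
copy.deepcopy(y)  # RuntimeError: Only Tensors created explicitly by the user
                  # (graph leaves) support the deepcopy protocol at the moment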

I am using

-e git+git@github.com:negrinho/deep_architect.git@3427c5d45b0cbdc9c2fe1f4e5213f6961ef41749#egg=deep_architect
ray==0.8.4
torch==1.5.0
torchvision==0.6.0a0+82fd1c8

Stay safe!

Hey there - this seems like a problem with Ray's documentation being unclear.

What if you just did:

import torch
import torch.nn as nn
import torch.optim as optim
from ray import tune

class SimpleClassifierTrainable(tune.Trainable):
    def _setup(self, config):
        use_cuda = torch.cuda.is_available()
        self.device = torch.device("cuda" if use_cuda else "cpu")
        self.batch_size = config["batch_size"]
        self.learning_rate = config.get("lr", 0.01)
        # get_dataloaders and sample_model are the helpers from your gist
        self.train_loader, self.val_loader = get_dataloaders(self.batch_size)
        ##############################
        # CREATE MODEL HERE
        model = sample_model(in_features=784, num_classes=10)
        self.model = model.to(self.device)
        ##############################
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(self.model.parameters(),
                                    lr=self.learning_rate)

Hi Iacopo. Apologies for the delay. Unfortunately, I haven't been able to dedicate much time to DeepArchitect lately, but I'm looking to resume soon. I'm curious how far you got with DeepArchitect in your work. I'm not familiar with Ray, but I'm happy to integrate some functionality, as it seems widely adopted now. I don't see any inherent problems in using DeepArchitect with Ray, provided that Ray does not need too much information about the workload it is running (e.g., the exact architecture).

Hi @negrinho

No need to apologize :-) I didn't have much time to work on this either.

The idea would be to use DeepArchitect functions as a sampler for a Pytorch (or TF) model in the tune.run parameter config (see here: https://gist.github.com/iacolippo/3f815fa90c254f7a065bdc446406233a#file-ray_deep_architect_ex2-py-L201). This would make it really easy to scale an architecture search from a single machine to a cluster. A rough sketch of the pattern I have in mind is below.
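
Something like this (a sketch under my assumptions: sample_model is the DeepArchitect-based sampler from the gist, and passing a zero-argument constructor instead of a live model should sidestep the deepcopy issue above):

from ray import tune

config = {
    "batch_size": 64,
    # tune.sample_from re-evaluates the lambda for every trial, so each
    # trial draws a fresh architecture; wrapping the result in a
    # zero-argument constructor defers model creation to the worker.
    "model_fn": tune.sample_from(
        lambda spec: (lambda: sample_model(in_features=784, num_classes=10))),
}
# inside _setup: self.model = config["model_fn"]().to(self.device)
tune.run(SimpleClassifierTrainable, config=config, num_samples=8)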

I will have a person working on a closely related project starting in October, so I will hopefully be able to give more detailed information soon.