markovmodel / adaptivemd

A python framework to run adaptive Markov state model (MSM) simulation on HPC resources

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Tutorial 5 not working

thempel opened this issue · comments

Using one's own generator class, i.e. for implementing a pyemma analysis script, does not work as described in notebook 5. First, their are several bugs in the notebook itself. Substituting the given my_generator.py by e.g. (just as a non-sense minimal example, important are the imports...)

from adaptivemd.generator import TaskGenerator
from adaptivemd.task import PythonTask

class MDTrajFeaturizer(TaskGenerator):
    def __init__(self):
        super(MDTrajFeaturizer, self).__init__()

    @staticmethod
    def then_func(project, task, data, inputs):
        # add the output for later reference
        project.data.add(data)

    def execute(self):

        t = PythonTask(self)
        t.call(
            my_script,
            param1='', 
            param2=''
        )

        return t
        
def my_script(param1, param2):
    return {"whatever you want to return"}

and, inside the notebook, registering it in the db

from my_generator import MDTrajFeaturizer
project.storage.update_storable_classes()

seems to fix some of the errors. If I'm not wrong, also the file itself needs to be added to the project, which I did with

File('file://my_generator.py').load()

Now, according to project.storage.simplifier.class_list, the class is registered. Still, when I start a worker with a task from this generator, I get

...
  File "/storage/mi/thempel/adaptive_MD_test/adaptivemd/adaptivemd/mongodb/dictify.py", line 250, in build
    obj['_cls'])
ValueError: Cannot create obj of class `MDTrajFeaturizer`.
Class is not registered as creatable! You might have to define
the class locally and call `update_storable_classes()` on your storage.

Any ideas? @marscher @jhprinz @euhruska

what is is the output of adaptivemd.mongodb.base.StorableMix.descendants()? I'm asking because update_storable_classes should include all children of StorableMixin.

The new class is in there.

from adaptivemd.mongodb.base import StorableMixin
StorableMixin.descendants()
[adaptivemd.mongodb.object.ObjectStore,
 adaptivemd.mongodb.file.DataDict,
 adaptivemd.file.Location,
 adaptivemd.file.Action,
 adaptivemd.generator.TaskGenerator,
 adaptivemd.task.BaseTask,
 adaptivemd.brain.Brain,
 adaptivemd.resource.Resource,
 adaptivemd.engine.engine.Frame,
 adaptivemd.engine.engine.OutputTypeDescription,
 adaptivemd.model.Model,
 adaptivemd.logentry.LogEntry,
 adaptivemd.worker.Worker,
 adaptivemd.mongodb.file.FileStore,
 adaptivemd.file.File,
 adaptivemd.file.JSONFile,
 adaptivemd.file.Directory,
 adaptivemd.engine.engine.Trajectory,
 adaptivemd.file.AddPathAction,
 adaptivemd.file.FileAction,
 adaptivemd.file.Touch,
 adaptivemd.file.MakeDir,
 adaptivemd.file.FileTransaction,
 adaptivemd.file.Remove,
 adaptivemd.file.Copy,
 adaptivemd.file.Transfer,
 adaptivemd.file.Link,
 adaptivemd.file.Move,
 adaptivemd.generator.PythonRPCTaskGenerator,
 adaptivemd.engine.engine.Engine,
 my_generator.MDTrajFeaturizer,
 adaptivemd.analysis.analysis.Analysis,
 adaptivemd.analysis.pyemma.emma.PyEMMAAnalysis,
 adaptivemd.engine.openmm.openmm.OpenMMEngine,
 adaptivemd.task.Task,
 adaptivemd.task.PrePostTask,
 adaptivemd.task.EnclosedTask,
 adaptivemd.engine.engine.TrajectoryGenerationTask,
 adaptivemd.task.MPITask,
 adaptivemd.task.DummyTask,
 adaptivemd.task.PythonTask,
 adaptivemd.engine.engine.TrajectoryExtensionTask,
 adaptivemd.resource.AllegroCluster,
 adaptivemd.resource.LocalResource]

Hi, perhaps I can help...

The way saving and loading of classes works is that we store only the necessary parameters and then create a new object using either the constructor or the custom method from_dict if recreation is more complicated.

This is a little different to the __setstate__ and __getstate__ approach from pickling. I usually assume immutable objects and hence using the exact same initial parameters much result in an identical object (besides the __uuid__)

Now, the class itself is not stored, just its name. And when you load a file then you need a list that maps names back to the classes. This list is stored in the storage.simplifier object which is of type StroableObjectJSON (I think or similar). This knows how to convert an object into a dict that can be saved using JSON.

The usual approach in the json package would be to register a special method for each type. I decided to use a custom method in the class that is to be saved instead!

The list of class name to real class is derived by searching for all existing subclasses of StorableMixin. This is one reason why you really need to subclass from this, even if you do not need any functionality like default from_dict. The other is the generation of the unique ID in __uuid__ so don't forget to use super.

The list is created when the storage.simplifier is created, which is when you create the Storage.

This part is clear (I presume from the detailed question).

When you now start a worker the same rules apply. You can only reconstruct classes that are loaded in your python instance. The worker does not import any files other than the ones in adaptivemd.

Solutions

  1. Add your custom generator class to the adaptivemd package. This was my initial idea. And if you just want to run a function externally I wrote the PythonTask helper to get around this limitation. It will indeed figure out, where the function is stored and copy the file to the remote machine or require the additional package to be loaded (provided it is installed). This makes sense if your generator is general enough and other people could profit from it too.

  2. Load the custom generator in the adaptivemdworker. Kind of like building a custom worker that can use this generator. I think that is not a nice solution.

  3. Add an option to the adaptivemd worker to import additional packages/files. This is a nice workaround a might be flexible. It still requires the user to install their own custom classes in files on the cluster. I think it makes sense to have this option anyway and it should be easy to implement.

  4. Similar to @thempel s initial idea to use .load(), but write a special file type that, when added to a project will be loaded from a worker automatically. That I think would be pretty cool and flexible.

Example
# custom source files
source = PythonSourceFile('file://my_generator.py').load()  # subclass of `File`
project.storage.files.add(pkg)  # save in project

The worker will initially scan for all PythonSourceFiles in the project and load these before initializing the project. All the logic for creation etc should already exist, so it is not too hard to implement.

I forgot. Will have a look at the tutorial to see what is wrong.

Hi, could we just try to get the notebooks working for now?

Will fix this this evening. Sorry!

Okay, I was just looking at this again. I think it needs to be clear that this notebook is not just executable and the file created is a stub, that needs to be completed witht the codesnippets provided in the notebook itself.

Then, it needs to be added to the installed project and also installed as well on the cluster where you want to try it. All of this is mentioned in the notebook. And the reason that you cannot just run the example and write your own class is also explained.

So, did you do all those steps and did it fail? Then I think we should fix this and see what the problem is. I also explained above what we would have to do to make the examples (theoretically) run without having to change the source of the package, but we should decide on that (see above).

I am happy to help to fix this.

I'm well aware that this is not a just-press-the-button notebook. I tried to set-up and run a minimal example class and script. I started from the first example in notebook 5 and added what I thought was missing but could not properly register it with the db.
Could you give a short line of code for what it means to
a) install it on the cluster and
[b) add the class to the project]?
I think this is where it failed for me.

From a didactic perspective, it would be great to have one completely working example which does run upon pressing the button. Maybe at the end of the notebook where people should have read the more theoretic part (by the way, it's great that this is dealt with in so much detail!). This is where potential users could start implementing their own stuff by copy-pasting and modifying.