markovmodel / adaptivemd

A Python framework to run adaptive Markov state model (MSM) simulations on HPC resources


Simulation / Model / Analysis Workflow

jhprinz opened this issue · comments

@nsplattner, @thempel, and I were discussing the generally recommended flow of data; this is somewhat related to the questions in #23.

Question

In #23 I asked about the structure of reduced trajectories (multiple files...) so that PyEMMA or another analysis tool can always be used. Now, with the new directory approach there is no fixed structure, and hence the framework cannot guess what to do with the trajectories: which files in the directory to use, etc. Still, the Trajectory objects will have information about strides, but I guess that an engine will subclass Trajectory to add whatever information is needed for restarts with the engine's particular way of storing things.

This means that an engine writing the files in a trajectory folder could also add information about filenames, etc. to the Trajectory object it returns. We could agree that an engine needs to provide functions or bash snippets to extract a frame from such a trajectory. A trajectory knows its generating engine, so a trajectory would have access to code that can extract frames, etc.

So, in theory it would be possible to write the trajectory analysis independent of the engine that generated the data. That was my original approach, but I guess it does not reflect the way things are currently done by people. Everyone wants something specific for whatever reason, and so the trivial solution is that

1. Trajectory generation and trajectory analysis go in pairs.

You need to pass exactly the files PyEMMA needs and tell PyEMMA about the stride, etc. you used to generate them. The downside is that the code becomes less reusable and hence easier to screw up.
This is easy because everyone writes their own code.
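
For illustration, option 1 in plain PyEMMA could look like the sketch below. The file paths, topology, and stride value are made up, and you have to keep them in sync with the engine setup by hand:

import pyemma

# the files and the generation stride must be known and passed by hand
files = ['0000001/reduced1.dcd', '0000002/reduced1.dcd']
generation_stride = 10  # stride the engine used when writing these files

feat = pyemma.coordinates.featurizer('protein.pdb')  # reduced topology
feat.add_backbone_torsions()

data = pyemma.coordinates.load(files, features=feat)

# the stride is not stored anywhere, so lag times etc. must be converted
# to physical time manually using generation_stride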

2. Write engine-specific functions to read trajectories into PyEMMA

Hmmm, that would mean adding analysis-specific code to the engine, and I would really like to keep these separate. Still, it could make sense to have functions that let you get certain files for certain purposes:

t = Trajectory(...)
reduced_traj = engine.get_reduced(t)  # find the file for the reduced traj
full_traj = engine.get_full(t)  # find the file for the full traj

# this would be the normal way and you need to know `reduced1.dcd` as filename
reduced_trajs_for_analysis = project.trajectories.all.to_path('reduced1.dcd')
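
For illustration, one way such engine-specific lookup functions could be implemented (a sketch, not implemented; the output_files mapping and the traj.location attribute are assumptions):

import os

class OpenMMEngine:
    # hypothetical mapping from logical output names to the files this
    # engine writes into each trajectory folder
    output_files = {'full': 'master.dcd', 'reduced': 'reduced1.dcd'}

    def get_reduced(self, traj):
        # traj.location is assumed to point at the trajectory folder
        return os.path.join(traj.location, self.output_files['reduced'])

    def get_full(self, traj):
        return os.path.join(traj.location, self.output_files['full'])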

3. Use feature trajectories

This is what we discussed, and it could make sense. Instead of rewriting the PyEMMA input, you need to write an engine-specific featurizer, which could be much simpler. It will also cache features for all trajectories, which is useful if these are expensive to compute but cheap to store.

It requires an intermediate featurization step, but then you just pass the featurized trajectories to PyEMMA.

With this approach we still need to figure out where to store the feature trajectories. They could go in the trajectory folder, since that needs to exist before you can compute features.
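
A sketch of the two-step variant, assuming features are cached as .npy files inside the trajectory folder (paths and names are made up):

import numpy as np
import pyemma

feat = pyemma.coordinates.featurizer('protein.pdb')
feat.add_backbone_torsions()

# step 1: featurize once and cache inside the trajectory folder
X = pyemma.coordinates.load('0000001/protein.dcd', features=feat)
np.save('0000001/feature-torsions.npy', X)

# step 2: the analysis only ever touches the cached features
Y = np.load('0000001/feature-torsions.npy')
tica = pyemma.coordinates.tica([Y], lag=10)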

IDEAS?

With the output dictionary I just suggested in #23, shouldn't it be possible to run an adaptive simulation by calling the PyEMMA analysis on a specific trajectory type, e.g. modeller.task_run_msm_files(list(project.trajectories['protein']), ...)? It would need to back-map the stride to the full-atom trajectories, which should be marked as such (maybe by flagging them master: True in the output dictionary).
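
An output dictionary with such a master flag could look roughly like this (keys and values are assumptions, following the discussion in #23):

outputs = {
    'master':  {'filename': 'master.dcd',  'stride': 100, 'master': True},
    'protein': {'filename': 'protein.dcd', 'stride': 10,
                'selection': 'protein'},
}

# back-mapping: frame i of the strided 'protein' trajectory corresponds
# to MD step i * stride, from which the full-atom frame can be located
md_step = 42 * outputs['protein']['stride']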

I would suggest using feature trajectories and engine-independent analysis options. Since this will be done with MDTraj and PyEMMA, there is no reason to make it engine-dependent. For storing the feature trajectories I suggest simply generating feature directories, analogous to the trajectory directories. The feature trajectories should be stored such that it is clear from the name a) what the original trajectory was and b) that it is a feature trajectory. Since there are many possible feature trajectories for each dataset, the names can be generic (e.g. feature_1, feature_2, ...).

For adaptive sampling, the only thing to consider here is that the featurization task needs to be carried out whenever new data is available, so it should be part of the adaptive loop.

> I would suggest using feature trajectories and engine-independent analysis options.

Well, I thought about that, but I would not hard-code it. A trajectory is already a feature trajectory, so why not use it directly when computing the features is fast and storing them is very costly?

In #28 the approach is the following:

A trajectory object is a reference to an engine and a folder that contains a set of trajectory-like files. The engine knows basic properties of these files, like stride, atom subsets, etc. All of the outputs are generic across engines: in what way you specify the atom subset does not matter, only that there are subset and full-atom outputs, and that some of them have a stride. The file format is arbitrary, too.

So what we could do (not yet implemented) is to also allow adding feature trajectories to these output formats. Then an example trajectory folder could look like this:

0000001/
  restart.npz  # the restart to continue. Contains full coords, velocities, etc...
  master.dcd  # `master`: stride 100, full atoms
  protein.dcd  # `protein`: stride 10, only protein
  feature-torsions.npy  # `tors`: a pyemma generated feature trajectory with torsions
  feature-backbone.npy  # a pyemma generated feature trajectory with c_alpha distances

All of that already works, including taking care of correct strides, patching trajectories together, etc.

All of the trajectories have a name associated with them for reference (master, protein, tors, etc.),
and you can select one of these to be used in PyEMMA. So if you select 100 of these trajectories and pick protein, then PyEMMA will be told to load all 100 protein.dcd files.
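
Selecting by name could then look like this (a sketch; the file() call is an assumption, not the final API, and project is the usual adaptivemd project object):

import pyemma

# pick 100 trajectories and resolve the logical name 'protein' to files
trajs = list(project.trajectories)[:100]
files = [t.file('protein') for t in trajs]  # e.g. 0000001/protein.dcd, ...

# pyemma is then told to load exactly these 100 protein.dcd files
src = pyemma.coordinates.source(files, top='protein.pdb')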

In the current case you can then (on top) use a featurizer to your liking, e.g. take all backbone torsions. This of course would not make sense for feature trajectories, but still, you get the idea.

Now, if you implemented feature trajectories, they could only be used as-is in PyEMMA.

Last thing, about adding features on the fly: we could add the possibility to update a trajectory to have more features, but it would involve rerunning all existing trajectories. That will be a lot of tasks, but there is no conceptual problem. You create the feature as you would after a normal trajectory run and then just replace the old trajectory file with an updated one, like you do when you extend a trajectory. That is not too difficult.

Example

engine.add_output_type(...)
feat = {...}  # some (pyemma) featurizer description
engine.add_feature(feat)
task = engine.run(trajectory)  # will produce all you want

# updating all existing trajectories would work like this
tasks = project.trajectories.all.add_feature(feat)
project.queue(tasks)

We could even say that a Featurizer is a general task generator that adds a feature trajectory to a trajectory. That could be PyEMMA or something else. That would be the most general approach I can think of.
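
Stripped down to a local function instead of a remote task, such a general Featurizer could look like this (a sketch; all names are assumptions):

import os
import numpy as np

class Featurizer:
    # a general task generator: adds a feature trajectory to a trajectory
    def __init__(self, name, fn):
        self.name = name  # output type name, e.g. 'tors'
        self.fn = fn      # any callable mapping a trajectory file to an array

    def run(self, traj_file):
        # compute the features and cache them next to the trajectory
        X = self.fn(traj_file)
        out = os.path.join(os.path.dirname(traj_file),
                           'feature-%s.npy' % self.name)
        np.save(out, X)
        return out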

I think the case to consider here is not so much when computing features is fast. If you work with larger systems/datasets and you want to work with residue minimum-distance pairs or contacts, calculating the features becomes very costly and you would prefer storing them. I would not say storing the features is costly; it takes up some additional disk space, but for small projects it's not an issue, and for large projects it avoids hours or days of recomputing all features.

If you allow adding feature trajectories to the output format this would solve the problem.

I'm not sure from the statement above why calculating features should not be engine-independent. This is not related to the MD codes, but rather depends on the analysis tools (MDTraj or PyEMMA). Or are you thinking about different output trajectories of the MD engine, such as trajectories with different selections and strides? This can be done in the MD code in many cases, but it can also be done with MDTraj in all cases, so using MDTraj it could be implemented generically.

Look, all I am saying is that in the cases I have seen, the feature trajectories are way larger than the plain trajectories, by a factor of 2 or even more. Reading from disk is slow, so in that case reading the trajectory and computing the features in memory is much faster and saves lots of disk space.

> but for small projects it's not an issue, and for large projects it avoids hours or days of recomputing all features

Well, it depends on the type of features; I have seen both cases. Contact maps, yes. Other simple ones, maybe not.

So why strictly say we always have to use intermediate feature trajectories, even for small projects where it probably does not matter? I don't understand why you would insist on that.

I said that it makes sense to have this as an option by using feature trajectories, which is conceptually almost what we have now: just add an additional output type format (instead of dcd etc., add a numpy feature...) and we have both choices. Use trajectories and compute features along the way, or do it in two steps.

This seems optimal to me. Forcing intermediate feature trajectories seems overly complicated for simple systems.

And of course the output types are not engine-specific. They just state: this file is a dcd with stride x and full atoms; this file is a numpy array with these features; etc. How to generate these files is for the engine to figure out.

> I'm not sure from the statement above why calculating features should not be engine-independent

I did not say that, and of course it is engine-independent. But when you run an engine you might want to directly save feature trajectories with it instead of doing that again in a second step, which you could. I said that a trajectory is in essence already a feature trajectory, with all atom coordinates as features. That's it. Treat normal and feature trajectories the same.

I think this is a misunderstanding. What I was saying is not that intermediate feature trajectories always have to be used. I just wanted to point out that it's good to have them available, and to explain the use case where they are needed (features which take long to compute).

Okay, then we completely agree. Sorry. I think this is practically a must-have option, but still an option. It seemed you wanted to have a strict

  1. run trajectory
  2. then always create feature trajectories
  3. use pyemma only on features

That would also be a reasonable choice, but I think it's better to be able to skip step 2 if you want to.

Yes, I think we agree on this: step 2 is optional. There are different ways of implementing this:

a) featurization as a separate task which only computes feature trajectories and stores them
b) featurization as part of the modelling, with options to store features and reuse them

I think it would make sense to use a) in cases where you want to store the features. If the features are not stored, they can just be part of the PyEMMA analysis.
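
In PyEMMA terms, the two variants differ only in whether the feature arrays are written to disk (a sketch; paths are made up):

import numpy as np
import pyemma

feat = pyemma.coordinates.featurizer('protein.pdb')
feat.add_backbone_torsions()

# a) a separate featurization task stored the features; the analysis
#    reads only the cached arrays
data_a = [np.load('0000001/feature-torsions.npy')]
tica_a = pyemma.coordinates.tica(data_a, lag=10)

# b) featurization is part of the PyEMMA pipeline; features are computed
#    on the fly and never stored
src = pyemma.coordinates.source(['0000001/protein.dcd'], features=feat)
tica_b = pyemma.coordinates.tica(src, lag=10)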