Becksteinlab / GromacsWrapper

GromacsWrapper wraps system calls to GROMACS tools into thin Python classes (GROMACS 4.6.5 - 2024 supported).

Home Page: https://gromacswrapper.readthedocs.org

Split the plugins into a separate repository

pslacerda opened this issue · comments

We can move the plugins to a separate repository; it will be cleaner and will make it easier to port just the main code to Python 3.

Based on namespace packages, we can move the plugins into a separate repository while keeping them in the same gromacs namespace.

  1. Remove the gromacs/analysis/plugins/ directory
  2. Delete the line "import plugins" from gromacs/analysis/__init__.py
  3. Install the whole package
  4. Install the plugins package

Now we can

import gromacs.plugins

from a separate repository =). Or is it better to keep the gromacs.analysis.plugins namespace, since all plugins are for analysis?
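As a quick illustration of why this works, here is a minimal sketch of PEP 420 implicit namespace packages (Python 3.3+; in the Python 2 era you would instead need pkgutil- or pkg_resources-style declarations). Two independently installed distributions can share the gromacs top-level name as long as neither ships a gromacs/__init__.py. The dist1/dist2 layout and the name attribute are purely illustrative:

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
# Distribution 1 provides gromacs.core, distribution 2 provides gromacs.plugins.
# Neither directory contains a gromacs/__init__.py -- that is precisely what
# makes "gromacs" an implicit namespace package spanning both of them.
for dist, sub in [("dist1", "core"), ("dist2", "plugins")]:
    pkg = os.path.join(root, dist, "gromacs", sub)
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("name = %r\n" % sub)

sys.path[:0] = [os.path.join(root, "dist1"), os.path.join(root, "dist2")]
core = importlib.import_module("gromacs.core")
plugins = importlib.import_module("gromacs.plugins")
print(core.name, plugins.name)
```

Note that this only works if no regular gromacs package (one with an __init__.py) is found anywhere on sys.path, since a regular package always wins over namespace portions.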

The main purpose of the plugins is to enable parallel analysis, right? I heard about some people who parallelized frame-by-frame analysis by splitting the trajectory, submitting a job for each part, and then combining the results. They used Spark to do it across multiple computers, but I won't go that far:

gmx rmsf -b    0 -e 1000 -o rmsf_0 &
gmx rmsf -b 1001 -e 2000 -o rmsf_1 &

With this very simple trick it is possible to enable parallel analysis. The operating system takes care of allocating the resources intelligently. In most cases, combining the results is just a simple concatenation, as for the .xvg files above. If I remember correctly from the mailing lists, the GROMACS team is also pursuing trivial analysis parallelization like this by default.

If we do:

def figure_out_length(f):
    # stub: would inspect the trajectory file f to find its last frame time
    return 1000

def parallel_analysis(tool, njobs, **kwargs):
    """Split the -b/-e time range into njobs parts; return one kwargs dict per part.

    `tool` names the GROMACS tool to run (unused here, kept for the caller).
    """
    begin = kwargs.get('b', 0)
    end = kwargs.get('e', None)
    if end is None:
        end = figure_out_length(kwargs['f'])

    step = (end - begin) // njobs  # length of each partition
    kwargs_list = []
    # the range stride is step + 1, so consecutive partitions do not overlap
    # (note: this leaves a one-unit gap between them, as in the output below)
    for count, part_begin in enumerate(range(begin, end, step + 1)):
        part_end = min(part_begin + step - 1, end)
        part_kwargs = kwargs.copy()
        part_kwargs['b'] = part_begin
        part_kwargs['e'] = part_end
        # expand '%d' placeholders (e.g. 'rmsf%d.xvg') with the partition index
        for key, value in kwargs.items():
            if isinstance(value, str) and '%d' in value:
                part_kwargs[key] = value % count
        kwargs_list.append(part_kwargs)
    return kwargs_list

And then:

>>> parallel_analysis('rmsf', 8, f='traj.xtc', o='rmsf%d.xvg', b=100, input=['3', '3'])
[{'b': 100, 'e': 211, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf0.xvg'},
 {'b': 213, 'e': 324, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf1.xvg'},
 {'b': 326, 'e': 437, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf2.xvg'},
 {'b': 439, 'e': 550, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf3.xvg'},
 {'b': 552, 'e': 663, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf4.xvg'},
 {'b': 665, 'e': 776, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf5.xvg'},
 {'b': 778, 'e': 889, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf6.xvg'},
 {'b': 891, 'e': 1000, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf7.xvg'}]

We get a list of argument dicts, one per partition of the full trajectory. If we run the partitioned analyses in parallel with each other, we keep all processors busy until the last analysis finishes. Massive trivial parallelization. =) =)
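Dispatching such a list of argument dicts could be sketched with a small worker pool. Here run_one() is only a stub standing in for the real tool invocation (e.g. gromacs.rmsf(**kw) in GromacsWrapper); threads are enough in practice, because each call would mostly wait on a gmx subprocess anyway:

```python
from concurrent.futures import ThreadPoolExecutor

def run_one(kw):
    # stub for the real call, e.g. gromacs.rmsf(**kw); here we just echo
    # the partition boundaries and output name
    return (kw['b'], kw['e'], kw['o'])

def run_parallel(kwargs_list, nworkers=4):
    # fan the per-partition argument dicts out over a small worker pool;
    # map() preserves the order of kwargs_list in the results
    with ThreadPoolExecutor(max_workers=nworkers) as ex:
        return list(ex.map(run_one, kwargs_list))
```

A larger nworkers is not always better: as noted further down in this thread, competing disk access limits how many workers are actually useful.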

What remains is an .xvg joiner. It is far from being my favorite file format, but it is just a matter of writing the first file in full and then writing the remaining ones with their headers removed.

Some analyses, such as RMSD and RMSF, require the same fixed reference (i.e., the -s option) to make sense. For other analyses a reference isn't needed.

I am totally happy to move the plugins into their own namespace. Namespace packages are a bit tricky (I think @dotsdl can attest from datreant). As far as I know, you'd then need to package everything else under a second namespace, e.g. gromacs.core. We might still be able to monkey-patch the tools into the top level, though. So maybe namespace packages would be useful for GW. We could then also make the fileformats a separate package.

Regarding analysis and parallel analysis: we are almost exclusively using MDAnalysis nowadays, combined with pandas for time-series analysis and matplotlib/seaborn for plotting.

Parallel analysis is still tricky. The blocked-trajectory scheme is solid in principle. The main problem seems to be competing disk access – it tends to kill performance and sets a limit on how many workers you can sensibly use.

That said, I am more than happy to include anything in GW that seems to work well – so if you have a suggestion, go for it :-).

Probably that is why nobody has parallelized the GROMACS analysis tools. RMSF, for example, also seems I/O bound, at least here.

Maybe namespace packages are tricky because every package needs to declare itself as such? That isn't a problem here, since we would have only one or two separate packages (gromacs.analysis and gromacs.fileformats). For the analysis plugins we could instead skip metaclasses and just inspect BasePlugin.__subclasses__(). Or we could put them in a different namespace (e.g. gromacsplugins) and write a metaclass that automatically registers analysis plugins inside gromacs while excluding BasePlugin itself:

class PluginRegister(type):
    def __init__(cls, name, bases, nmspc):
        super(PluginRegister, cls).__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            # first class seen (BasePlugin): create the registry,
            # but do not register the base class itself
            cls.registry = set()
        else:
            cls.registry.add(cls)

class BasePlugin(object):
    __metaclass__ = PluginRegister
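The registry pattern sketched above only works on Python 2 as written (Python 3 ignores the __metaclass__ attribute and uses the metaclass= keyword instead). A Python 3 spelling, with RMSF and RMSD as placeholder plugin names, could look like this:

```python
class PluginRegister(type):
    def __init__(cls, name, bases, nmspc):
        super().__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            cls.registry = set()      # created once, on the base class
        else:
            cls.registry.add(cls)     # every subclass registers itself

class BasePlugin(metaclass=PluginRegister):
    pass

# hypothetical plugins, for illustration only
class RMSF(BasePlugin):
    pass

class RMSD(BasePlugin):
    pass

print(BasePlugin.registry)  # contains RMSF and RMSD, but not BasePlugin
```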

I vote for namespace packages!

That example doesn't really seem I/O bound!

I definitely support some form of namespace packaging. The S/O post http://stackoverflow.com/questions/1675734/how-do-i-create-a-namespace-package-in-python/1676069#1676069 makes it look pretty straightforward, and it can be done in a way that is compatible with both Python 2 and 3.3+.

I'd like to hear @dotsdl 's opinion because he went through this for datreant, see datreant/datreant#35.

What packages would we have?

  • gromacs.core: main functionality (tools, config, utilities, ...); creates gromacs.grompp etc. by monkey patching
  • gromacs.fileformats: file format readers like the XVG reader, which are independent from the rest; they might have dependencies on utilities, so it might not be easy to make this fully independent
  • gromacs.recipes: need to find a better name, but basically things like setup, cbook, scaling, ... anything else that can be considered building blocks for workflows but which might not be used by everyone (e.g., many power users like to write their own system setup code); most of the gw-* scripts would be installed by this package.
  • gromacs.management: need a better name... but basically manager and qsub: these are lightweight attempts at workflow management. qsub is being used e.g. in MDPOW and needs to remain available.
  • gromacs.plugins: the legacy analysis plugins; I am not sure if this is even used by anyone, so I would not spend too much time making it nice... as long as it works it will be fine. (I use gw-fit_strip_trajectories.py, which uses one of the plugins, but for pretty much everything else I have been using MDAnalysis.)

EDIT: Perhaps we shouldn't overdo it with packages that only contain a few modules. Something along

  • gromacs.core (including fileformats)
  • gromacs.toolbox (including setup, cbook, manager, qsub)
  • gromacs.analysis (or plugins?)

would work?

We can deprecate gromacs.analysis. Or, if it's legacy, we just drop it.

recipes, fileformats, management, and analysis may each deserve a separate package. But tools, config, and utilities can stay in the same repository and don't need monkey patching or anything of the sort.

Like gnuplot, XVG is an almost complete language, while GROMACS uses it very narrowly: to plot one or two series over time. So almost every XVG reader is incomplete or tool-specific, except xmgrace itself.

Did you see the new_core branch? I'll open a PR. Also, gmxscript has one useful utility, MDPReader, which can extend basic MDP files on the fly.

grompp(
  f=MDP['sd.mdp', {
    'integrator': 'steep',
    'emtol': 10.0,
    'nsteps': 10000}],
  c='ions.gro',
  o='sd.tpr'
)

Or maybe even without a template file:

MDP[{
  'integrator': 'steep',
  'emtol': 10.0,
  'nsteps': 10000
}]

Then a function like mdp() becomes more elegant.
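The idea of extending an .mdp template on the fly could be sketched as follows. This is only an illustration of the concept; the real MDPReader in gmxscript may work quite differently:

```python
def extend_mdp(template_text, overrides):
    # hypothetical sketch: rewrite keys of an .mdp template in place and
    # append any override keys the template does not contain
    lines, seen = [], set()
    for line in template_text.splitlines():
        key = line.split('=', 1)[0].strip() if '=' in line else None
        if key in overrides:
            line = "%s = %s" % (key, overrides[key])
            seen.add(key)
        lines.append(line)
    for key, value in overrides.items():
        if key not in seen:
            lines.append("%s = %s" % (key, value))
    return "\n".join(lines) + "\n"
```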

With these changes, GromacsWrapper looks more like a set of utilities than a complete library or framework, which is a gain in my opinion.

My vote, FWIW: keep it simple. I think cutting out complex analysis entirely is a good idea, especially since it isn't really getting any use these days. I'd rather have the library provide a Python interface to the GROMACS tools and nothing more, since we already maintain enough things these days that do just about everything else, but with more flexibility.

From my experience with datreant, I'm thinking of doing the same kind of cutting down to bare essentials there, too, since trying to do everything means maintaining lots of non-general-purpose code, and there are only so many hours in the day.

@orbeckst in that case can I focus on this for an afternoon or so? It'll be a massive PR (mostly removing an entire chunk of the library), but it will really help me finish up #44 so we can move on.

Yes, please do.

Can you dump what you cut out into a separate repo? It's going to look like a junk yard, but it will give us a way to go back to it if we ever need to (without digging into the history).