Becksteinlab / GromacsWrapper

GromacsWrapper wraps system calls to GROMACS tools into thin Python classes (GROMACS 4.6.5 - 2024 supported).

Home Page: https://gromacswrapper.readthedocs.org

Split the plugins into a separate repository

pslacerda opened this issue · comments

We can move the plugins to a separate repository; it will be cleaner and will make it easier to port just the main code to Python 3.

Based on namespace packages, we can move the plugins into a separate repository while keeping them in the same gromacs namespace.

  1. Remove the gromacs/analysis/plugins/ directory
  2. Delete the line "import plugins" from gromacs/analysis/__init__.py
  3. Install the whole package
  4. Install the plugins package

Now we can

import gromacs.plugins

from a separate repository =). Or is it better to keep the gromacs.analysis.plugins namespace, since all plugins are for analysis?
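As a quick illustration of why this works, here is a minimal sketch of PEP 420 implicit namespace packages (Python 3.3+; in the Python 2 era you would instead need pkgutil- or pkg_resources-style declarations). Two independently installed distributions can share the gromacs top-level name as long as neither ships a gromacs/__init__.py. The dist1/dist2 layout and the name attribute are purely illustrative:

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
# Distribution 1 provides gromacs.core, distribution 2 provides gromacs.plugins.
# Neither directory contains a gromacs/__init__.py -- that is precisely what
# makes "gromacs" an implicit namespace package spanning both of them.
for dist, sub in [("dist1", "core"), ("dist2", "plugins")]:
    pkg = os.path.join(root, dist, "gromacs", sub)
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("name = %r\n" % sub)

sys.path[:0] = [os.path.join(root, "dist1"), os.path.join(root, "dist2")]
core = importlib.import_module("gromacs.core")
plugins = importlib.import_module("gromacs.plugins")
print(core.name, plugins.name)
```

Note that this only works if no regular gromacs package (one with an __init__.py) is found anywhere on sys.path, since a regular package always wins over namespace portions.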

The main purpose of the plugins is to enable parallel analysis, right? I heard about some people who parallelized frame-by-frame analysis by splitting the trajectory, submitting a job for each part, and then combining the results. They used Spark to do it across multiple computers, but I won't go that far:

gmx rmsf -b    0 -e 1000 -o rmsf_0 &
gmx rmsf -b 1001 -e 2000 -o rmsf_1 &

With this very simple trick it is possible to enable parallel analysis. The operating system takes care of allocating the resources intelligently. In most cases, combining the results is just a simple concatenation, as for the .xvg files above. If I remember correctly from the mailing lists, the GROMACS team is also pursuing trivial analysis parallelization like this by default.

If we do:

def figure_out_length(f):
    # stub: would inspect the trajectory file f to find its last frame time
    return 1000

def parallel_analysis(tool, njobs, **kwargs):
    """Split the -b/-e time range into njobs parts; return one kwargs dict per part.

    `tool` names the GROMACS tool to run (unused here, kept for the caller).
    """
    begin = kwargs.get('b', 0)
    end = kwargs.get('e', None)
    if end is None:
        end = figure_out_length(kwargs['f'])

    step = (end - begin) // njobs  # length of each partition
    kwargs_list = []
    # the range stride is step + 1, so consecutive partitions do not overlap
    # (note: this leaves a one-unit gap between them, as in the output below)
    for count, part_begin in enumerate(range(begin, end, step + 1)):
        part_end = min(part_begin + step - 1, end)
        part_kwargs = kwargs.copy()
        part_kwargs['b'] = part_begin
        part_kwargs['e'] = part_end
        # expand '%d' placeholders (e.g. 'rmsf%d.xvg') with the partition index
        for key, value in kwargs.items():
            if isinstance(value, str) and '%d' in value:
                part_kwargs[key] = value % count
        kwargs_list.append(part_kwargs)
    return kwargs_list

And then:

>>> parallel_analysis('rmsf', 8, f='traj.xtc', o='rmsf%d.xvg', b=100, input=['3', '3'])
[{'b': 100, 'e': 211, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf0.xvg'},
 {'b': 213, 'e': 324, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf1.xvg'},
 {'b': 326, 'e': 437, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf2.xvg'},
 {'b': 439, 'e': 550, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf3.xvg'},
 {'b': 552, 'e': 663, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf4.xvg'},
 {'b': 665, 'e': 776, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf5.xvg'},
 {'b': 778, 'e': 889, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf6.xvg'},
 {'b': 891, 'e': 1000, 'f': 'traj.xtc', 'input': ['3', '3'], 'o': 'rmsf7.xvg'}]

We get a list of argument dicts, one per partition of the full trajectory. If we run the partitioned analyses in parallel with each other, we keep all processors busy until the last analysis finishes. Massive trivial parallelization. =) =)
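Dispatching such a list of argument dicts could be sketched with a small worker pool. Here run_one() is only a stub standing in for the real tool invocation (e.g. gromacs.rmsf(**kw) in GromacsWrapper); threads are enough in practice, because each call would mostly wait on a gmx subprocess anyway:

```python
from concurrent.futures import ThreadPoolExecutor

def run_one(kw):
    # stub for the real call, e.g. gromacs.rmsf(**kw); here we just echo
    # the partition boundaries and output name
    return (kw['b'], kw['e'], kw['o'])

def run_parallel(kwargs_list, nworkers=4):
    # fan the per-partition argument dicts out over a small worker pool;
    # map() preserves the order of kwargs_list in the results
    with ThreadPoolExecutor(max_workers=nworkers) as ex:
        return list(ex.map(run_one, kwargs_list))
```

A larger nworkers is not always better: as noted further down in this thread, competing disk access limits how many workers are actually useful.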

What remains is an .xvg joiner. It is far from being my favorite file format, but it is just a matter of writing the first file in full and then writing the remaining ones with their headers removed.

Some analyses, such as RMSD and RMSF, require the same fixed reference (i.e., the -s option) to make sense. For other analyses a reference isn't needed.

I am totally happy to move the plugins into their own namespace. Namespace packages are a bit tricky (I think @dotsdl can attest from datreant). As far as I know, you'd then need to package everything else under a second namespace, e.g. gromacs.core. We might still be able to monkey-patch the tools into the top level, though. So maybe namespace packages would be useful for GW. We could then also make the fileformats a separate package.

Regarding analysis and parallel analysis: we are almost exclusively using MDAnalysis nowadays, combined with pandas for time-series analysis and matplotlib/seaborn for plotting.

Parallel analysis is still tricky. The blocked-trajectory scheme is solid in principle. The main problem seems to be competing disk access – it tends to kill performance and sets a limit on how many workers you can sensibly use.

That said, I am more than happy to include anything in GW that seems to work well – so if you have a suggestion, go for it :-).

Probably that is why nobody has parallelized the GROMACS analysis tools. RMSF, for example, also seems I/O bound, at least here.

Maybe namespace packages are tricky because every package needs to declare itself as such? That isn't a problem here, since we would have only one or two separate packages (gromacs.analysis and gromacs.fileformats). For the analysis plugins we could instead skip metaclasses and just inspect BasePlugin.__subclasses__(). Or we could put them in a different namespace (e.g. gromacsplugins) and write a metaclass that automatically registers analysis plugins inside gromacs while excluding BasePlugin itself:

class PluginRegister(type):
    def __init__(cls, name, bases, nmspc):
        super(PluginRegister, cls).__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            # first class seen (BasePlugin): create the registry,
            # but do not register the base class itself
            cls.registry = set()
        else:
            cls.registry.add(cls)

class BasePlugin(object):
    __metaclass__ = PluginRegister
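The registry pattern sketched above only works on Python 2 as written (Python 3 ignores the __metaclass__ attribute and uses the metaclass= keyword instead). A Python 3 spelling, with RMSF and RMSD as placeholder plugin names, could look like this:

```python
class PluginRegister(type):
    def __init__(cls, name, bases, nmspc):
        super().__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            cls.registry = set()      # created once, on the base class
        else:
            cls.registry.add(cls)     # every subclass registers itself

class BasePlugin(metaclass=PluginRegister):
    pass

# hypothetical plugins, for illustration only
class RMSF(BasePlugin):
    pass

class RMSD(BasePlugin):
    pass

print(BasePlugin.registry)  # contains RMSF and RMSD, but not BasePlugin
```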

I vote for namespace packages!

That example doesn't really seem I/O bound!

I definitely support some form of namespace packaging. The S/O post http://stackoverflow.com/questions/1675734/how-do-i-create-a-namespace-package-in-python/1676069#1676069 makes it look pretty straightforward, and it can be done in a way that is compatible with both Python 2 and 3.3+.

I'd like to hear @dotsdl 's opinion because he went through this for datreant, see datreant/datreant#35.

What packages would we have?

  • gromacs.core: main functionality (tools, config, utilities, ...); creates gromacs.grompp etc. by monkey patching
  • gromacs.fileformats: file format readers like the XVG reader, which are independent from the rest; they might have dependencies on utilities, so it might not be easy to make this fully independent
  • gromacs.recipes: need to find a better name, but basically things like setup, cbook, scaling, ... anything else that can be considered building blocks for workflows but which might not be used by everyone (e.g., many power users like to write their own system setup code); most of the gw-* scripts would be installed by this package.
  • gromacs.management: need a better name... but basically manager and qsub: these are lightweight attempts at workflow management. qsub is being used e.g. in MDPOW and needs to remain available.
  • gromacs.plugins: the legacy analysis plugins; I am not sure if this is even used by anyone, so I would not spend too much time making it nice... as long as it works it will be fine. (I use gw-fit_strip_trajectories.py, which uses one of the plugins, but for pretty much everything else I have been using MDAnalysis.)

EDIT: Perhaps we shouldn't overdo it with packages that only contain a few modules. Something along

  • gromacs.core (including fileformats)
  • gromacs.toolbox (including setup, cbook, manager, qsub)
  • gromacs.analysis (or plugins?)

would work?

We can deprecate gromacs.analysis. Or, if it's legacy, we just drop it.

recipes, fileformats, management, and analysis may each deserve a separate package. But tools, config, and utilities can stay in the same repository and don't need monkey patching or anything of the sort.

Like gnuplot, XVG is an almost complete language, while GROMACS uses it very narrowly: to plot one or two series over time. So almost every XVG reader is incomplete or tool-specific, except xmgrace itself.

Did you see the new_core branch? I'll open a PR. Also, gmxscript has one useful utility, MDPReader, which can extend basic MDP files on the fly.

grompp(
  f=MDP['sd.mdp', {
    'integrator': 'steep',
    'emtol': 10.0,
    'nsteps': 10000}],
  c='ions.gro',
  o='sd.tpr'
)

Or maybe even without a template file:

MDP[{
  'integrator': 'steep',
  'emtol': 10.0,
  'nsteps': 10000
}]

Then a function like mdp() becomes more elegant.
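The idea of extending an .mdp template on the fly could be sketched as follows. This is only an illustration of the concept; the real MDPReader in gmxscript may work quite differently:

```python
def extend_mdp(template_text, overrides):
    # hypothetical sketch: rewrite keys of an .mdp template in place and
    # append any override keys the template does not contain
    lines, seen = [], set()
    for line in template_text.splitlines():
        key = line.split('=', 1)[0].strip() if '=' in line else None
        if key in overrides:
            line = "%s = %s" % (key, overrides[key])
            seen.add(key)
        lines.append(line)
    for key, value in overrides.items():
        if key not in seen:
            lines.append("%s = %s" % (key, value))
    return "\n".join(lines) + "\n"
```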

With these changes, GromacsWrapper looks more like a set of utilities than a complete library or framework, which is a gain in my opinion.

My vote, FWIW: keep it simple. I think cutting out complex analysis entirely is a good idea, especially since it isn't really getting any use these days. I'd rather have the library provide a Python interface to the GROMACS tools and nothing more, since we already maintain enough things these days that do just about everything else, but with more flexibility.

From my experience with datreant, I'm thinking of doing the same kind of cutting down to bare essentials there, too, since trying to do everything means maintaining lots of non-general-purpose code, and there are only so many hours in the day.

@orbeckst in that case can I focus on this for an afternoon or so? It'll be a massive PR (mostly removing an entire chunk of the library), but it will really help me finish up #44 so we can move on.

Yes, please do.

Can you dump what you cut out into a separate repo? It's going to look like a junk yard, but it will give us a way to go back to it if we ever need to (without digging into the history).