mctools / mcpl

Monte Carlo Particle Lists

Home Page: https://mctools.github.io/mcpl/

Python writer for MCPL files

grzanka opened this issue

Hi, this is a great tool. I saw Python code to inspect and load binary MCPL files.
Are you planning to provide a Python wrapper to generate MCPL files as well?
This would simplify using Python to generate phase-space files from binary files produced by other MC codes (e.g. SHIELD-HIT12A). We are discussing with @nbassler a converter (https://github.com/DataMedSci/pymchelper) which would read the multiple binary files generated in a parallel run of SHIELD-HIT12A and convert them into a single MCPL phase-space file.

Absolutely, this is planned and often discussed (see e.g. #54). However, it is not completely clear when there will be time to work on it - if we are lucky, maybe in the fall. In the meantime, I recommend you either merge the files with the (compiled) mcpltool command-line utility, or with a bit of custom C code (look for the word "merge" in mcpl.h).

What level of custom filtering or editing would your use-case need?

I believe we don't need any filtering at all. We have a Python reader for the binary files generated by the SHIELD-HIT12A code (https://github.com/DataMedSci/pymchelper/blob/master/pymchelper/readers/shieldhit/reader_bdo2019.py#L16). It will be used once we have implemented scoring of phase-space files in SHIELD-HIT12A. In the reader we will have access to a couple of numpy arrays, which we would simply like to dump to an MCPL file.
So in fact a simple numpy-to-MCPL converter would be enough for us.

While obviously the best thing is to wait for @tkittel to write the Python code, in the meantime I'd write a Python extension in C to do this. It is not very difficult.

Well, a "numpy to MCPL converter" is basically the entire task; it is not really simple at all.

If you don't care too much about interfaces, features, and efficiency, you can hack something together. You have 4 options, depending on your tastes:

  1. Use Python ctypes to access the functions in libmcpl.so.
  2. Write a Python extension in C, built against mcpl.h + libmcpl.so.
  3. Simply create your own little C application (or library) which takes your input data and creates MCPL files via mcpl.h + libmcpl.so, then call that C application (or library, via ctypes) from Python.
  4. Read the appendix in the MCPL paper, which gives the details of the MCPL format, and write custom code for outputting your data from Python directly in the MCPL binary format. This should be a much simpler task than the general MCPL Python writer, since you only need to support the output options you actually want to use.

I personally recommend option 3, but there is no accounting for taste :-)
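To make option 1 concrete, here is a minimal ctypes sketch. The mcpl_particle_t field layout and the function signatures below are my reading of mcpl.h and should be double-checked against your installed header - a layout mismatch would silently corrupt the output file:

```python
import ctypes
import ctypes.util

# Assumed layout of mcpl_particle_t (verify against mcpl.h!)
class MCPLParticle(ctypes.Structure):
    _fields_ = [
        ("ekin", ctypes.c_double),               # kinetic energy [MeV]
        ("polarisation", ctypes.c_double * 3),
        ("position", ctypes.c_double * 3),       # [cm]
        ("direction", ctypes.c_double * 3),      # unit vector
        ("time", ctypes.c_double),               # [ms]
        ("weight", ctypes.c_double),
        ("pdgcode", ctypes.c_int32),             # e.g. 2112 for neutrons
        ("userflags", ctypes.c_uint32),
    ]

# mcpl_outfile_t is (I believe) a tiny struct wrapping an internal pointer.
class MCPLOutFile(ctypes.Structure):
    _fields_ = [("internal", ctypes.c_void_p)]

def bind_mcpl():
    """Locate libmcpl and declare the writer API; returns None if absent."""
    path = ctypes.util.find_library("mcpl")
    if path is None:
        return None
    lib = ctypes.CDLL(path)
    lib.mcpl_create_outfile.argtypes = [ctypes.c_char_p]
    lib.mcpl_create_outfile.restype = MCPLOutFile
    lib.mcpl_get_empty_particle.argtypes = [MCPLOutFile]
    lib.mcpl_get_empty_particle.restype = ctypes.POINTER(MCPLParticle)
    lib.mcpl_add_particle.argtypes = [MCPLOutFile, ctypes.POINTER(MCPLParticle)]
    lib.mcpl_add_particle.restype = None
    lib.mcpl_close_outfile.argtypes = [MCPLOutFile]
    lib.mcpl_close_outfile.restype = None
    return lib
```

With the library bound, writing would then be: create the outfile, fill a particle struct per event, add it, and close.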

Agreed - what I meant was: a hacky "for now" solution would be simple to write.
Doing it right and efficiently is not necessarily simple at all.

Hi there,
I couldn't help myself, so I put something together here: https://github.com/ebknudsen/np2mcpl
It is far from fast, but works well for me up to ~10^8 particles in a single numpy array.
N.B. so far it only works with double precision.

Thanks Erik, I am sure some people might find this workaround pretty useful!

One caveat you might want to make clear, though, is that the PDG code is hardcoded to neutrons in the output.

@grzanka No pip yet, I am afraid - I'll try to set it up, though I'll have to figure out how to deal with the MCPL library first, but that should be possible (no pip guru, I'm afraid). For now you can simply run "python setup.py build" to build, and then point PYTHONPATH to the build directory.

@tkittel Good point! I'll make sure to document that. The presence of a non-constant PDG code could be inferred from the number of columns in the numpy array, allowing ncols to be 9, 10, 12, or 13. OTOH perhaps it is better to allow setting it via a flag... or both.

Or perhaps it is better to simply leave it as a good example for crafty users to hack on, before you get dragged flag by flag towards the complexities of the full-featured final solution ;-)

There is that, of course - you do have a point :-) The column-number solution would be easy, though. That way all the column gymnastics are simply left to numpy usage. I think I'll go for demanding 10 or 13 columns, and that will be that.
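The column-count convention discussed above might look like the sketch below. The exact column ordering (PDG code first, then positions, directions, time, energy, weight, and optionally polarisation) is an assumption for illustration - check the np2mcpl README for the real convention:

```python
import numpy as np

def split_columns(arr):
    """Split a particle array by the assumed 10/13-column convention:
    10 cols: pdgcode, x, y, z, ux, uy, uz, t, ekin, weight
    13 cols: the same plus a 3-component polarisation vector."""
    arr = np.asarray(arr, dtype=np.float64)
    if arr.ndim != 2 or arr.shape[1] not in (10, 13):
        raise ValueError("expected an (N, 10) or (N, 13) array")
    pdg = arr[:, 0].astype(np.int32)   # per-particle PDG codes
    core = arr[:, 1:10]                # x,y,z, ux,uy,uz, t, ekin, weight
    pol = arr[:, 10:13] if arr.shape[1] == 13 else None
    return pdg, core, pol
```

The appeal of this scheme is exactly what is noted above: any reordering or unit conversion stays on the numpy side, and the writer only has to distinguish two shapes.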

Just want to chime in that a Python writer would be really helpful for me, too. It would make it much easier to use external tools when simulating samples, as one could just read the events, run the tool, and create new events to continue the ray-tracing after that.

In the interim, you are obviously welcome to use np2mcpl if that fits your glove. Also you'd be most welcome to help out with the packaging. I unfortunately still haven't figured out how to build and package this with pip/conda.

I can have a look at the packaging; I fear it would be quite an effort with the C library binary as a dependency.

As mentioned in another issue (#54 (comment)), I've managed to come up with pure Python code which does the job. This may be a starting point for a real pure-Python writer, which would not require any C-library dependencies to work.

As the current Python library for MCPL does not seem to require C libraries, maybe it could be extended.
I am a bit afraid of proposing another tens or hundreds of lines of code for the god-object-like module https://github.com/mctools/mcpl/blob/master/src/python/mcpl.py

Maybe some of the developers have ideas on how to extend the architecture of the MCPL module to allow it to write data as well?

The MCPLFile class already opens the file for reading in its __init__ method, so simply adding a method which writes the file may not be that easy.

What do you think of an approach similar to the one used in numpy? That means:

  • a class MCPLCollection which holds a collection of MCPLParticles (i.e. many MCPLParticleBlocks)
  • a module-level method from_file which takes the path to the input file (and many other options) and returns an MCPLCollection; you would call it as mcpl.from_file, similar to np.loadtxt or pd.read_csv
  • a to_file method on the MCPLCollection class which dumps the data into a binary file
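The proposed interface could be sketched roughly as below. The class and method names come from the proposal above; using .npz as the on-disk format is purely a placeholder standing in for the eventual MCPL binary writer:

```python
import numpy as np

class MCPLCollection:
    """Sketch of the proposed interface. The .npz serialization below is a
    placeholder, NOT the MCPL binary format."""

    def __init__(self, **arrays):
        lengths = {len(a) for a in arrays.values()}
        if len(lengths) > 1:
            raise ValueError("all particle arrays must have the same length")
        self.arrays = {k: np.asarray(v) for k, v in arrays.items()}

    def to_file(self, path):
        # A real implementation would emit the MCPL binary layout here.
        np.savez(path, **self.arrays)

def from_file(path):
    """Module-level loader, mirroring np.loadtxt / pd.read_csv usage."""
    with np.load(path) as data:
        return MCPLCollection(**{k: data[k] for k in data.files})
```

The nice property of this split is that the in-memory representation (a collection of named arrays) is decoupled from the serialization, so the placeholder writer could later be swapped for a real MCPL one without changing callers.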

In the long term, the code could be split into smaller pieces (to avoid files with over a thousand lines of code) and some unit tests could be added.

@aglavic You are most welcome to take a look if you feel like it and have the time. Figuring out the procedure would be useful for other things as well.
On the other hand, as @grzanka points out, it might not be worth it if there's a pure Python solution on the way soon. The intent of np2mcpl was never to be the solution for numpy-to-MCPL conversion - simply a solution.

IMHO the two codes/examples provided by @ebknudsen and @grzanka can be used as short-term solutions. Later on, perhaps end of this year if all goes well, I will provide a proper MCPLFileWriter helper class in the python API, which will be able to work with particle blocks.

I was curious: what is the idea behind the particle blocks? Is it a design dedicated to working with large files (say, not loading everything into memory at once)? Something similar to https://numpy.org/doc/stable/reference/generated/numpy.memmap.html ?

@grzanka yes, memory is one of the advantages as it allows you in principle to process a huge MCPL file without having to load it into memory. A 500GB MCPL file might otherwise be impossible to process on many laptops.

The other advantage is one of speed: Python statements are very expensive, which is why something like numpy exists as it allows you to express your intent in Python but have most of the workload happen in compiled code. So dealing with a whole block at a time gives us that speedup, making it essentially as fast as compiled code.

That makes sense; I like the design, especially implementing the particles property of MCPLFile as a Python generator.
The caching also explains the need to provide the filename to __init__.
Thanks for explaining the design!

My comments above (#70 (comment)) make sense only for loading smaller files (which fit into memory easily).