HSF / PyHEP.dev-workshops

PyHEP Developer workshops

Home Page: https://indico.cern.ch/e/PyHEP2023.dev

Analysis code readability and performance

jpivarski opened this issue

Expressing small-scale analysis steps intuitively, to reduce cognitive burden, while also scaling to large datasets, using vectorization, automatic differentiation, JIT-compilation, and C++/Julia/Rust interfaces where possible.
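
For concreteness, here is a minimal sketch (with hypothetical toy data, not taken from the issue) of what such a small-scale analysis step can look like when expressed with vectorized array operations rather than explicit Python loops:

```python
# Minimal sketch with hypothetical toy data: one "small-scale analysis step"
# expressed with vectorized (array-at-a-time) operations instead of loops.
import awkward as ak

# jagged array: jet pT values per event
jets_pt = ak.Array([[40.2, 25.1], [], [80.3, 33.7, 12.0]])

selected = jets_pt[jets_pt > 30.0]   # per-event selection, no Python loop
njets = ak.num(selected)             # [1, 0, 2]
leading = ak.firsts(selected)        # [40.2, None, 80.3]
```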

This feels a lot like User Experience #9, doesn't it?

#3 to me is a bit more low-level: the code people write using e.g. awkward (how to implement analysis logic that is both computationally efficient and readable), while #9 is more high-level: how well do the services / libraries being used integrate, how easy are they to use, and are there pieces of functionality missing.

@jpivarski also mentions in #9 (comment) that #3 is about performance as well, which is less of a focus for #9.

I wanted to get people who are interested in code readability and people who are interested in performance into the same discussion.

#3 was supposed to be about "programming in the small" (expressing individual formulas and small-scale steps in the analysis, line by line) and #4 was supposed to be about "programming in the large" (fitting services and libraries together). So if #9 is about

how well do the services / libraries being used integrate, how easy are they to use, are there pieces of functionality missing

then I intended #4 for that.

But instead of people fitting into the original set of categories, merging or splitting them as needed, most people are creating whole new categories. (The opposite happened at Scientific-Python; sociology is fun!) But we can go with this flow instead: the thing I wanted to avoid was silence/no interaction.

After this first wave, I'm going to bring the less talkative participants into the discussions as they exist at that time—I'll suggest topics based on what they wrote in their "why I'm interested in this workshop" boxes in the registration form.

#4 to me sounded not so much concerned with tools as with higher-level orchestration. Something like "I cannot plot my histograms easily" would be a #9 topic, but not a #4 to me (and certainly not a #3).

+1 I'm very intrigued by @jpivarski's initial statement about the "small-scale analysis steps". I've often wished for a more Unix-philosophy approach to many of our analysis tools, where there are small packages that each do one thing and do it well, and then you pipe output from one to another.

ROOT tried to keep everything in one environment, which made it difficult at times to interface with other tools. My perspective is that the community has moved in a different direction and we now have smaller, piecemeal tools for the different steps, and I think that's for the best.

I'm not sure if my interests are within or outside the scope of "Analysis code readability", but I would love a world in which each step of an analysis was well defined, users could use whatever works for them and whatever makes sense for how they think about things, and they could then just pipe that output to a different step of the analysis. There are obviously challenges with this approach.
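
As an illustration of that pipeline style (a hedged sketch, not a prescription; the file and branch names are made up), chaining today's small, single-purpose packages could look like:

```python
# Hedged sketch of the "small tools piped together" style: each package does
# one job (I/O, arrays, histogramming, plotting). File/branch names are made up.
import uproot
import awkward as ak
import hist
import mplhep

tree = uproot.open("events.root")["Events"]        # I/O: uproot
muon_pt = tree["Muon_pt"].array()                  # jagged arrays: awkward

has_muon = ak.num(muon_pt) > 0
leading_pt = muon_pt[has_muon][:, 0]               # leading muon pT per selected event

h = hist.Hist.new.Reg(50, 0, 200, name="pt", label="leading muon pT [GeV]").Double()
h.fill(pt=leading_pt)                              # histogramming: hist
mplhep.histplot(h)                                 # plotting: mplhep
```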

Whether or not this is in the scope of this part of the workshop, I'm glad to see this broader discussion happening!

We discussed an example of readability a bit in the context of https://gist.github.com/alexander-held/a9ff4928fe33e8e0c09dd27fbdfd24d9. Some ideas for potential interface improvements are in this version by @gordonwatts, including shorthands for indexing and string axes: https://gist.github.com/gordonwatts/87d29b9e1dd13f0958968cd194e7b929.
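
To give a flavor of what "shorthands for indexing and string axes" can look like, here is a rough illustration with hist (this is not a reproduction of the linked gists, and the binning and category names are made up; the actual proposals are in the links above):

```python
# Rough illustration of string axes and shorthand indexing with hist;
# the binning and category names here are hypothetical.
import hist

h = (
    hist.Hist.new
    .StrCat(["ee", "mumu"], name="channel")
    .Reg(40, 0, 400, name="mass", label="m [GeV]")
    .Double()
)
h.fill(channel="ee", mass=[91.0, 125.0, 250.0])
h.fill(channel="mumu", mass=[90.5, 91.2])

h_ee = h["ee", :]     # select a category by its string label
n_ee = h["ee", sum]   # total count in that channel
```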

+1

On the most radical side of the spectrum, I guess "code readability and performance" is basically a question of "what language to use".

I'm happy to provide a perspective on Julia and metaprogramming.

I'm wondering if there's any appetite to turn/expand https://github.com/iris-hep/adl-benchmarks-index into some readability metric. It's very difficult to draw a single conclusion, like ranking the implementations in some order, because there are many aspects of readability.

But I think we should be able to report some metrics out of this collection. For example, in the Julia repo I compiled numbers for:

length (in characters) of the function body after stripping spaces and line breaks, excluding plots, file opening, etc.

We can probably come up with a few more.
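
For instance, the character-count metric above could be computed with something as simple as the following (a minimal Python sketch; deciding what to exclude, such as plotting or file-opening code, would still be a manual step):

```python
# Minimal sketch of the character-count readability metric described above.
# Excluding plotting / file-opening code would still be a manual step.
import inspect
import re

def stripped_length(func) -> int:
    """Length in characters of a function's source, ignoring all whitespace."""
    source = inspect.getsource(func)
    return len(re.sub(r"\s+", "", source))

# hypothetical usage:
# print(stripped_length(my_analysis_function))
```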