scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff

Home Page: https://pyhf.readthedocs.io/

JSON Schema / Spec discussion

kratsg opened this issue · comments

As initiated in #104, questions have been raised about the spec and the way forward, with two overarching goals in mind:

  • intuitive and clean for the user
  • fully-specified and documentable by a schema for an API

Two main issues are raised, as described below.

Fully-Specified

A schema like

{
  "singlechannel": {
    "background": {
      "data": [1,2,3,4],
      "mods": [...]
    },
    "signal": {
      "data": [1,2,3,4],
      "mods": [...]
    }
  }
}

is not fully specified, as it contains dictionaries with variable key names (singlechannel, background, signal). A more fully specified spec looks like this:

[
  {
    "name": "singlechannel",
    "type": "channel",
    "samples": [
      {
        "name": "background",
            "data": [1,2,3,4]
        ],
        "mods": [...]
      },
      {
        "name": "signal",
        "data": [1,2,3,4],
        "mods": [...]
      }
    ]
  }
]

where channels and samples are each specified as arrays of objects. This is a first proposal, but it still has nested arrays, which may or may not be convenient for many users -- flattening the arrays is a possibility, through a process of denormalization (see the Firebase docs).
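As an aside, a rough sketch of how the explicit-array form could be pinned down and validated with the Python jsonschema package is shown below; the schema fragment is purely illustrative, not the actual pyhf schema.

# Illustrative only: a minimal JSON Schema for the array-of-channels layout,
# written as a Python dict and checked with the jsonschema package.
# This is NOT the actual pyhf schema.
import jsonschema

channel_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["name", "type", "samples"],
        "properties": {
            "name": {"type": "string"},
            "type": {"enum": ["channel"]},
            "samples": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": ["name", "data", "mods"],
                    "properties": {
                        "name": {"type": "string"},
                        "data": {"type": "array", "items": {"type": "number"}},
                        "mods": {"type": "array"},
                    },
                },
            },
        },
    },
}

spec = [
    {
        "name": "singlechannel",
        "type": "channel",
        "samples": [
            {"name": "background", "data": [1, 2, 3, 4], "mods": []},
            {"name": "signal", "data": [1, 2, 3, 4], "mods": []},
        ],
    }
]

# raises jsonschema.ValidationError if the spec does not match
jsonschema.validate(spec, channel_schema)

Because every object is an item of an array with a required "name" field, nothing in the schema depends on user-chosen dictionary keys.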

Intuitiveness

Currently, modifications are defined as an array

        "mods": [
          {"type": "shapesys", "name": "mod_JES1", "data": [1,2,3,4]},
          {"type": "shapesys", "name": "mod_JES2", "data": [1,2,3,4]},
          {"type": "shapesys", "name": "mod_FlavTag", "data": [1,2,3,4]}
        ]

However, one of the drawbacks is that it makes a user think of each modification as an entire "object". That is, the above appears to define three modification objects, which is not necessarily true. In spirit, a modification refers to a nuisance parameter, such as mod_JES1, along with its configuration.

first proposal

A first proposal to make this more intuitive was to structure the modifications as a dictionary, with each key name referring to the nuisance parameter of interest:

        "mods": {
          "mod_JES1": {"type": "shapesys", "data": [1,2,3,4]},
          "mod_JES2": {"type": "shapesys", "data": [1,2,3,4]},
          "mod_flavTag": {"type": "shapesys", "data": [1,2,3,4]}
        }

A drawback is that we now have configurable dictionary key names, which does not help with JSON Schema / API specification.

second proposal

A second proposal separates the nuisance parameter from the actual modifications for a given sample/channel:

        "NPs": [
          {"name": "mod_JES1", "mod": {"type": "shapesys", "data": [1,2,3,4]}},
          {"name": "mod_JES2", "mod": {"type": "shapesys", "data": [1,2,3,4]}},
          {"name": "mod_flavTag", "mod": {"type": "shapesys", "data": [1,2,3,4]}},
        ]
commented

I agree that the explicit list is best, and we should adopt this for channels and samples (and any other key-value-y thing we might encounter).

One could separate the definition of the constraint terms from the sample, but there are some issues:

  • each sample adds some detail on how it reacts to the nuisance parameters,
    e.g. we have
    "samples": [
      {
        "name": "background1",
        "data": [1,2,3,4],
        "mods": [
          {"name": "mod_JES", "type": "normsys", "data": {"hi": 1.05, "lo": 0.98}}
        ]
      },
      {
        "name": "background2",
        "data": [1,2,3,4],
        "mods": [
          {"name": "mod_JES", "type": "normsys", "data": {"hi": 1.10, "lo": 0.9}}
        ]
      }
    ]

i.e. both samples share a nuisance parameter but the variation is different. If we reorganize to separate the nuisance parameter definition, we would need to recreate some structure:

variations: [
   {"name": "JES", "type": "normsys", "data": {
     "channel1": {
        "background1": {"hi": 1.05, "lo": 0.98},
        "background2": {"hi": 1.1, "lo": 0.9}
     }
   }}
]

So I'm not sure you gain a lot. Also, for user familiarity it's probably good to stay somewhat close to the HistFactory XML schema, which also co-locates the variations with the sample.
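For concreteness, a small Python sketch (not pyhf code) of what "recreating the structure" would look like when regrouping the co-located mods into a variations-centric view:

# Illustrative sketch (not pyhf code): rebuild a "variations"-centric view
# from a spec that co-locates mods with each sample.
from collections import defaultdict

spec = [
    {
        "name": "channel1",
        "samples": [
            {"name": "background1", "data": [1, 2, 3, 4],
             "mods": [{"name": "mod_JES", "type": "normsys",
                       "data": {"hi": 1.05, "lo": 0.98}}]},
            {"name": "background2", "data": [1, 2, 3, 4],
             "mods": [{"name": "mod_JES", "type": "normsys",
                       "data": {"hi": 1.10, "lo": 0.9}}]},
        ],
    }
]

variations = defaultdict(lambda: {"type": None, "data": defaultdict(dict)})
for channel in spec:
    for sample in channel["samples"]:
        for mod in sample["mods"]:
            entry = variations[mod["name"]]
            entry["type"] = mod["type"]
            # the channel/sample structure has to be recreated inside "data"
            entry["data"][channel["name"]][sample["name"]] = mod["data"]

# variations["mod_JES"]["data"]["channel1"]["background2"] == {"hi": 1.10, "lo": 0.9}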

Maybe @cranmer also has some comments.

So I'm not sure you gain a lot. Also, for user familiarity it's probably good to stay somewhat close to the HistFactory XML schema, which also co-locates the variations with the sample.

Yeah, I think it needs to be made abundantly clear that there's a portion of the NPs that is the same, regardless of the sample, and a portion of the NP that is per-sample. I think something like this

{"name": "mod_JES", "type": "normsys", "data": {"hi": 1.10, "lo": 0.9}}}

is probably the best way to do it. It separates the sample-independent portion of the NP:

{"name": "mod_JES", "type": "normsys"}

from the sample-specific configuration

{"data": {"hi": 1.10, "lo": 0.9}}

and the documentation would just need to make clear what's going on behind the scenes.
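A tiny Python sketch of that split (illustrative only; the field names follow the examples above):

# Illustrative sketch: split a modifier entry into the part that identifies
# the nuisance parameter (shared across samples) and the per-sample part.
def split_modifier(mod):
    shared = {"name": mod["name"], "type": mod["type"]}
    per_sample = {"data": mod["data"]}
    return shared, per_sample

shared, per_sample = split_modifier(
    {"name": "mod_JES", "type": "normsys", "data": {"hi": 1.10, "lo": 0.9}}
)
# shared     == {"name": "mod_JES", "type": "normsys"}
# per_sample == {"data": {"hi": 1.10, "lo": 0.9}}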

Hi all

Thanks @kratsg and @lukasheinrich ! I need to think more before commenting on the proposals, but one point to anticipate for discussions:

For the original HistFactory spec there was a hi/lo setting per nuisance parameter. The above discussion is based on that restricted spec.

For the v2 of the spec I'd like to see something for the variations that can support arbitrary map<parameter_vector, value> where value is either a histogram or a double indicating the value of a normalization factor. Also the parameter_vector key would need to specify the value for each nuisance parameter and parameter of interest. This would then be input to a more generic set of interpolation algorithms that can probe the variation at random points in the parameter space.

The v2 discussion might be too ambitious, or maybe it's worth thinking of it now.

Adding to that, we have previously discussed a few special cases of the fully generic case:

  • current axis-aligned / one-at-a-time variations with alpha_i \in {-1, 0, 1}
  • axis aligned / one at a time variations with alpha_i = {alpha_ij}
  • Cartesian product of alpha_i = {alpha_ij}
  • some subset of the points from Cartesian product (as in Experimental Design)
  • the fully general mesh of points discussed above.

In the end, they can all "compile down" to a general mesh, but not all algorithms can deal with a generic mesh. The interpolation algorithms might need to know which of those special cases they are in, to type-check whether the spec is compatible with what they can process. So maybe we need some sort of mini-spec for each of those special cases, and the top-level HF spec can accept any of those to describe the systematic variations.
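A rough Python sketch of what "compiling down" to a general mesh could look like, with the mesh represented as the map<parameter_vector, value> mentioned above (tuples of parameter values as keys). The helper names here are hypothetical:

# Illustrative sketch with hypothetical helpers: compile two of the special
# cases down to a general mesh, i.e. a map<parameter_vector, value> with a
# tuple of parameter values as the key.
import itertools

def axis_aligned_mesh(nominal, variations):
    # variations: {param_index: {alpha: value}}, one-at-a-time, with all
    # other parameters held at their nominal value 0
    n_params = len(variations)
    mesh = {(0.0,) * n_params: nominal}
    for i, alphas in variations.items():
        for alpha, value in alphas.items():
            point = [0.0] * n_params
            point[i] = alpha
            mesh[tuple(point)] = value
    return mesh

def cartesian_mesh(grid, evaluate):
    # grid: [[alpha_1j, ...], [alpha_2j, ...], ...]; evaluate(point) -> value
    return {point: evaluate(point) for point in itertools.product(*grid)}

mesh = axis_aligned_mesh(
    nominal=[1, 2, 3, 4],
    variations={0: {-1.0: [0.9, 1.8, 2.7, 3.6], 1.0: [1.1, 2.2, 3.3, 4.4]}},
)
# mesh[(1.0,)] == [1.1, 2.2, 3.3, 4.4]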

For the v2 of the spec I'd like to see something for the variations that can support arbitrary map<parameter_vector, value> where value is either a histogram or a double indicating the value of a normalization factor. Also the parameter_vector key would need to specify the value for each nuisance parameter and parameter of interest. This would then be input to a more generic set of interpolation algorithms that can probe the variation at random points in the parameter space.

I actually anticipated something like this in a more generic sense when @lukasheinrich was educating me over skype. In particular, I thought something like

{"name": "mod_JES", "type": "normsys", "data": {"hi": 1.10, "lo": 0.9}}}
# roughly equivalent to a more generic case of...
{"name": "mod_JES", "type": "normsys", "params": {"ticks": [-1, 1], "values": [0.9, 1.10]}}
# or
{"name": "mod_JES", "type": "normsys", "params": [(-1, 10.9), (1, 1.10)]}

and the question is what we should call these things. Perhaps data is too vague and a better description is needed. Maybe params? If so, is ticks/values clear? Maybe points? Maybe x and y? I'm not entirely clear on the dimensionality, but this almost seems entirely 2-dimensional, and we shouldn't have to deal with something like

{"name": "mod_JES", "type": "normsys", "params": {"ticks": [-1, 1], "values": [[0.9, 0.85], [1.05,1.10]]}}

right?
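A minimal sketch of how the hi/lo form would translate into the ticks/values naming floated above (the field names are just candidates under discussion, not settled spec):

# Illustrative only: the "ticks"/"values" names are candidates under
# discussion, not settled spec.
def hilo_to_ticks(data):
    # {"hi": 1.10, "lo": 0.9} -> {"ticks": [-1, 1], "values": [0.9, 1.10]}
    return {"ticks": [-1, 1], "values": [data["lo"], data["hi"]]}

def ticks_to_pairs(params):
    # {"ticks": [...], "values": [...]} -> [(-1, 0.9), (1, 1.10)]
    return list(zip(params["ticks"], params["values"]))

assert hilo_to_ticks({"hi": 1.10, "lo": 0.9}) == {"ticks": [-1, 1], "values": [0.9, 1.10]}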

In the end, they can all "compile down" to a general mesh, but not all algorithms can deal with a generic mesh. The interpolation algorithms might need to know which of those special cases they are in, to type-check whether the spec is compatible with what they can process. So maybe we need some sort of mini-spec for each of those special cases, and the top-level HF spec can accept any of those to describe the systematic variations.

To this last point, this can be done with "definitions". As a real-world example, you can see the "data" definitions here (https://github.com/diana-hep/pyhf/blob/master/validation/spec.json#L40-L58), which account for the different ways data is currently set in pyhf. In particular, once we nail down the schema nicely, the implementation becomes a bit more straightforward.
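To sketch the idea (this is illustrative, not the actual spec.json linked above), a "definitions" block plus oneOf could let the top-level spec accept either a hi/lo mini-spec or a general mesh mini-spec:

# Illustrative only: "definitions" + "oneOf" hosting two variation mini-specs.
# This mirrors the idea of the linked spec.json, not its actual contents.
import jsonschema

schema = {
    "definitions": {
        "hilo": {
            "type": "object",
            "required": ["hi", "lo"],
            "properties": {"hi": {"type": "number"}, "lo": {"type": "number"}},
            "additionalProperties": False,
        },
        "mesh": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["point", "data"],
                "properties": {
                    "point": {"type": "array", "items": {"type": "number"}},
                    "data": {"type": "array", "items": {"type": "number"}},
                },
            },
        },
    },
    "oneOf": [{"$ref": "#/definitions/hilo"}, {"$ref": "#/definitions/mesh"}],
}

jsonschema.validate({"hi": 1.10, "lo": 0.9}, schema)
jsonschema.validate([{"point": [0.1, 0.4, 0.7], "data": [1, 2, 3, 4]}], schema)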

The examples above with mod_JES are still of the "one at a time" variety. I was thinking of examples where you vary every nuisance parameter simultaneously, for instance a single histogram corresponding to the simultaneous variation (JES1=0.1, JES2=-0.4, flavTag=0.7).

commented

@cranmer the current design is such that the pdf is built as

Pois({n_i} | {v_i(theta)}) * Πⱼ f({aux_j} | {theta_j})

with { } denoting a set, i.e. aux_j and theta_j can be more than 1-D. The constraint terms can be arbitrary m-variate pdfs with n parameters, and each term comes with its own notion of how to define the "auxiliary data" of the constrained term. The full parameter array is split up into k groups of parameters

[g00 | g10 | g20 g21 g22 g23 | g30 | ...] 

Some of those parameter groups belong to constraint-type terms (the theta_j above), while some are unconstrained (like normfactors); a group [g20 g21 g22 g23] might e.g. correspond to the set of gammas for a shapesys:

[µ | γ₁, γ₂, γ₃, γ₄ | α_JES | ...]
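As an illustration (the slices and group names here are made up, not pyhf internals), the bookkeeping amounts to holding a slice into the flat parameter array for each group:

# Illustrative sketch: named slices into the flat parameter array.
# The group names and positions here are hypothetical.
import numpy as np

par_slices = {
    "mu":        slice(0, 1),   # unconstrained normfactor
    "gammas":    slice(1, 5),   # shapesys gammas for one sample
    "alpha_JES": slice(5, 6),   # constrained normsys parameter
}

pars = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 0.5])
gammas = pars[par_slices["gammas"]]   # array([1.1, 0.9, 1. , 1.2])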

The means v_i(theta) can in principle be arbitrary functions of the full parameter set. For each sample in a channel, we compute the final histogram given all parameters. Right now it's still somewhat hardcoded (as in the initial spec) that we compute e.g. multiplicative factors and the interpolation explicitly:

https://github.com/diana-hep/pyhf/blob/master/pyhf/__init__.py#L332

but one could move to a scheme where one does

{v_i(theta)} = f₁(f₂(f₃(... fₙ({nom_i}, {theta_n}) ..., {theta_3}), {theta_2}), {theta_1})

but then either all those modifiers must commute, or the order must be fixed somehow.
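A toy Python sketch of that composed scheme (the modifier functions here are simplified stand-ins, not pyhf's implementation):

# Toy sketch of composing modifiers onto the nominal expected counts,
# v(theta) = f1(f2(... fn(nominal, theta_n) ..., theta_2), theta_1).
# The modifier functions are simplified stand-ins, not pyhf's implementation.
import numpy as np

def apply_normfactor(hist, mu):
    return mu * hist

def apply_shapesys(hist, gammas):
    return hist * gammas

def expected(nominal, modifiers, pars):
    # modifiers: list of (function, parameter-group name), applied innermost first
    hist = np.asarray(nominal, dtype=float)
    for func, key in reversed(modifiers):
        hist = func(hist, pars[key])
    return hist

pars = {"mu": 1.5, "gammas": np.array([1.1, 0.9, 1.0, 1.2])}
expected([1, 2, 3, 4], [(apply_normfactor, "mu"), (apply_shapesys, "gammas")], pars)
# these two modifiers are multiplicative and commute; an additive (shift-type)
# modifier mixed in would make the order matter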

commented

One way one could e.g. spec out a simultaneous interpolation for multiple systematics is

'backgroundname': {
    'data': [... nominal array data ...],
    'mods': [
        {
            'name': 'multisys',
            'type': 'multisys',
            'data': {
                'parameters': ['JES1', 'JES2', 'FlavTag'],
                'interpolation': 'rbf',
                'evaluations': [
                   {"point": [0.1,0.4,0.7], "data": [... array data ...]},
                   {"point": [-0.5,0.4,-0.7], "data": [... array data ...]},
                   {"point": [0.8,-0.4,-0.7], "data": [... array data ...]}
                ]
            }
        }
    ]
}

i.e. we announce what parameter space the evaluations live in, and keep a list of evaluations along with some information on how the interpolation would be performed.

we would then need

  • a pdf that implements the constraint term p(meas_JES1, meas_JES2, meas_FlavTag | alpha_JES1, alpha_JES2, alpha_JES3)
  • an interpolating function for a given sample based on the "training data" from the nominal and the other evaluations, with signature interp(point = [alpha_jes1, alpha_jes2, alpha_jes3], training_data = evals + [nominal]) (see the sketch below)

I don't think this would be too hard to do.
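For the second ingredient, a rough sketch using scipy's radial basis function interpolator (illustrative only, not pyhf code; the numbers are made up):

# Illustrative sketch (not pyhf): interpolate a sample's histogram at an
# arbitrary point in (alpha_JES1, alpha_JES2, alpha_FlavTag) space using the
# nominal plus the listed evaluations as "training data". Numbers are made up.
import numpy as np
from scipy.interpolate import RBFInterpolator

nominal = np.array([10.0, 20.0, 30.0, 40.0])
evaluations = [
    {"point": [0.1, 0.4, 0.7],   "data": [11.0, 21.0, 29.0, 41.0]},
    {"point": [-0.5, 0.4, -0.7], "data": [9.0, 19.5, 30.5, 39.0]},
    {"point": [0.8, -0.4, -0.7], "data": [10.5, 20.5, 31.0, 40.5]},
]

points = np.array([[0.0, 0.0, 0.0]] + [e["point"] for e in evaluations])
values = np.vstack([nominal] + [np.array(e["data"]) for e in evaluations])

interp = RBFInterpolator(points, values, kernel="linear")
interp([[0.05, 0.2, 0.35]])   # expected histogram at an off-grid point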

As of #488, I believe we have a v1.0.0 of the schema that incorporates most of the discussions here! (except for simultaneous variation of nuisance parameters)

Thanks for all the feedback so far.