A backend that converts `qastle` into C++ code that runs against ATLAS xAOD files. This allows you to query hierarchical data stored in a ROOT file that has been written using the ATLAS xAOD format.
A short list of some of the features that are supported by the xAOD C++ translator follows.

Many, but not all, parts of the python language are supported. As a general rule, anything that is a statement or flow control is not supported: no `if`, `while`, or `for` statements, for example. Assignment isn't supported, which may sound limiting - but this is a functional implementation, so it is less limiting than one might think.

What follows are the parts of the language that are covered (a short example query appears after the list):
- Function calls, method calls, property references, and lambda calls (and lambda functions), with some limitations.
- Integer indexing into arrays.
- Limited tuple support as a means of collecting information together, or as an output to a ROOT file.
- Limited list support (in the same way as above). In particular, the `append` method is not supported, as that modifies the list rather than creating a new one.
- Unary, binary, and comparison operations. Only two-argument comparisons are supported (e.g. `a > b` and not `a > b > c`).
- Using `and` and `or` to combine conditional expressions. Note that these are written as `&` and `|` when writing an expression, because `python` demands a `bool` return from `and` and `or` when written in code.
- The conditional if expression (`10 if a > 10 else 20`).
- Floating point numbers, integers, and strings.
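As an example, here is a minimal sketch of a query that uses only the constructs above (lambdas, method calls, `&`, `abs`, and the conditional expression). The dataset object `ds`, the jet bank name, and the `pt()`/`eta()` methods are illustrative assumptions and may differ in your data model.

```python
# A sketch of a func_adl query using only supported constructs. The dataset
# object `ds`, the jet bank name, and the pt()/eta() methods are assumptions
# for illustration; adjust them to your data model.
jet_pts = (ds
           .SelectMany(lambda e: e.Jets("AntiKt4EMTopoJets"))
           .Where(lambda j: (j.pt() > 30000.0) & (abs(j.eta()) < 2.5))
           .Select(lambda j: j.pt() / 1000.0 if j.pt() > 0 else 0.0))
```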
You can call the functions that are supported by the C++ objects as long as the required arguments are primitive types. Listed below are special extra functions attached to various objects in the ATLAS xAOD data model.
The event object has the following special functions to access collections: `Jets`, `Tracks`, `EventInfo`, `TruthParticles`, `Electrons`, `Muons`, and `MissingET`. Each function takes a single argument, the name of the bank in the xAOD. For example, for the electrons one can pass `"Electrons"`. Adding new collections is fairly easy.

Template functions don't make sense yet in python: `getAttribute` is templated, so it must be called as either `getAttributeFloat` or `getAttributeVectorFloat`.
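For instance, a sketch of pulling electrons out of an event and reading a templated attribute might look like the following; the attribute name used here is purely an illustrative assumption.

```python
# Sketch: access the Electrons bank and read a templated attribute. The
# attribute name "ptvarcone20" is an illustrative assumption; use the names
# actually stored in your xAOD files.
ele_iso = (ds
           .SelectMany(lambda e: e.Electrons("Electrons"))
           .Select(lambda ele: ele.getAttributeFloat("ptvarcone20")))
```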
- Math Operators: +, -, *, /, %, **
- Comparison Operators: <, <=, >, >=, ==, !=
- Unary Operators: +, -, not
- Math functions are pulled from the C++ `cmath` library: `sin`, `cos`, `tan`, `acos`, `asin`, `atan`, `atan2`, `sinh`, `cosh`, `tanh`, `asinh`, `acosh`, `atanh`, `exp`, `ldexp`, `log`, `ln`, `log10`, `exp2`, `expm1`, `ilogb`, `log1p`, `log2`, `scalbn`, `scalbln`, `pow`, `sqrt`, `cbrt`, `hypot`, `erf`, `erfc`, `tgamma`, `lgamma`, `ceil`, `floor`, `fmod`, `trunc`, `round`, `rint`, `nearbyint`, `remainder`, `remquo`, `copysign`, `nan`, `nextafter`, `nexttoward`, `fdim`, `fmax`, `fmin`, `fabs`, `abs`, `fma`.
- Do not use `math.sin` in a call; `sin` on its own is just fine. If you do, you'll get an exception during resolution saying it doesn't know how to translate `math`.
- For things like `sum`, `min`, `max`, etc., use the `Sum`, `Min`, `Max` LINQ predicates (see the sketch after this list).
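A minimal sketch combining a bare `cmath` name with a LINQ-style aggregate, under the same illustrative assumptions about the dataset object and jet methods as above:

```python
# Sketch: bare cmath names (sinh, not math.sinh) and the Sum aggregate.
# The bare name is resolved by the C++ translator rather than by python, so
# it is not imported; the Jets bank name and pt()/eta() are assumptions.
sum_pz = (ds
          .Select(lambda e: e.Jets("AntiKt4EMTopoJets")
                             .Select(lambda j: j.pt() * sinh(j.eta()))
                             .Sum()))
```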
It is possible to inject metadata into the `qastle` query to alter the behavior of the C++ code production. Each sub-section below describes a different type of metadata. To inject it, use the `Metadata` call, which takes a stream as input and outputs the same stream; its argument is a dictionary that contains the metadata (a short sketch follows the notes below).

A few things about metadata:

- No two metadata blocks can have the same name and different content. However, it is legal for them to have different dependencies. In that case, the multiple blocks are treated as a single block with a union of the dependencies.
- Exceptions (`ValueError`) are raised if the dependency graph can't be completed, or a circular dependency is discovered.
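A minimal sketch of what injecting metadata looks like in a query. The exact spelling of the metadata call (`Metadata` vs `MetaData`) depends on your func_adl frontend version, and the dataset object and surrounding query are illustrative assumptions; the dictionary keys follow the tables below.

```python
# Sketch: metadata is passed through the query stream as a dictionary. The
# call name (MetaData) and the surrounding query are assumptions; the keys
# here follow the add_method_type_info table below.
query = (ds
         .MetaData({
             "metadata_type": "add_method_type_info",
             "type_string": "xAOD::Jet",
             "method_name": "pT",
             "return_type": "float",
         })
         .SelectMany(lambda e: e.Jets("AntiKt4EMTopoJets"))
         .Select(lambda j: j.pT()))
```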
If you have a method that returns a non-standard type, use this metadata type to tell the backend the return type. There are two different forms of this metadata: one if a single item is returned, and another if a collection of items is returned.
For a single item:
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "add_method_type_info" |
type_string | The object the method applies to, fully qualified, C++ | "xAOD::Jet" |
method_name | Name of the method | "pT" |
return_type | Type returned, C++, fully qualified | "float" , "float*" , "float**" |
deref_count | Number of times to dereference object before invoking this method (optional) | 2 |
Note: `deref_count` is used when an object can "hide" the objects it holds behind a dereference (e.g. by overriding `operator*`). If it is zero (as it usually is, since `operator*` isn't often overridden), it can be omitted.
For a collection:
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "add_method_type_info" |
type_string | The object the method applies to, fully qualified, C++ | "xAOD::Jet" |
method_name | Name of the method | "jetWeights" |
return_type_element | The type of the collection element | "float" |
return_type_collection | The type of the collection | "vector<float>" , "vector<float>*" |
deref_count | Number of times to dereference object before invoking this method (optional) | 2 |
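For concreteness, a sketch of the two forms as dictionaries, built from the example columns above (the method and type names are the tables' examples, not a statement about the real xAOD interface):

```python
# Sketch: method type info for a single-valued method ...
single_item_md = {
    "metadata_type": "add_method_type_info",
    "type_string": "xAOD::Jet",
    "method_name": "pT",
    "return_type": "float",
    # "deref_count": 2,  # optional; only needed when operator* is overridden
}

# ... and for a method that returns a collection.
collection_md = {
    "metadata_type": "add_method_type_info",
    "type_string": "xAOD::Jet",
    "method_name": "jetWeights",
    "return_type_element": "float",
    "return_type_collection": "vector<float>",
}
```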
These are inline functions: they are placed inline in the code, surrounded by braces. Only the `result` is declared outside the block, and it is expected to be set somewhere inside it. This mechanism can also specify a method; in that case the optional parameter `instance_obj` should be specified.
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "add_cpp_function" |
name | C++ Function Name | "DeltaR" |
include_files | List of include files | [vector, TLorentzVector.h] |
arguments | List of argument names | [vec1, vec2] |
code | List of code lines | ["auto t = (vec1+vec2);", "auto result = t.m();"] |
instance_object | Present only if this is an object replacement. It specifies the code string that should be replaced by the current object | "xAOD::Jet_vt" |
method_object | The object name that the method can be called on. Present only if this is a method. | "obj_j" |
result_name | The name to use if not using `result` (optional) | "my_result" |
return_type | C++ return type | double |
return_is_collection | If true, the return value is a collection of `return_type` | True |
Note that a very simple replacement is done for `result_name`, so it needs to be a totally unique name. The back-end may well change `result` to some other name (like `r232`) depending on the complexity of the expression being parsed.

If two functions are sent with the same name they must be identical, or behavior is undefined.
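A sketch of an `add_cpp_function` block built from the example column above; the code lines are the table's illustrative examples, and how the resulting function is then used inside a query is not shown here.

```python
# Sketch: declare a small C++ helper via metadata. All values are taken from
# the example column in the table above and are illustrative only.
deltar_md = {
    "metadata_type": "add_cpp_function",
    "name": "DeltaR",
    "include_files": ["vector", "TLorentzVector.h"],
    "arguments": ["vec1", "vec2"],
    "code": [
        "auto t = (vec1+vec2);",
        "auto result = t.m();",
    ],
    "return_type": "double",
}
```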
ATLAS runs job scripts to configure its environment. These are needed to do things like apply corrections, etc. This block allows those scripts to be added on the fly. In ATLAS these job scripts are python.
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "add_job_script" |
name | Name of this script block | "apply_corrections" |
script | List of lines of python | ["calibration = makeAnalysis('mc')", "job.addSequence(calibration)"] |
depends_on | List of other script blocks that this should come after | ["correction_setup"] |
A dependency graph is built from the `depends_on` entries; without them, the blocks will appear in a random order.
NOTE: Currently the CMS backend will ignore any job script metadata sent to it.
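A sketch of a job-script block using the example values from the table; the python lines are the table's placeholders, not working calibration code.

```python
# Sketch: job script metadata, taken from the example column above. The
# script lines are placeholders rather than real calibration code.
job_script_md = {
    "metadata_type": "add_job_script",
    "name": "apply_corrections",
    "script": [
        "calibration = makeAnalysis('mc')",
        "job.addSequence(calibration)",
    ],
    "depends_on": ["correction_setup"],
}
```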
CMS and ATLAS store their basic reconstruction objects as collections (e.g. jets, etc.). You can define new collections on the fly with the following metadata.
For ATLAS:
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "add_atlas_event_collection_info" |
name | The name of the collection (used to access it from the dataset object) | "TruthParticles" |
include_files | List of include files to use when accessing collection | ['file1.h', 'file2.h'] |
container_type | The container object that is filled | "xAOD::ElectronContainer" |
element_type | The element in the container. In ATLAS this is a pointer. | "xAOD::Electron" |
contains_collection | Some items are singletons (like `EventInfo`) | True or False |
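As a sketch, the ATLAS form of this metadata as a dictionary, using the example values from the table above (the name, types, and include files are the table's examples, not a tested combination):

```python
# Sketch: define a new ATLAS event collection. Values are taken directly from
# the table's example column and are illustrative, not a tested combination.
atlas_collection_md = {
    "metadata_type": "add_atlas_event_collection_info",
    "name": "TruthParticles",
    "include_files": ["file1.h", "file2.h"],
    "container_type": "xAOD::ElectronContainer",
    "element_type": "xAOD::Electron",
    "contains_collection": True,
}
```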
For CMS AOD:
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "add_cms_aod_event_collection_info" |
name | The name of the collection (used to access it from the dataset object) | "Vertex" |
include_files | List of include files to use when accessing collection | ['DataFormats/VertexReco/interface/Vertex.h'] |
container_type | The container object that is filled | "reco::VertexCollection" |
element_type | The element in the container. | "reco::Vertex" |
contains_collection | Some items are singletons (like `EventInfo`) | True or False |
element_pointer | Indicates if the element type is a pointer | True or False |
For CMS miniAOD:
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "add_cms_miniaod_event_collection_info" |
name | The name of the collection (used to access it from the dataset object) | "Muon" |
include_files | List of include files to use when accessing collection | [DataFormats/PatCandidates/interface/Muon.h] |
container_type | The container object that is filled | "pat::MuonCollection" |
element_type | The element in the container. | "pat::Muon" |
contains_collection | Some items are singletons (like `EventInfo`) | True or False |
element_pointer | Indicates if the element type is a pointer | True or False |
Code blocks provide a way to inject various lines of C++ into the generated code. There are a number of options, and any combination of keys can be used.
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "inject_code" |
name | The name of the code block | "code_block_1" |
body_includes | List of files to include in the C++ file (`query.cpp`) | ["file1.hpp", "file2.hpp"] |
header_includes | List of files to include in the C++ header file (`query.hpp`) | ["file1.hpp", "file2.hpp"] |
private_members | List of class instance variables to declare (`query.hpp`) | ["int first;", "int second;"] |
instance_initialization | Initializers added to the constructor in the main C++ class file (`query.cpp`) | ["first(10)", "second(10)"] |
ctor_lines | Lines of C++ to add to the body of the constructor (`query.cpp`) | ["second = first * 10;"] |
link_libraries | Items to add to the CMake LINK_LIBRARIES list (`CMakeLists.txt`) | ["TrigDecisionToolLib"] |
A few things to note:
- Note which items have semicolons and which do not. This is crucial: the system will not add them for you!
- While the ordering of lines within a single `inject_code` metadata block will be maintained, different blocks may be reordered arbitrarily.
- Include files always use the double-quote form: `#include "file1.hpp"`.
- The name of the code block is not used anywhere, but it must be unique. If two code blocks are submitted with the same name but different contents, an error is generated.
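A sketch of a complete `inject_code` block assembled from the example column above; note the semicolons appear exactly where the generated C++ needs them.

```python
# Sketch: inject_code metadata assembled from the table's example values.
# Semicolons are written exactly where the generated C++ requires them.
inject_code_md = {
    "metadata_type": "inject_code",
    "name": "code_block_1",
    "body_includes": ["file1.hpp", "file2.hpp"],
    "header_includes": ["file1.hpp", "file2.hpp"],
    "private_members": ["int first;", "int second;"],
    "instance_initialization": ["first(10)", "second(10)"],
    "ctor_lines": ["second = first * 10;"],
    "link_libraries": ["TrigDecisionToolLib"],
}
```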
This metadata can only be used if you are running against a local file (e.g. using `xAODDataset` or similar). It allows you to configure which docker image you want to run against.
Key | Description | Example |
---|---|---|
metadata_type | The metadata type | "inject_code" |
image | The docker image and tag to run | "atlas/analysisbase:21.2.195" |
The `xAOD` code only renders the `func_adl` expression as a ROOT file. The ROOT file contains a simple `TTree` in its root directory.
- If `AsROOTTTree` is the top level `func_adl` node, then the tree name and file name are taken from that expression. Only a sequence of python `tuple`s or a single item can be understood by `AsROOTTTree`.
- If a `Select` sequence of `int` or `double` is the last `func_adl` expression, then a file called `xaod_output.root` will be generated, and it will contain a `TTree` called `atlas_xaod_tree` with a single column called `col1`.
- If a `Select` sequence of `tuple`s is the last `func_adl` expression, then a file called `xaod_output.root` will be generated, and it will contain a `TTree` called `atlas_xaod_tree` with columns named `col1`, `col2`, etc.
- If a `Select` sequence of dictionaries is the last `func_adl` expression, then a file called `xaod_output.root` will be generated, and it will contain a `TTree` called `atlas_xaod_tree` with column names taken from the dictionary keys (see the sketch after this list).
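As a sketch of the dictionary form, the following query would produce `xaod_output.root` containing an `atlas_xaod_tree` with columns named after the dictionary keys; the dataset object, bank name, and jet methods are illustrative assumptions.

```python
# Sketch: end the query with a Select of a dictionary; the column names in
# atlas_xaod_tree come from the keys. The dataset object, bank name, and
# pt()/eta() methods are illustrative assumptions.
query = (ds
         .SelectMany(lambda e: e.Jets("AntiKt4EMTopoJets"))
         .Select(lambda j: {
             "jet_pt": j.pt() / 1000.0,
             "jet_eta": j.eta(),
         }))
```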
`ServiceX` (and the `servicex` frontend package) can convert from ROOT to other formats like a `pandas.DataFrame` or an `awkward` array.
Setting up the development environment:
- After creating a virtual environment, do a setup-in-place: `pip install -e .[test]`
To run tests:
- `pytest -m "not atlas_xaod_runner and not cms_runner"` will run the fast tests.
- `pytest -m "atlas_xaod_runner"`, `pytest -m "cms_aod_runner"`, and `pytest -m "cms_miniaod_runner"` will run the slow tests for ATLAS xAOD, CMS AOD, and CMS miniAOD respectively; these require docker to be installed. `docker` is invoked via python's `os.system`, so it needs to be available to the test runner.
- The CI on github is set up to run tests against python `3.7`, `3.8`, and `3.9` (only the non-xaod-runner tests).
Contributing:
- Develop in another repo or on a branch
- Submit a PR against the `master` branch.

In general, the `master` branch should pass all tests all the time. Releases are made by tagging on the `master` branch.
Publishing to PyPi:
- Automated by declaring a new release (or pre-release) in github's web interface
Designed for running locally, the `xAOD` backend can be set up and used if you have `docker` installed on your local machine. To use this you first need to install the local flavor of this package:
pip install func_adl_xAOD[local]
You can then use the `xAODDataset`, `CMSRun1AODDataset`, and `CMSRun2miniAODDataset` objects to execute `qastle` locally, running on a docker image for ATLAS, CMS Run 1 AOD, or CMS Run 2 miniAOD respectively (a short sketch follows the notes below).
- Specify the local path to files you want to run on in the arguments to the constructor
- Files are run serially, and in a blocking way
- This code is designed for development and testing work, and is not designed for large-scale production running on local files (not that that couldn't be done).
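A minimal sketch of local running; the import path, file path, bank name, and the `value()` accessor used to retrieve results are assumptions that may differ with your version of the package and the func_adl frontend.

```python
# Sketch: run a query locally against an xAOD file via docker. The import
# path, the file path, the jet bank name, and the value() accessor are all
# illustrative assumptions.
from func_adl_xAOD import xAODDataset

ds = xAODDataset("/data/mc/my_sample.xAOD.root")
jet_pts = (ds
           .SelectMany(lambda e: e.Jets("AntiKt4EMTopoJets"))
           .Select(lambda j: j.pt() / 1000.0)
           .value())
```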
When something odd happens and you really want to look at the C++ output, you can do so by including the following code somewhere before the `xAOD` backend is executed. This turns on logging that will dump the output from the run, along with the C++ header and source files that were used to execute the query.
import logging
logging.basicConfig()
logging.getLogger("func_adl_xAOD.common.local_dataset").setLevel(level=logging.DEBUG)
- In general, the first two lines are a good thing to have in your notebooks, etc. They let you see where warning messages are coming from and might help when things are going sideways.
Note that some of the local runners will use a docker volume to cache calibration files and the like. If you need a truly fresh start, you'll need to remove the volume first.