lukasgolino / AGMVA-DL

An attempt to apply this pipeline to ALPHA data

Home Page: https://link.springer.com/epdf/10.1007/s41781-020-00040-0?sharing_token=uWYOlA9Jsl_OU1qg0g26V_e4RwlQNchNByi7wbcMAY4e8rGeba1iEops3OnkDwae1e8JmnvyaaridMVKgvv13rAQE_eB-ajpYcx2W260n23De0Cs1aLY_lT-WO0vzfkcq0ZcEq2Z2HGPP5rI7PIvBpJoMx6pvshwa_MgQ43JDSg%3D

SparkDLTrigger - Deep Learning and Spark used to build a particle classifier

This repository contains the code, notebooks, and datasets used to build a machine learning pipeline for a high energy physics particle classifier, using Apache Spark, ROOT, Parquet, TensorFlow, and Jupyter with Python notebooks.

Related articles and presentations

Physics Use Case

Event data collected from the particle detector (the CMS experiment) contains different types of event topologies of interest. A particle classifier built with neural networks can be used as an event filter, improving on the state-of-the-art accuracy.
This work reproduces the findings of the paper Topology classification with deep learning to improve real-time event selection at the LHC, re-implemented at scale with tools from the Big Data ecosystem, notably Apache Spark and the TensorFlow/Keras APIs.

Physics use case for the particle classifier

Authors

Contents

Note: see also the archived work in the article_2020 branch

Data Pipelines for Deep Learning

Data pipelines are of paramount importance to the success of machine learning projects: they integrate the multiple components and APIs used for data processing across the entire data chain. A good data pipeline implementation can accelerate and improve the productivity of the work around the core machine learning tasks. The four steps of the pipeline we built are listed below; an illustrative code sketch for each step follows the list:

  • Data Ingestion: where we read data in ROOT format from the CERN EOS storage system into a Spark DataFrame and save the result as a table stored in Apache Parquet files
  • Feature Engineering and Event Selection: where the Parquet files containing all the event details produced by Data Ingestion are filtered, and datasets with new features are produced
  • Parameter Tuning: where the best set of hyperparameters for each model architecture is found by performing a grid search
  • Training: where the best models found in the previous step are trained on the entire dataset.
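
A minimal sketch of the ingestion step, assuming the spark-root data source for reading ROOT files into Spark (the exact format string and options depend on the spark-root version); all paths here are placeholders, not the repository's actual locations:

```python
from pyspark.sql import SparkSession

# Spark session; the spark-root package must be on the classpath
# (e.g. supplied via the --packages option of spark-submit).
spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

# Read ROOT files from EOS into a Spark DataFrame.
# "org.dianahep.sparkroot.experimental" is the spark-root data source name
# (an assumption for this sketch); the input path is a placeholder.
df = (spark.read
      .format("org.dianahep.sparkroot.experimental")
      .load("root://eospublic.cern.ch//eos/path/to/data/*.root"))

# Persist the events as a Parquet table for the downstream pipeline steps.
df.write.mode("overwrite").parquet("events.parquet")
```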
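
A sketch of the event selection and feature engineering step with the Spark DataFrame API; the column names, cuts, and derived feature are illustrative placeholders rather than the repository's actual schema:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-engineering").getOrCreate()

# Read the full event table produced by the ingestion step.
events = spark.read.parquet("events.parquet")

# Event selection: apply illustrative kinematic cuts
# (nMuons and missingET are placeholder column names).
selected = events.filter((F.col("nMuons") >= 1) & (F.col("missingET") > 20.0))

# Feature engineering: derive a new column and keep only the model inputs.
features = (selected
            .withColumn("pt_ratio", F.col("muonPt") / F.col("missingET"))
            .select("pt_ratio", "missingET", "label"))

features.write.mode("overwrite").parquet("features.parquet")
```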
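
A sketch of the parameter tuning and training steps with the Keras API. The grid values, network architecture, and random placeholder data are assumptions for illustration, not the repository's actual configuration; the shape of the data (14 input features, three event classes) follows the original paper's high-level-feature classifier:

```python
import itertools
import numpy as np
import tensorflow as tf

# Placeholder data standing in for the engineered feature dataset.
x_train = np.random.rand(1000, 14).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(3, size=1000), 3)
x_val = np.random.rand(200, 14).astype("float32")
y_val = tf.keras.utils.to_categorical(np.random.randint(3, size=200), 3)

def build_model(n_hidden, learning_rate):
    # Small fully connected classifier; the architecture is illustrative.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(n_hidden, activation="relu", input_shape=(14,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Grid search: briefly train each hyperparameter combination, keep the best.
best_acc, best_params = 0.0, None
for n_hidden, lr in itertools.product([50, 100, 200], [1e-3, 1e-4]):
    model = build_model(n_hidden, lr)
    hist = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     epochs=5, batch_size=128, verbose=0)
    val_acc = hist.history["val_accuracy"][-1]
    if val_acc > best_acc:
        best_acc, best_params = val_acc, (n_hidden, lr)

# Final training: retrain the best configuration on the entire dataset.
final_model = build_model(*best_params)
final_model.fit(x_train, y_train, epochs=20, batch_size=128, verbose=0)
```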

Machine learning data pipeline

Results

The results of training the DL models are satisfactory and match the results of the original research paper: the loss converges, and the ROC curves and AUC values agree.
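
As a hedged illustration (not the repository's own evaluation code), per-class ROC curves and AUC values for a multi-class classifier can be computed with scikit-learn; the random arrays below are placeholders standing in for the test labels and the model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholders: one-hot test labels and predicted class probabilities
# over the three event classes (normally model.predict(x_test)).
y_test = np.eye(3)[np.random.randint(3, size=500)]
y_score = np.random.dirichlet(np.ones(3), size=500)

# One ROC curve and AUC value per class (one-vs-rest).
for i in range(3):
    fpr, tpr, _ = roc_curve(y_test[:, i], y_score[:, i])
    print(f"class {i}: AUC = {auc(fpr, tpr):.3f}")
```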

Additional Info and References