SparkDLTrigger - Deep Learning and Spark used to build a particle classifier

This repository contains code, notebooks, and datasets used to build a machine learning pipeline for a high energy physics particle classifier using Apache Spark, ROOT, Parquet, TensorFlow, and Jupyter with Python notebooks.

Physics Use Case

Event data flows collected from the particle detector (CMS experiment) contain different types of event topologies of interest. A particle classifier built with neural networks can be used as an event filter, improving on the state of the art in accuracy.
This work reproduces the findings of the paper "Topology classification with deep learning to improve real-time event selection at the LHC", re-implemented using tools from the Big Data ecosystem, notably Apache Spark and the TensorFlow/Keras APIs at scale.

Authors

Authors and contacts: Matteo.Migliorini@cern.ch, Riccardo.Castellotti@cern.ch, Luca.Canali@cern.ch
Original research article, raw data and neural network models by: T.Q. Nguyen et al., Comput Softw Big Sci (2019) 3: 12
Acknowledgements: Marco Zanetti, Thong Nguyen, Maurizio Pierini, Viktor Khristenko, CERN openlab, members of the Hadoop and Spark service at CERN, CMS Bigdata project, Intel team for BigDL and Analytics Zoo consultancy: Jiao (Jennie) Wang and Sajan Govindan.

Download datasets
Data preparation using Apache Spark
- Data ingestion and feature preparation
- Preparation of the datasets in Parquet and TFRecord formats
Model tuning
- Hyperparameter tuning
Model training
- HLF classifier with Keras, a simple model and small dataset
  - This is a simple classifier based on a DNN
  - The notebooks also illustrate various methods for feeding Parquet data to TensorFlow: via memory, via Pandas, and using TFRecords and tf.data
- Inclusive classifier, training of a complex model with large-scale data
  - This classifier uses an LSTM and is data-intensive
  - This shows a case where the training data cannot fit into memory
- Methods for distributed training
- Training using tree-based models run in parallel using Spark
  - Methods with Spark MLlib Random Forest, XGBoost and LightGBM
- Saved models
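
When the training data cannot fit into memory, the usual approach is to stream it in fixed-size batches instead of loading it all at once. The repository prepares datasets in TFRecord format and mentions tf.data for this purpose; the library-free sketch below only illustrates the principle. The names `iter_batches` and `toy_records` are hypothetical and are not taken from the notebooks.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of at most batch_size items, reading records lazily."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        # Pull only the next batch_size items; nothing else is materialized.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_records(n):
    # Hypothetical stand-in for reading events one at a time from files on disk.
    for i in range(n):
        yield {"event_id": i, "label": i % 3}


# Only one batch is held in memory per step of the generator;
# the list() call here is just to inspect the result of a tiny example.
batches = list(iter_batches(toy_records(10), batch_size=4))
batch_sizes = [len(batch) for batch in batches]
```

In a real training loop each batch would be consumed and discarded before the next one is read, so memory use depends on the batch size and not on the dataset size.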

Note: See also the archived work in branch article_2020

Data Pipelines for Deep Learning

Data pipelines are of paramount importance to make machine learning projects successful, by integrating multiple components and APIs used for data processing across the entire data chain. A good data pipeline implementation can accelerate and improve the productivity of the work around the core machine learning tasks. The four steps of the pipeline we built are:

Data Ingestion: where we read data in ROOT format from the CERN-EOS storage system into a Spark DataFrame and save the results as a table stored in Apache Parquet files.
Feature Engineering and Event Selection: where the Parquet files containing all the event details processed in Data Ingestion are filtered and datasets with new features are produced.
Parameter Tuning: where the best set of hyperparameters for each model architecture is found by performing a grid search.
Training: where the best models found in the previous step are trained on the entire dataset.
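
The grid search used in the Parameter Tuning step can be sketched in a few lines: every combination of candidate values is evaluated and the best-scoring one is kept. This is a minimal illustration under stated assumptions, not the code from the tuning notebook. The search space, the `grid_search` helper, and the `toy_score` function (a placeholder for training a model and returning a validation metric) are all hypothetical.

```python
from itertools import product


def grid_search(search_space, score_fn):
    """Evaluate every combination in search_space and return the best one.

    search_space maps a hyperparameter name to the list of values to try.
    score_fn takes a dict of hyperparameters and returns a score
    (higher is better).
    """
    names = list(search_space)
    best_params, best_score = None, float("-inf")
    for values in product(*(search_space[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [50, 100],
}


def toy_score(params):
    # Placeholder for "train the model and return a validation metric".
    # It simply prefers learning_rate=0.01 and more hidden units.
    return -abs(params["learning_rate"] - 0.01) + params["hidden_units"] / 1000


best_params, best_score = grid_search(SEARCH_SPACE, toy_score)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (6 here), which is why grid search is typically run on a reduced dataset before the final training on the entire dataset.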

Results

The results of the DL model(s) training are satisfactory and match the results of the original research paper.

Additional Info and References

About

Attempt to apply to ALPHA data

https://link.springer.com/epdf/10.1007/s41781-020-00040-0?sharing_token=uWYOlA9Jsl_OU1qg0g26V_e4RwlQNchNByi7wbcMAY4e8rGeba1iEops3OnkDwae1e8JmnvyaaridMVKgvv13rAQE_eB-ajpYcx2W260n23De0Cs1aLY_lT-WO0vzfkcq0ZcEq2Z2HGPP5rI7PIvBpJoMx6pvshwa_MgQ43JDSg%3D

Apache License 2.0
