lsst-epo / light_curve_ml

Light curve classification prototyping

Machine learning for light curves.

The current focus is broad classification of astronomical objects. The project has been in hibernation since May 2018. Like a bear.

Requirements

  • Python 3
  • SQLite command line tool (optional)

Local install

  • Set the LCML environment variable to the repo checkout's path (e.g., export LCML=/Users/*/code/light_curve_ml)
  • cd $LCML && pip install -e . --user

AWS Ubuntu install

See instructions in conf/dev/ubuntu_install.txt

Running ML pipeline

Supervised and unsupervised machine learning pipelines are run via the run_pipeline.py entry point. It expects the path to a job (config) file and a file name for the log output. For example:

python3 lcml/pipeline/run_pipeline.py --path conf/local/supervised/macho.json --logFileName super_macho.log

Job File

The pipeline expects a job file (macho.json in the example above) specifying the configuration of the pipeline and a detailed declaration of experiment parameters.

The specified job file supersedes the default job file (conf/common/pipeline.json), overriding it on a per-field basis, recursively; any, all, or none of the default fields may be overridden.
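To make the override semantics concrete, the merge plausibly behaves like the following sketch; the function below is illustrative, not the repo's actual implementation:

```python
def merge_job(default, override):
    """Recursively overlay `override` onto `default`, field by field.

    Illustrative sketch only; the repo's actual merge logic may differ.
    """
    merged = dict(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_job(merged[key], value)  # descend into nested sections
        else:
            merged[key] = value  # scalars, lists, etc. replace the default outright
    return merged
```

Under such a scheme, a job file that sets only, say, modelSearch.params inherits every other field from conf/common/pipeline.json.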

Sections

Job files have the following structure:

  • globalParams - Parameters used across multiple pipeline stages
  • database - All db config and table names
  • loadData - Stage converting raw data into coherent light curves
  • preprocessData - Stage cleaning and preprocessing light curves
  • extractFeatures - Stage extracting features from cleaned light curves
  • postprocessFeatures - Stage further processing extracted features
  • modelSearch - Stage testing several ML models with differing hyperparameters
    • function - search function name
    • model - ML model spec including non-searched parameters
    • params - parameters controlling the model search
  • serialization - Stage persisting ML model and metadata to disk
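Concretely, a skeleton job might look like the following, shown as a Python literal mirroring the JSON layout. The section names follow the list above, but every value is a hypothetical placeholder, not a real default:

```python
# Hypothetical job-file skeleton; section names match the docs above, but
# all values are made-up placeholders.
JOB_SKELETON = {
    "globalParams": {"randomSeed": 42},
    "database": {"dbPath": "data/lcml.db"},
    "loadData": {"skip": False, "params": {}, "writeTable": "raw_lcs"},
    "preprocessData": {"skip": False, "params": {}, "writeTable": "clean_lcs"},
    "extractFeatures": {"skip": False, "params": {}, "writeTable": "features"},
    "postprocessFeatures": {"skip": True, "params": {}, "writeTable": "features_post"},
    "modelSearch": {
        "function": "gridSearch",           # search function name
        "model": {"name": "randomForest"},  # model spec incl. non-searched params
        "params": {"folds": 5},             # parameters controlling the search
    },
    "serialization": {"params": {"modelPath": "models/out.pkl"}},
}
```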

Pipeline 'stages' are customizable processors. Each stage definition has the following components:

  • skip - Boolean determining whether the stage should be skipped
  • params - stage-specific parameters
  • writeTable - name of db table to which output is written
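As an illustration of how these three fields could drive a stage, here is a hedged sketch of a driver; the function and its wiring are assumptions, not the repo's actual code:

```python
def run_stage(stage_conf, stage_fn, db):
    """Hypothetical driver: honor `skip`, pass `params`, persist to `writeTable`."""
    if stage_conf.get("skip", False):
        return  # stage disabled in the job file
    result = stage_fn(**stage_conf.get("params", {}))  # stage-specific parameters
    db.write(stage_conf["writeTable"], result)  # output table named in the job file
```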

Example Jobs

Some representative job files provided in this repo include:

  • local/supervised/fast_macho.json - Runs a tiny portion of the MACHO dataset through all supervised stages. Useful for pipeline debugging and integration testing.
  • local/supervised/macho.json - Full supervised learning pipeline for MACHO dataset. Uses feets library for feature extraction and random forests for classification.
  • local/supervised/ogle3.json - Ditto for OGLE3
  • local/unsupervised/macho.json - Unsupervised learning pipeline for MACHO focused on Mini-batch KMeans and Agglomerative clustering
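Any of these can be fed to the entry point described above, e.g. (assuming the files live under conf/, as in the earlier example):

python3 lcml/pipeline/run_pipeline.py --path conf/local/supervised/fast_macho.json --logFileName fast_macho.log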

Other Scripts

  • lcml.data.acquisistion - Scripts used to acquire and/or process various datasets including MACHO, OGLE3, Catalina, and Gaia
  • lcml.poc - One-off proof-of-concept scripts for various libraries

Logging Config

The LoggingManager class allows for convenient customization of Python Logger objects. The default logging config is specified in conf/common/logging.json. This config should contain the following main keys:

  • basicConfig - values passed to logging.basicConfig
  • handlers - handler definitions with a type attribute, which may be either stream or file
  • modules - list of module specific logger level settings

Main modules should initialize the manager by invoking LoggingManager.initLogging at the start of execution, before any logger objects are created.
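For orientation only, a config with these keys could map onto the standard library roughly as follows; the repo's LoggingManager surely differs in detail, and the entry shape noted in the comment is an assumption:

```python
import json
import logging

# Rough sketch, not the repo's LoggingManager: load the JSON config and apply
# its "basicConfig" and "modules" sections via the standard logging module.
with open("conf/common/logging.json") as f:
    conf = json.load(f)

logging.basicConfig(**conf.get("basicConfig", {}))
for mod in conf.get("modules", []):
    # Assumed entry shape: {"name": "lcml.pipeline", "level": "DEBUG"}
    logging.getLogger(mod["name"]).setLevel(mod["level"])
```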

License

MIT License

