F-Sossi / ML_Project_Cell_Pertubations

CSS581_Group8_Project_Cell_Pertubations

https://www.kaggle.com/competitions/open-problems-single-cell-perturbations

Running the project

Download the data from the Kaggle competition and populate the data folder according to the file structure below. Then install the dependencies:

pip install -r requirements.txt

File structure for the project, including the data folder

├── data
│   ├── adata_excluded_ids.csv
│   ├── adata_obs_meta.csv
│   ├── adata_train.parquet
│   ├── de_df.csv
│   ├── de_train_clustered.parquet
│   ├── de_train.parquet
│   ├── de_train_updated.parquet
│   ├── id_map.csv
│   ├── lincs_id_compound_mapping.parquet
│   ├── model_predictions_vs_actual.csv
│   ├── multiome_obs_meta.csv
│   ├── multiome_train.parquet
│   ├── multiome_var_meta.csv
│   └── sample_submission.csv
├── encoders
├── models
├── nn_auto_rev2
│   ├── data_preprocessing.py
│   ├── main.py
│   ├── model.py
│   ├── train.py
│   └── utils.py
├── nn_only_src
│   ├── data_processing.py
│   ├── evaluation.py
│   ├── main.py
│   ├── model.py
│   └── training.py
├── output
├── output.txt
├── README.md
└── requirements.txt
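
Before running either pipeline, it can help to confirm that the data folder is populated. The sketch below is not part of the repository: the helper name `missing_data_files` is hypothetical, and the default list of expected files is an assumed subset of the tree above, since this README does not say which files each script actually reads.

```python
from pathlib import Path

# Assumed subset of the files listed in the tree above; adjust as needed.
EXPECTED_DATA_FILES = [
    "de_train.parquet",
    "id_map.csv",
    "sample_submission.csv",
]


def missing_data_files(data_dir="data", expected=EXPECTED_DATA_FILES):
    """Return the names in `expected` that are not regular files in `data_dir`."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_data_files()` from the repository root returns an empty list when every expected file is in place.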

Repository for Predictive Modeling in Cellular Response Analysis


Model Architecture

Our predictive model, implemented in nn_only_src/model.py, is a transformer-based neural network, TransformerNN, developed using PyTorch.

  • TransformerNN: A subclass of PyTorch's nn.Module, TransformerNN uses multi-head attention with a configurable number of layers and a configurable dropout rate. It is designed to capture cell responses to different chemical compounds.

  • Sparse Features & Target Encoding: Cell type and chemical compound are encoded in two distinct ways, as target-encoded features and as sparse features.
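
As a rough illustration of the architecture described above, here is a minimal sketch of a transformer-based regressor in PyTorch. It is not the code in model.py: the constructor arguments, layer sizes, and the choice to treat each feature vector as a sequence of length one are all assumptions made for the example.

```python
import torch
from torch import nn


class TransformerNN(nn.Module):
    """Illustrative sketch only; argument names and sizes are assumptions."""

    def __init__(self, num_features, num_targets, d_model=128,
                 num_heads=4, num_layers=2, dropout=0.1):
        super().__init__()
        # Project the encoded cell type / compound features to the model width.
        self.embed = nn.Linear(num_features, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dropout=dropout, batch_first=True
        )
        # Stack of multi-head self-attention layers.
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.dropout = nn.Dropout(dropout)
        # One output per predicted gene.
        self.head = nn.Linear(d_model, num_targets)

    def forward(self, x):
        # x has shape (batch, num_features); add a sequence axis of length 1.
        h = self.embed(x).unsqueeze(1)
        h = self.encoder(h).squeeze(1)
        return self.head(self.dropout(h))
```

For example, `TransformerNN(num_features=10, num_targets=5)` maps a batch of shape (3, 10) to predictions of shape (3, 5).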


Training Process

Outlined in training.py, our training methodology includes:

  • Data Split: Using sklearn's train_test_split to create training and validation sets.
  • Training Mechanics: The train_model function manages the training epochs, learning rate, and device setup.
  • Optimization & Learning Rate Adjustment: Adam optimizer and PyTorch's ReduceLROnPlateau scheduler are used, along with the Huber loss function for stability and reduced outlier sensitivity.
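
The pieces listed above can be combined roughly as follows. This is a hedged sketch rather than the contents of training.py: the `train_model` signature, the 80/20 split, the scheduler settings, and the use of full-batch updates (chosen for brevity) are assumptions.

```python
import torch
from torch import nn
from sklearn.model_selection import train_test_split


def train_model(model, X, y, epochs=10, lr=1e-3, device=None):
    """Train `model` on arrays X, y and return the validation loss per epoch."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")

    # Hold out 20% of the rows for validation (split ratio is an assumption).
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    X_train, X_val, y_train, y_val = (
        torch.as_tensor(a, dtype=torch.float32, device=device)
        for a in (X_train, X_val, y_train, y_val)
    )

    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Halve the learning rate when the validation loss stops improving.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2
    )
    criterion = nn.HuberLoss()  # less sensitive to outliers than MSE

    history = []
    for _ in range(epochs):
        # One full-batch gradient step.
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        optimizer.step()

        # Evaluate on the held-out rows and let the scheduler react.
        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val), y_val).item()
        scheduler.step(val_loss)
        history.append(val_loss)
    return history
```

A real run would normally iterate over mini-batches with a DataLoader; the full-batch step keeps the example short.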

This first approach is implemented in the nn_only_src directory.


Second Approach: ComplexAutoencoder and ComplexNet

  • ComplexAutoencoder: Performs dimensionality reduction with an encoder, a latent space, and a decoder. It aims to retain the essential features of the data while limiting overfitting.
  • ComplexNet: Uses the latent space representations for prediction, combining linear layers, ReLU activations, dropout, and a transformer encoder layer.
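
To make the two components more concrete, the sketch below shows one possible shape for them in PyTorch. The class names match the ones above, but the layer widths, latent size, dropout placement, and constructor arguments are assumptions, not the definitions in the repository.

```python
import torch
from torch import nn


class ComplexAutoencoder(nn.Module):
    """Compresses gene-expression vectors to a latent code and back (sketch)."""

    def __init__(self, num_genes, latent_dim=64, hidden_dim=256, dropout=0.2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_genes, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_genes),
        )

    def forward(self, x):
        z = self.encoder(x)
        # Return both the reconstruction and the latent code.
        return self.decoder(z), z


class ComplexNet(nn.Module):
    """Maps encoded input features to a latent code (sketch)."""

    def __init__(self, num_features, latent_dim=64, d_model=128,
                 num_heads=4, dropout=0.2):
        super().__init__()
        self.input = nn.Sequential(
            nn.Linear(num_features, d_model), nn.ReLU(), nn.Dropout(dropout)
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dropout=dropout, batch_first=True
        )
        self.output = nn.Linear(d_model, latent_dim)

    def forward(self, x):
        # Treat each feature vector as a sequence of length 1.
        h = self.input(x).unsqueeze(1)
        h = self.transformer(h).squeeze(1)
        return self.output(h)
```

With matching `latent_dim` values, the output of ComplexNet can be passed directly to the autoencoder's decoder.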

Training Process:

  • Autoencoder Training: Focuses on optimizing the latent space representation.
  • ComplexNet Training: Concentrates on learning from the reduced feature space once the autoencoder has been trained.
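
The two stages can be sketched as below. The function name `train_two_stage` is hypothetical, the MSE loss and full-batch updates are assumptions for brevity, and any `nn.Module` with compatible shapes can stand in for the encoder, decoder, and predictor.

```python
import torch
from torch import nn


def train_two_stage(encoder, decoder, predictor, features, targets,
                    epochs=5, lr=1e-3):
    """Stage 1 fits the autoencoder; stage 2 fits the predictor on its latents.

    Expects epochs >= 1. Returns the final (reconstruction, latent) losses.
    """
    mse = nn.MSELoss()

    # Stage 1: train encoder and decoder to reconstruct the targets.
    ae_params = list(encoder.parameters()) + list(decoder.parameters())
    ae_opt = torch.optim.Adam(ae_params, lr=lr)
    encoder.train()
    decoder.train()
    for _ in range(epochs):
        ae_opt.zero_grad()
        recon_loss = mse(decoder(encoder(targets)), targets)
        recon_loss.backward()
        ae_opt.step()

    # Stage 2: freeze the encoder and use its codes as regression targets.
    encoder.eval()
    with torch.no_grad():
        latent_targets = encoder(targets)
    net_opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    predictor.train()
    for _ in range(epochs):
        net_opt.zero_grad()
        latent_loss = mse(predictor(features), latent_targets)
        latent_loss.backward()
        net_opt.step()

    return recon_loss.item(), latent_loss.item()
```

Only the predictor's parameters are updated in the second stage, so the latent space learned in the first stage stays fixed.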

Integration Steps:

  1. Data Preparation: Loading and preprocessing from id_map.csv.
  2. Model Setup: Loading ComplexAutoencoder and ComplexNet and setting both to evaluation mode.
  3. Feature Encoding & Prediction: Encoding features and predicting the latent space representation.
  4. Decoding and Gene Expression Prediction: Using the decoder to predict gene expressions.
  5. Post-Processing: Structuring predictions for submission.
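
The five steps above might look roughly like this in code. This is a sketch under stated assumptions: the function name `predict_submission` is hypothetical, the `id`, `cell_type`, and `sm_name` column names are assumed for id_map.csv, and one-hot encoding via `pd.get_dummies` stands in for whatever encoders the project actually saves.

```python
import pandas as pd
import torch
from torch import nn


def predict_submission(id_map, predictor, decoder, gene_names):
    """Turn an id_map DataFrame into a submission-shaped DataFrame (sketch)."""
    # 1. Data preparation: encode the categorical inputs (assumed columns).
    features = pd.get_dummies(id_map[["cell_type", "sm_name"]])
    features = features.to_numpy(dtype="float32")

    # 2. Model setup: disable dropout and other training-only behaviour.
    predictor.eval()
    decoder.eval()

    with torch.no_grad():
        # 3. Predict the latent representation from the encoded features.
        latent = predictor(torch.from_numpy(features))
        # 4. Decode the latent representation into gene expression values.
        genes = decoder(latent).numpy()

    # 5. Post-processing: one row per id, one column per gene.
    submission = pd.DataFrame(genes, columns=gene_names)
    submission.insert(0, "id", id_map["id"].to_numpy())
    return submission
```

In practice the encoding applied here has to produce the same columns, in the same order, as the encoding used during training; reusing a fitted encoder is the usual way to guarantee that.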

This second approach is implemented in the nn_auto_rev2 directory.


Results and Evaluation

Our submission to the Kaggle competition:

  • Performance Metric: Achieved an MRRMSE (mean rowwise root mean squared error; lower is better) of 0.822, ranking 749th.
  • Benchmark: The top score was 0.729 by N. Jean Kouagou.
  • Analysis: Our performance was likely limited by this being our first use of ComplexAutoencoder and ComplexNet and by project time constraints.
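
For reference, MRRMSE takes the root mean squared error of each row (one row per cell type and compound pair, one column per gene) and then averages those values over rows. A minimal NumPy sketch, with a hypothetical function name:

```python
import numpy as np


def mrrmse(y_true, y_pred):
    """Mean rowwise RMSE for 2-D arrays of shape (rows, genes)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # RMSE across the columns of each row, then the mean over rows.
    row_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=1))
    return float(row_rmse.mean())
```

For instance, if one row is off by 1 in every column and another by 3 in every column, the row RMSEs are 1 and 3 and the MRRMSE is 2.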

Future Directions

  • Generative Adversarial Networks: Exploring GANs for modeling cellular reactions.
  • Chemical Analysis Libraries: Augmenting data processing with tools like RDKit.
  • Training and Tuning Improvements: Advancing optimization techniques, loss functions, and network architectures.

This README documents our methods, results, and future plans in developing predictive models for cellular response analysis to chemical compounds.
