CSS581_Group8_Project_Cell_Pertubations

https://www.kaggle.com/competitions/open-problems-single-cell-perturbations

Running the project

Download the data from the Kaggle competition and place it in the data folder, following the file structure below. Then install the dependencies:

pip install -r requirements.txt

Repository file structure, including the data folder

├── data
│   ├── adata_excluded_ids.csv
│   ├── adata_obs_meta.csv
│   ├── adata_train.parquet
│   ├── de_df.csv
│   ├── de_train_clustered.parquet
│   ├── de_train.parquet
│   ├── de_train_updated.parquet
│   ├── id_map.csv
│   ├── lincs_id_compound_mapping.parquet
│   ├── model_predictions_vs_actual.csv
│   ├── multiome_obs_meta.csv
│   ├── multiome_train.parquet
│   ├── multiome_var_meta.csv
│   └── sample_submission.csv
├── encoders
├── models
├── nn_auto_rev2
│   ├── data_preprocessing.py
│   ├── main.py
│   ├── model.py
│   ├── train.py
│   └── utils.py
├── nn_only_src
│   ├── data_processing.py
│   ├── evaluation.py
│   ├── main.py
│   ├── model.py
│   └── training.py
├── output
├── output.txt
├── README.md
└── requirements.txt
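
Before running either pipeline, it can help to confirm that the data folder is populated. The following is a minimal sketch, not part of the repository: the helper name missing_data_files is hypothetical, and the file list is simply copied from the tree above. This README does not state which of these files come from Kaggle and which are produced by the project's own scripts, so a missing file is not necessarily an error.

```python
from pathlib import Path

# File names copied from the data folder listing above.
EXPECTED_DATA_FILES = [
    "adata_excluded_ids.csv",
    "adata_obs_meta.csv",
    "adata_train.parquet",
    "de_df.csv",
    "de_train_clustered.parquet",
    "de_train.parquet",
    "de_train_updated.parquet",
    "id_map.csv",
    "lincs_id_compound_mapping.parquet",
    "model_predictions_vs_actual.csv",
    "multiome_obs_meta.csv",
    "multiome_train.parquet",
    "multiome_var_meta.csv",
    "sample_submission.csv",
]


def missing_data_files(data_dir="data"):
    """Return the expected file names that are not present in data_dir.

    Hypothetical helper for illustration; it only checks for existence.
    """
    root = Path(data_dir)
    return [name for name in EXPECTED_DATA_FILES if not (root / name).is_file()]
```

Calling missing_data_files() from the repository root returns an empty list when every listed file is in place.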

Repository for Predictive Modeling in Cellular Response Analysis

Model Architecture

Our first approach, implemented in nn_only_src/model.py, is a transformer-based neural network, TransformerNN, built with PyTorch.

TransformerNN: A subclass of PyTorch's nn.Module with multi-head attention and a configurable number of layers and dropout rate. It is designed to capture cell responses to different chemical compounds.
Sparse Features & Target Encoding: Distinct representations are used for target encoding and sparse features, encoding cell type and chemical interactions.
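
To make the description above concrete, here is a minimal sketch of a transformer-based regressor of this kind. It is not the code in model.py: the class name TransformerNNSketch, the constructor arguments, the default sizes, and the choice to treat the cell type, the compound, and the target-encoded features as three tokens are all assumptions made for illustration.

```python
import torch
from torch import nn


class TransformerNNSketch(nn.Module):
    """Illustrative stand-in for TransformerNN; every size here is an assumption."""

    def __init__(self, n_cell_types, n_compounds, n_target_feats, n_genes,
                 d_model=64, n_heads=4, n_layers=2, dropout=0.1):
        super().__init__()
        # Sparse (categorical) inputs are mapped to learned embeddings.
        self.cell_emb = nn.Embedding(n_cell_types, d_model)
        self.compound_emb = nn.Embedding(n_compounds, d_model)
        # Dense target-encoded features are projected to the same width.
        self.target_proj = nn.Linear(n_target_feats, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_genes)

    def forward(self, cell_idx, compound_idx, target_feats):
        # Build a 3-token sequence per sample: (batch, 3, d_model).
        tokens = torch.stack(
            [
                self.cell_emb(cell_idx),
                self.compound_emb(compound_idx),
                self.target_proj(target_feats),
            ],
            dim=1,
        )
        encoded = self.encoder(tokens)
        # Mean-pool over the tokens, then predict one value per gene.
        return self.head(encoded.mean(dim=1))
```

With a batch of two samples, the output has shape (2, n_genes), one predicted value per gene for each cell type and compound pair.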

Training Process

The training procedure, in nn_only_src/training.py, includes:

Data Split: Using sklearn's train_test_split to create training and validation sets.
Training Mechanics: The train_model function manages the training epochs, learning rate, and device setup.
Optimization & Learning Rate Adjustment: The Adam optimizer and PyTorch's ReduceLROnPlateau scheduler are used, with the Huber loss function for stability and reduced sensitivity to outliers.
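
The three steps above fit together roughly as in the sketch below. It is an illustration under assumptions, not the contents of training.py: the function name train_model_sketch, the 80/20 split, the full-batch updates, and the scheduler settings (factor 0.5, patience 3) are all chosen here for brevity.

```python
import numpy as np
import torch
from torch import nn
from sklearn.model_selection import train_test_split


def train_model_sketch(model, X, y, epochs=20, lr=1e-3, device="cpu"):
    """Hypothetical stand-in for train_model; returns per-epoch validation losses."""
    # Data split: hold out 20% of the rows for validation (assumed ratio).
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    def to_tensor(array):
        return torch.as_tensor(array, dtype=torch.float32, device=device)

    X_tr, X_val, y_tr, y_val = map(to_tensor, (X_tr, X_val, y_tr, y_val))

    model = model.to(device)
    criterion = nn.HuberLoss(delta=1.0)  # quadratic near zero, linear for large errors
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3
    )

    history = []
    for _ in range(epochs):
        # One full-batch update per epoch (mini-batches omitted for brevity).
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(X_tr), y_tr)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val), y_val).item()
        # Lower the learning rate when the validation loss stops improving.
        scheduler.step(val_loss)
        history.append(val_loss)
    return history
```

The scheduler is stepped on the validation loss rather than the training loss, so the learning rate only drops when generalization stalls.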

Implemented in nn_only_src.

Second Approach: ComplexAutoencoder and ComplexNet

ComplexAutoencoder: Used for dimensionality reduction, comprising an encoder, a latent space, and a decoder. It retains the essential features of the data while limiting overfitting.
ComplexNet: Uses the latent space representations for predictions, combining linear layers, ReLU activation, dropout, and a transformer encoder layer.

Training Process:

Autoencoder Training: Focuses on optimizing latent space representation.
ComplexNet Training: Concentrates on learning from reduced feature space after autoencoder training.
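
The two training stages can be sketched as follows. This is a simplified illustration, not the code in nn_auto_rev2: TinyAutoencoder, fit, and two_stage_training are hypothetical names, the layer sizes and the MSE loss are assumptions, and the predictor here is a plain feed-forward network without the transformer encoder layer that ComplexNet includes.

```python
import torch
from torch import nn


class TinyAutoencoder(nn.Module):
    """Small stand-in for ComplexAutoencoder (sizes are assumptions)."""

    def __init__(self, n_genes, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 64), nn.ReLU(), nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_genes)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


def fit(module, inputs, targets, epochs=50, lr=1e-2):
    """Full-batch training loop with Adam and MSE loss (assumed choices)."""
    optimizer = torch.optim.Adam(module.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(module(inputs), targets)
        loss.backward()
        optimizer.step()
    return loss.item()


def two_stage_training(features, expression, latent_dim=8):
    # Stage 1: the autoencoder learns to reconstruct the expression matrix.
    autoencoder = TinyAutoencoder(expression.shape[1], latent_dim)
    fit(autoencoder, expression, expression)

    # Freeze the result and compute the latent code of every training row.
    autoencoder.eval()
    with torch.no_grad():
        latent_targets = autoencoder.encoder(expression)

    # Stage 2: a predictor maps input features to those latent codes.
    predictor = nn.Sequential(
        nn.Linear(features.shape[1], 32), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(32, latent_dim),
    )
    fit(predictor, features, latent_targets)
    return autoencoder, predictor
```

Training the predictor against latent codes rather than raw expression values means it only has to fit latent_dim outputs per sample instead of one output per gene.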

Integration Steps:

Data Preparation: Loading and preprocessing from id_map.csv.
Model Setup: Loading the trained ComplexAutoencoder and ComplexNet and setting them to evaluation mode.
Feature Encoding & Prediction: Encoding features and predicting latent space representation.
Decoding and Gene Expression Prediction: Using the decoder to predict gene expressions.
Post-Processing: Structuring predictions for submission.
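
The inference path described by these steps might look like the sketch below. It assumes the features have already been loaded and encoded, and it accepts any two nn.Module objects as the predictor and decoder. The function name predict_submission_rows and the "id" column name are assumptions for illustration, not names taken from the repository.

```python
import torch
from torch import nn


def predict_submission_rows(ids, features, predictor, decoder, gene_names):
    """Hypothetical helper: encoded features -> latent -> gene expression rows."""
    # Evaluation mode disables dropout so predictions are deterministic.
    predictor.eval()
    decoder.eval()
    with torch.no_grad():
        latent = predictor(features)   # predicted latent representation
        expression = decoder(latent)   # decoded gene expression values

    # Structure the predictions as one record per id, one key per gene.
    rows = []
    for row_id, values in zip(ids, expression.tolist()):
        row = {"id": row_id}
        row.update(zip(gene_names, values))
        rows.append(row)
    return rows
```

The returned list of dictionaries can be written out with csv.DictWriter or passed to a DataFrame constructor to produce the submission file.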

Implemented in nn_auto_rev2.

Results and Evaluation

Our submission in the Kaggle competition:

Performance Metric: Achieved an MRRMSE (mean rowwise root mean squared error; lower is better) of 0.822, ranking 749th.
Benchmark: The top score was 0.729 by N. Jean Kouagou.
Analysis: Our performance was limited by our first-time use of ComplexAutoencoder and ComplexNet and by project time constraints.
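
For reference, MRRMSE computes the root mean squared error of each row (one row per cell type and compound pair, one column per gene) and then averages those values over all rows. The pure-Python sketch below illustrates the metric; it is not the competition's scoring code, and the function name mrrmse is chosen here.

```python
import math


def mrrmse(y_true, y_pred):
    """Mean rowwise RMSE over two equally shaped lists of rows."""
    row_rmses = []
    for true_row, pred_row in zip(y_true, y_pred):
        squared_errors = [(t - p) ** 2 for t, p in zip(true_row, pred_row)]
        row_rmses.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(row_rmses) / len(row_rmses)


# Row 1 has RMSE 1.0 and row 2 has RMSE 3.0, so the mean is 2.0.
print(mrrmse([[0, 0], [0, 0]], [[1, 1], [3, 3]]))  # 2.0
```

Because the average is taken over per-row RMSE values, every row contributes equally to the score regardless of how large its individual errors are.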

Future Directions

Generative Adversarial Networks: Exploring GANs for modeling cellular reactions.
Chemical Analysis Libraries: Augmenting data processing with tools like RDKit.
Training and Tuning Improvements: Advancing optimization techniques, loss functions, and network architectures.

This README documents our methods, results, and future plans in developing predictive models for cellular response analysis to chemical compounds.

F-Sossi / ML_Project_Cell_Pertubations