Novel’s method: The original method can be found here.
name importance: Mainly based on AmbrosM's notebook, with additional information added from mygene.
corr importance: The top 3 features most correlated with each target (see the sketch after this list).
rf importance: The top 128 most important features of the random forest model.
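A minimal sketch of how the "corr importance" selection could be computed (the function and array names are illustrative, not from the original code; `X` is cells x features, `Y` is cells x targets):

```python
import numpy as np

def top_corr_features(X, Y, k=3):
    """For each target column, return the indices of the k input features
    with the highest absolute Pearson correlation."""
    Xc = (X - X.mean(0)) / (X.std(0) + 1e-8)   # z-score features
    Yc = (Y - Y.mean(0)) / (Y.std(0) + 1e-8)   # z-score targets
    corr = Xc.T @ Yc / len(X)                  # (features x targets) correlations
    # k most correlated feature indices per target, shape (n_targets, k)
    return np.argsort(-np.abs(corr), axis=0)[:k].T
```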
3. Models
| Method | CV |
| --- | --- |
| Stacking | 0.89677 |
| GMNN | 0.89596 |
| NN (online) | 0.89580 |
| CNN | 0.89530 |
| Kernel Ridge | 0.89326 |
| LGBM | 0.89270 |
| CatBoost | 0.89100 |
GMNN: Gated Map Neural Network. A NN that tries to do something like Transformers and RNNs, but without using feature vectors.
CNN: Inspired by the tmp method here, with multi-dimensional convolution kernels added in the style of ResNet.
NN (online): A NN model based on a public Kaggle notebook.
Kernel Ridge: Inspired by the best solution of last year's competition. Ray Tune was used to optimize the hyperparameters.
CatBoost: A MultiOutputCatboostRegressor class that supports early stopping to prevent overfitting, unlike sklearn.multioutput.MultiOutputRegressor.
LGBM: A MultiOutputLGBMRegressor that likewise supports early stopping, unlike sklearn.multioutput.MultiOutputRegressor (a sketch of this wrapper follows the list).
Stacking: KNN, CNN, ridge, RF, CatBoost and GMNN in the first layer; only CNN, CatBoost and GMNN in the second; and a simple MLP as the last layer. To avoid overfitting, I used a special CV strategy that combines k-fold splitting by donor with out-of-fold (OOF) predictions (see the second sketch below).
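A minimal sketch of what the multi-output wrapper with early stopping might look like, shown for LGBM (the class internals are an assumption based on the description above; the CatBoost version would be analogous):

```python
import numpy as np
import lightgbm as lgb

class MultiOutputLGBMRegressor:
    """One LGBMRegressor per target column, each with its own early stopping.
    sklearn.multioutput.MultiOutputRegressor cannot do this because it does
    not forward a per-target eval set to the underlying estimators."""

    def __init__(self, **lgbm_params):
        self.lgbm_params = lgbm_params
        self.models = []

    def fit(self, X, Y, X_val, Y_val, stopping_rounds=50):
        self.models = []
        for j in range(Y.shape[1]):                      # one model per target
            model = lgb.LGBMRegressor(**self.lgbm_params)
            model.fit(X, Y[:, j],
                      eval_set=[(X_val, Y_val[:, j])],
                      callbacks=[lgb.early_stopping(stopping_rounds)])
            self.models.append(model)
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models])
```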
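And a sketch of the donor-wise k-fold / OOF idea for the first stacking layer, assuming a `donor` array of per-cell donor IDs and base models passed as factories (all names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def oof_stack_features(model_factories, X, Y, donor, n_splits=3):
    """Build out-of-fold predictions to use as inputs for the next layer.
    Splitting by donor keeps every cell of a donor in a single fold, so the
    OOF predictions never leak donor-specific information."""
    n_targets = Y.shape[1]
    oof = np.zeros((len(X), n_targets * len(model_factories)))
    for tr, va in GroupKFold(n_splits=n_splits).split(X, Y, groups=donor):
        for i, make_model in enumerate(model_factories):
            m = make_model().fit(X[tr], Y[tr])
            oof[va, i * n_targets:(i + 1) * n_targets] = m.predict(X[va])
    return oof  # feature matrix for the second layer / final MLP
```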
CV Results

| Fold | Model Ⅰ (valid 32606) | Model Ⅱ (valid 13176) | Model Ⅲ (valid 31800) |
| --- | --- | --- | --- |
| Fold 1 | 0.8989 | 0.8967 | 0.8947 |
| Fold 2 | 0.8995 | 0.8967 | 0.8951 |
| Fold 3 | 0.8985 | 0.8959 | 0.8949 |
| Fold Mean | 0.89897 | 0.89643 | 0.89490 |
| Model Mean | 0.89677 | - | - |
Ⅱ. Multi
1. Data preprocessing & Feature engineering
inputs:
- TF-IDF normalization
- np.log1p(data * 1e4)
- Tsvd -> 512
targets:
- Normalization -> mean = 0, std = 1
- Tsvd -> 1024
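A minimal sketch of this preprocessing, assuming a sparse `counts` matrix of raw inputs (cells x features) and a dense `targets` matrix; the exact TF-IDF formula is an assumption, the remaining steps follow the list above:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

def preprocess_inputs(counts):
    """TF-IDF normalize, apply log1p(data * 1e4), reduce to 512 dims."""
    tf = counts.multiply(1.0 / counts.sum(axis=1))          # term frequency
    idf = counts.shape[0] / (1 + (counts > 0).sum(axis=0))  # inverse doc freq (assumed form)
    x = sp.csr_matrix(tf.multiply(idf))
    x.data = np.log1p(x.data * 1e4)                         # np.log1p(data * 1e4)
    return TruncatedSVD(n_components=512).fit_transform(x)

def preprocess_targets(targets):
    """Standardize to mean 0 / std 1, then reduce to 1024 dims."""
    z = (targets - targets.mean(axis=0)) / (targets.std(axis=0) + 1e-8)
    svd = TruncatedSVD(n_components=1024)
    return svd.fit_transform(z), svd   # keep svd.components_ for the model below
```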
2. Models
GMNN: Gated Map Neural Network. The model outputs a 1024-dim vector, which is multiplied (dot product) with the constant tsvd.components_ matrix to obtain the final prediction; correl_loss is then used to compute the loss, and the gradients are back-propagated.
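A sketch of this training step in PyTorch: the 1024-dim output is projected back to target space through the frozen `tsvd.components_` matrix, and a per-cell Pearson-correlation loss is minimized (the exact correl_loss definition is an assumption based on the description):

```python
import torch

def correl_loss(pred, true, eps=1e-8):
    """Negative mean per-row Pearson correlation between prediction and target."""
    p = pred - pred.mean(dim=1, keepdim=True)
    t = true - true.mean(dim=1, keepdim=True)
    corr = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    return -corr.mean()

def train_step(model, optimizer, x, y_true, components):
    """components: frozen tsvd.components_ as a (1024, n_targets) tensor."""
    out = model(x)                  # (batch, 1024) latent prediction
    y_pred = out @ components       # dot product back to target space
    loss = correl_loss(y_pred, y_true)
    optimizer.zero_grad()
    loss.backward()                 # back-propagate the gradients
    optimizer.step()
    return loss.item()
```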