lasso-net / lassonet

Feature selection in neural networks


lassonet as an unsupervised feature selection algorithm (LassoNetAutoEncoder does not exist!)

B-Seif opened this issue

Hi,
I would like to use lassonet as an unsupervised feature selection algorithm, but I can't find an example that shows how to do this in a simple way.
The only script that shows a reconstruction example is mnist_ae.py, but it doesn't work (I get an error: LassoNetAutoEncoder does not exist!).

My use case:
I have an input matrix without labels, and I want to have a new reduced matrix with only 30% of the important features.

It corresponds to what I'm looking for, but unfortunately I can't get the results because of this error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [374], in <cell line: 1>()
----> 1 path = model.path(X_train, X_train)

File ~/anaconda3/envs/ieee_vit/lib/python3.9/site-packages/lassonet/interfaces.py:468, in BaseLassoNet.path(self, X, y, X_val, y_val, lambda_seq, lambda_max, return_state_dicts, callback)
    465     is_dense = False
    466     if current_lambda / lambda_start < 2:
    467         warnings.warn(
--> 468             f"lambda_start={self.lambda_start:.3f} "
    469             "might be too large.\n"
    470             f"Features start to disappear at {current_lambda=:.3f}."
    471         )
    473 hist.append(last)
    474 if callback is not None:

ValueError: Unknown format code 'f' for object of type 'str'

Do you have an idea? I'm using Python 3.9.12, and the code that generated the error is:

model = LassoNetRegressor(M=30, n_iters=(300, 500), path_multiplier=1.05)
path = model.path(X_train, X_train)

Thanks

My last fix for #18 was not working properly. Can you try again after updating lassonet to the latest version?

How can I get the latest version of lassonet? Maybe by using pip install lassonet? I did that and the error persists.
Do you have an idea of how to overcome this problem?

pip install -U lassonet

-U is for upgrade :)

I still have the same error :(

Can you paste the error?
I literally changed the code in d76a326.

OK, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [419], in <cell line: 2>()
      1 model = LassoNetRegressor(M=30, n_iters=(300,500), path_multiplier=1.05)
----> 2 path = model.path(X_train, X_train)

File ~/anaconda3/envs/ieee_vit/lib/python3.9/site-packages/lassonet/interfaces.py:468, in BaseLassoNet.path(self, X, y, X_val, y_val, lambda_seq, lambda_max, return_state_dicts, callback)
    465     is_dense = False
    466     if current_lambda / lambda_start < 2:
    467         warnings.warn(
--> 468             f"lambda_start={self.lambda_start:.3f} "
    469             "might be too large.\n"
    470             f"Features start to disappear at {current_lambda=:.3f}."
    471         )
    473 hist.append(last)
    474 if callback is not None:

ValueError: Unknown format code 'f' for object of type 'str'

You clearly don't have the latest version. Maybe uninstall lassonet and reinstall?

Maybe pip install "lassonet>=0.0.12" ?

I have the latest version, 0.0.12.

Maybe you have a problem with your envs. I checked and the latest version looks like this:

[screenshot of the fixed code in interfaces.py]
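One way to check for that kind of environment mismatch (plain pip tooling, nothing lassonet-specific) is to run pip through the same interpreter that executes the notebook, so you see which installation is actually being imported:

python -m pip show lassonet

If this prints an older version than the one you just installed, pip and the notebook are using different environments.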

For the record, here is a minimal example for auto-encoders:

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

from lassonet import LassoNetRegressor

# load an unlabeled design matrix and standardize it
X, _ = fetch_california_housing(return_X_y=True)
X = StandardScaler().fit_transform(X)

# passing X as both input and target turns the regressor into an auto-encoder
model = LassoNetRegressor(verbose=2)
path = model.path(X, X)
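To get back to the original use case (a reduced matrix keeping only ~30% of the features), here is a sketch building on the example above. It assumes each entry of path carries a boolean selected tensor, as the p.selected.sum() output further down in this thread suggests:

import numpy as np

# keep roughly 30% of the input features
target = int(0.3 * X.shape[1])

# pick the path entry whose number of surviving features is closest to the target
best = min(path, key=lambda p: abs(p.selected.sum().item() - target))
mask = best.selected.cpu().numpy().astype(bool)

# reduced matrix containing only the selected columns
X_reduced = X[:, mask]
print(X_reduced.shape)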

Hi, it works for me.
I have another question regarding the use of LassoNet in the supervised paradigm: could we use or adapt this algorithm in the multilabel setting?

Sure. You would need to inherit from the base lassonet class and change the loss function as well as some casting functions.

Look at interfaces.py and how we implemented classifiers.

I would be happy to review a PR implementing multilabel classification.
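For what it's worth, here is a rough sketch of such a subclass. It assumes the hooks that the classifier overrides in interfaces.py (criterion, _convert_y, _output_shape) and a self.model attribute holding the fitted torch module; the exact names and signatures may differ between versions:

import torch
from lassonet.interfaces import BaseLassoNet

class LassoNetMultiLabelClassifier(BaseLassoNet):
    """Hypothetical multilabel variant: y is an (n_samples, n_labels) 0/1 matrix."""

    # one independent binary decision per label instead of a softmax over classes
    criterion = torch.nn.BCEWithLogitsLoss(reduction="mean")

    def _convert_y(self, y):
        # BCEWithLogitsLoss expects float targets (assumed casting hook)
        return torch.FloatTensor(y).to(self.device)

    def _output_shape(self, y):
        # one output neuron per label (assumed hook for the network's output size)
        return y.shape[1]

    def predict(self, X):
        # self.model is assumed to be the fitted torch module
        with torch.no_grad():
            logits = self.model(torch.FloatTensor(X).to(self.device))
        return (logits > 0).long().cpu().numpy()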

Hi,
I ran LassoNet to reconstruct some relatively large data (17000 x 8000). It runs for hours, and then my terminal shows me that the process was killed. I also often get this warning message:

.local/lib/python3.9/site-packages/lassonet/interfaces.py:467: UserWarning: lambda_start=429496.730 (selected automatically) might be too large.
Features start to disappear at current_lambda=429496.730.
  warnings.warn(
Killed

Do you have an idea to explain this? Is there a hyperparameter that influences this? Below is my code:

import time
import numpy as np
from lassonet import LassoNetRegressor

# randomly sampled hyperparameters for this run
M = np.random.uniform(0, 10_000)
path_multiplier = np.random.uniform(1.01, 1.5)
hidden1 = np.random.randint(10, 100)
hidden2 = np.random.randint(100, 200)

start_time = time.time()
model = LassoNetRegressor(M=M, path_multiplier=path_multiplier, hidden_dims=(hidden1, hidden2))
path = model.path(X_train, X_train)
tmp = time.time() - start_time

You should normalize your features first.

The data is already normalized!

Can I share the code and data with you so you can take a look?

Sure, my email is on my personal page!

Ok, thanks

import pandas as pd
from lassonet import LassoNetRegressor
from lassonet import plot_path

from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# read data
X_train = pd.read_csv("eurlex-ev-fold1-train.arff.csv").iloc[:, : 8993 - 3993].values
X_train = StandardScaler().fit_transform(X_train)
# reconstruction
model = LassoNetRegressor(
    lambda_start=5e-1,
    path_multiplier=1.05,
    n_iters=(20, 5),
    verbose=2,
)
path = model.path(X_train, X_train)
plot_path(model, path, X_train, X_train)
plt.show()

[Figure 1: plot_path output on the training data]

I ran LassoNet successfully using those parameters. I plotted the path on the training dataset, so the performance is best with all the features.

In [12]: [(p.selected.sum().item(), p.val_loss)for p in path]
Out[12]: 
[(5000, 0.7403443455696106),
 (5000, 0.7388657331466675),
 (5000, 0.7372983694076538),
 (5000, 0.7356559634208679),
 (5000, 0.7339461445808411),
 (5000, 0.732173502445221),
 (5000, 0.7303400635719299),
 (5000, 0.728447437286377),
 (5000, 0.7264959216117859),
 (5000, 0.724486231803894),
 (5000, 0.7224189639091492),
 (5000, 0.7202948331832886),
 (5000, 0.7181147336959839),
 (5000, 0.715880274772644),
 (5000, 0.7135931849479675),
 (5000, 0.7112559676170349),
 (5000, 0.7088716626167297),
 (5000, 0.7064440846443176),
 (5000, 0.7039777636528015),
 (5000, 0.7014783620834351),
 (5000, 0.6989524960517883),
 (5000, 0.6964077949523926),
 (5000, 0.6938537359237671),
 (5000, 0.6913006901741028),
 (5000, 0.6887612342834473),
 (5000, 0.6862493753433228),
 (5000, 0.6837817430496216),
 (5000, 0.6813769936561584),
 (5000, 0.6790563464164734),
 (5000, 0.6768444776535034),
 (5000, 0.6747689843177795),
 (5000, 0.6728613376617432),
 (5000, 0.6711570024490356),
 (5000, 0.6696965098381042),
 (5000, 0.6685250997543335),
 (5000, 0.6676939725875854),
 (5000, 0.6672609448432922),
 (5000, 0.6672911047935486),
 (5000, 0.6678575277328491),
 (5000, 0.6690409779548645),
 (5000, 0.6709340214729309),
 (5000, 0.6736408472061157),
 (5000, 0.6772762537002563),
 (5000, 0.6819731593132019),
 (5000, 0.687872588634491),
 (5000, 0.6951332092285156),
 (5000, 0.7039183974266052),
 (5000, 0.714412271976471),
 (5000, 0.7267968654632568),
 (5000, 0.7412514090538025),
 (5000, 0.7578516602516174),
 (5000, 0.7759532332420349),
 (5000, 0.7900420427322388),
 (5000, 0.7914730906486511),
 (5000, 0.7910680770874023),
 (5000, 0.7905250787734985),
 (5000, 0.7899476289749146),
 (5000, 0.7894152402877808),
 (5000, 0.7889501452445984),
 (5000, 0.7885493636131287),
 (5000, 0.7882020473480225),
 (5000, 0.7878993153572083),
 (5000, 0.7876302003860474),
 (5000, 0.7873923182487488),
 (5000, 0.7871778607368469),
 (5000, 0.7869775295257568),
 (5000, 0.7867907285690308),
 (5000, 0.7866113185882568),
 (5000, 0.7864405512809753),
 (5000, 0.7862793803215027),
 (5000, 0.7861262559890747),
 (5000, 0.785978376865387),
 (5000, 0.7858325839042664),
 (5000, 0.785689651966095),
 (5000, 0.7855460047721863),
 (5000, 0.7854057550430298),
 (5000, 0.7852645516395569),
 (5000, 0.7851291298866272),
 (5000, 0.7849962115287781),
 (5000, 0.7848662734031677),
 (5000, 0.7847418189048767),
 (5000, 0.7846243381500244),
 (5000, 0.7845121622085571),
 (5000, 0.7844087481498718),
 (5000, 0.7843157052993774),
 (5000, 0.7842352986335754),
 (5000, 0.7841715812683105),
 (5000, 0.7841252684593201),
 (5000, 0.7840957045555115),
 (5000, 0.7840867042541504),
 (5000, 0.7841042876243591),
 (4999, 0.7841560244560242),
 (4998, 0.7842445373535156),
 (4988, 0.7843688130378723),
 (4967, 0.7845369577407837),
 (4916, 0.7847509384155273),
 (4835, 0.7850145101547241),
 (4691, 0.7853205800056458),
 (4458, 0.7856589555740356),
 (4089, 0.7860158681869507),
 (3566, 0.7863755822181702),
 (2730, 0.7866988182067871),
 (1757, 0.7869724631309509),
 (817, 0.7871779203414919),
 (266, 0.7872892022132872),
 (53, 0.7873258590698242),
 (11, 0.7873356938362122),
 (1, 0.7873363494873047),
 (0, 0.7873362302780151)]

Maybe you could run the code above but call plot_path on a different fold. This way you will know if the model was simply overfitting. The validation loss (computed on a random subset of the data) indicates that the performance is poor.
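A sketch of that check, reusing the API from the snippet above (the split and the variable names are illustrative):

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from lassonet import LassoNetRegressor, plot_path

# hold out a fold that the path never sees during training
X_fit, X_holdout = train_test_split(X_train, test_size=0.2, random_state=0)

model = LassoNetRegressor(
    lambda_start=5e-1,
    path_multiplier=1.05,
    n_iters=(20, 5),
    verbose=2,
)
path = model.path(X_fit, X_fit)

# if the held-out reconstruction loss degrades much faster than the training
# curve above, the dense model was overfitting
plot_path(model, path, X_holdout, X_holdout)
plt.show()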