lasso-net / lassonet

Feature selection in neural networks


lassonet as an unsupervised feature selection algorithm (LassoNetAutoEncoder does not exist!)

B-Seif opened this issue

Hi,
I would like to use lassonet as an unsupervised feature selection algorithm, but I can't find an example that shows how to do this in a simple way.
The only script that shows a reconstruction example is mnist_ae.py, but it doesn't work (I get an error: LassoNetAutoEncoder does not exist!).

My use case:
I have an input matrix without labels, and I want to have a new reduced matrix with only 30% of the important features.

It corresponds to what I'm looking for, but unfortunately I can't get the results because of this error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [374], in <cell line: 1>()
----> 1 path = model.path(X_train, X_train)

File ~/anaconda3/envs/ieee_vit/lib/python3.9/site-packages/lassonet/interfaces.py:468, in BaseLassoNet.path(self, X, y, X_val, y_val, lambda_seq, lambda_max, return_state_dicts, callback)
    465     is_dense = False
    466     if current_lambda / lambda_start < 2:
    467         warnings.warn(
--> 468             f"lambda_start={self.lambda_start:.3f} "
    469             "might be too large.\n"
    470             f"Features start to disappear at {current_lambda=:.3f}."
    471         )
    473 hist.append(last)
    474 if callback is not None:

ValueError: Unknown format code 'f' for object of type 'str'

Do you have an idea? I'm using Python 3.9.12, and the code that generated the error is:

model = LassoNetRegressor(M=30, n_iters=(300, 500), path_multiplier=1.05)
path = model.path(X_train, X_train)

Thanks

My last fix for #18 was not working properly. Can you try again after updating lassonet to the latest version?

How can I get the latest version of lassonet? Maybe by using pip install lassonet? I did that and the error persists.
Do you have an idea of how to overcome this problem?

pip install -U lassonet

-U is for upgrade :)

I still have the same error :(

Can you paste the error?
I literally changed the code in d76a326.

OK, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [419], in <cell line: 2>()
      1 model = LassoNetRegressor(M=30, n_iters=(300,500), path_multiplier=1.05)
----> 2 path = model.path(X_train, X_train)

File ~/anaconda3/envs/ieee_vit/lib/python3.9/site-packages/lassonet/interfaces.py:468, in BaseLassoNet.path(self, X, y, X_val, y_val, lambda_seq, lambda_max, return_state_dicts, callback)
    465     is_dense = False
    466     if current_lambda / lambda_start < 2:
    467         warnings.warn(
--> 468             f"lambda_start={self.lambda_start:.3f} "
    469             "might be too large.\n"
    470             f"Features start to disappear at {current_lambda=:.3f}."
    471         )
    473 hist.append(last)
    474 if callback is not None:

ValueError: Unknown format code 'f' for object of type 'str'

You clearly don't have the latest version. Maybe uninstall lassonet and reinstall?

Maybe pip install "lassonet>=0.0.12" ?

I have the latest version, 0.0.12.

Maybe you have a problem with your envs. I checked and the latest version looks like this:

[screenshot of the fixed code in interfaces.py]
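One way to check for that kind of environment mismatch (plain pip tooling, nothing lassonet-specific) is to run pip through the same interpreter that executes the notebook, so you see which installation is actually being imported:

python -m pip show lassonet

If this prints an older version than the one you just installed, pip and the notebook are using different environments.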

For the record, here is a minimal example for auto-encoders:

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

from lassonet import LassoNetRegressor

# load an unlabeled design matrix and standardize it
X, _ = fetch_california_housing(return_X_y=True)
X = StandardScaler().fit_transform(X)

# passing X as both input and target turns the regressor into an auto-encoder
model = LassoNetRegressor(verbose=2)
path = model.path(X, X)
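To get back to the original use case (a reduced matrix keeping only ~30% of the features), here is a sketch building on the example above. It assumes each entry of path carries a boolean selected tensor, as the p.selected.sum() output further down in this thread suggests:

import numpy as np

# keep roughly 30% of the input features
target = int(0.3 * X.shape[1])

# pick the path entry whose number of surviving features is closest to the target
best = min(path, key=lambda p: abs(p.selected.sum().item() - target))
mask = best.selected.cpu().numpy().astype(bool)

# reduced matrix containing only the selected columns
X_reduced = X[:, mask]
print(X_reduced.shape)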

Hi, it works for me.
I have another question regarding the use of LassoNet in the supervised paradigm: could we use or adapt this algorithm in the multilabel setting?

Sure. You would need to inherit from the base lassonet class and change the loss function as well as some casting functions.

Look at interfaces.py and how we implemented classifiers.

I would be happy to review a PR implementing multilabel classification.
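For what it's worth, here is a rough sketch of such a subclass. It assumes the hooks that the classifier overrides in interfaces.py (criterion, _convert_y, _output_shape) and a self.model attribute holding the fitted torch module; the exact names and signatures may differ between versions:

import torch
from lassonet.interfaces import BaseLassoNet

class LassoNetMultiLabelClassifier(BaseLassoNet):
    """Hypothetical multilabel variant: y is an (n_samples, n_labels) 0/1 matrix."""

    # one independent binary decision per label instead of a softmax over classes
    criterion = torch.nn.BCEWithLogitsLoss(reduction="mean")

    def _convert_y(self, y):
        # BCEWithLogitsLoss expects float targets (assumed casting hook)
        return torch.FloatTensor(y).to(self.device)

    def _output_shape(self, y):
        # one output neuron per label (assumed hook for the network's output size)
        return y.shape[1]

    def predict(self, X):
        # self.model is assumed to be the fitted torch module
        with torch.no_grad():
            logits = self.model(torch.FloatTensor(X).to(self.device))
        return (logits > 0).long().cpu().numpy()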

Hi,
I ran LassoNet to reconstruct some relatively large data (17000 x 8000). It runs for hours, and then my terminal shows me that the process was killed. I also often get this warning message:

.local/lib/python3.9/site-packages/lassonet/interfaces.py:467: UserWarning: lambda_start=429496.730 (selected automatically) might be too large.
Features start to disappear at current_lambda=429496.730.
  warnings.warn(
Killed

Do you have an idea to explain this? Is there a hyperparameter that influences this? Below is my code:

import time
import numpy as np
from lassonet import LassoNetRegressor

# randomly sampled hyperparameters for this run
M = np.random.uniform(0, 10_000)
path_multiplier = np.random.uniform(1.01, 1.5)
hidden1 = np.random.randint(10, 100)
hidden2 = np.random.randint(100, 200)

start_time = time.time()
model = LassoNetRegressor(M=M, path_multiplier=path_multiplier, hidden_dims=(hidden1, hidden2))
path = model.path(X_train, X_train)
tmp = time.time() - start_time

You should normalize your features first.

The data is already normalized!

Can I share the code and data with you so you can take a look?

Sure, my email is on my personal page!

Ok, thanks

import pandas as pd
from lassonet import LassoNetRegressor
from lassonet import plot_path

from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# read data
X_train = pd.read_csv("eurlex-ev-fold1-train.arff.csv").iloc[:, : 8993 - 3993].values
X_train = StandardScaler().fit_transform(X_train)
# reconstruction
model = LassoNetRegressor(
    lambda_start=5e-1,
    path_multiplier=1.05,
    n_iters=(20, 5),
    verbose=2,
)
path = model.path(X_train, X_train)
plot_path(model, path, X_train, X_train)
plt.show()

[Figure 1: plot_path output on the training data]

I ran LassoNet successfully using those parameters. I plotted the path on the training dataset, so the performance is best with all the features.

In [12]: [(p.selected.sum().item(), p.val_loss)for p in path]
Out[12]: 
[(5000, 0.7403443455696106),
 (5000, 0.7388657331466675),
 (5000, 0.7372983694076538),
 (5000, 0.7356559634208679),
 (5000, 0.7339461445808411),
 (5000, 0.732173502445221),
 (5000, 0.7303400635719299),
 (5000, 0.728447437286377),
 (5000, 0.7264959216117859),
 (5000, 0.724486231803894),
 (5000, 0.7224189639091492),
 (5000, 0.7202948331832886),
 (5000, 0.7181147336959839),
 (5000, 0.715880274772644),
 (5000, 0.7135931849479675),
 (5000, 0.7112559676170349),
 (5000, 0.7088716626167297),
 (5000, 0.7064440846443176),
 (5000, 0.7039777636528015),
 (5000, 0.7014783620834351),
 (5000, 0.6989524960517883),
 (5000, 0.6964077949523926),
 (5000, 0.6938537359237671),
 (5000, 0.6913006901741028),
 (5000, 0.6887612342834473),
 (5000, 0.6862493753433228),
 (5000, 0.6837817430496216),
 (5000, 0.6813769936561584),
 (5000, 0.6790563464164734),
 (5000, 0.6768444776535034),
 (5000, 0.6747689843177795),
 (5000, 0.6728613376617432),
 (5000, 0.6711570024490356),
 (5000, 0.6696965098381042),
 (5000, 0.6685250997543335),
 (5000, 0.6676939725875854),
 (5000, 0.6672609448432922),
 (5000, 0.6672911047935486),
 (5000, 0.6678575277328491),
 (5000, 0.6690409779548645),
 (5000, 0.6709340214729309),
 (5000, 0.6736408472061157),
 (5000, 0.6772762537002563),
 (5000, 0.6819731593132019),
 (5000, 0.687872588634491),
 (5000, 0.6951332092285156),
 (5000, 0.7039183974266052),
 (5000, 0.714412271976471),
 (5000, 0.7267968654632568),
 (5000, 0.7412514090538025),
 (5000, 0.7578516602516174),
 (5000, 0.7759532332420349),
 (5000, 0.7900420427322388),
 (5000, 0.7914730906486511),
 (5000, 0.7910680770874023),
 (5000, 0.7905250787734985),
 (5000, 0.7899476289749146),
 (5000, 0.7894152402877808),
 (5000, 0.7889501452445984),
 (5000, 0.7885493636131287),
 (5000, 0.7882020473480225),
 (5000, 0.7878993153572083),
 (5000, 0.7876302003860474),
 (5000, 0.7873923182487488),
 (5000, 0.7871778607368469),
 (5000, 0.7869775295257568),
 (5000, 0.7867907285690308),
 (5000, 0.7866113185882568),
 (5000, 0.7864405512809753),
 (5000, 0.7862793803215027),
 (5000, 0.7861262559890747),
 (5000, 0.785978376865387),
 (5000, 0.7858325839042664),
 (5000, 0.785689651966095),
 (5000, 0.7855460047721863),
 (5000, 0.7854057550430298),
 (5000, 0.7852645516395569),
 (5000, 0.7851291298866272),
 (5000, 0.7849962115287781),
 (5000, 0.7848662734031677),
 (5000, 0.7847418189048767),
 (5000, 0.7846243381500244),
 (5000, 0.7845121622085571),
 (5000, 0.7844087481498718),
 (5000, 0.7843157052993774),
 (5000, 0.7842352986335754),
 (5000, 0.7841715812683105),
 (5000, 0.7841252684593201),
 (5000, 0.7840957045555115),
 (5000, 0.7840867042541504),
 (5000, 0.7841042876243591),
 (4999, 0.7841560244560242),
 (4998, 0.7842445373535156),
 (4988, 0.7843688130378723),
 (4967, 0.7845369577407837),
 (4916, 0.7847509384155273),
 (4835, 0.7850145101547241),
 (4691, 0.7853205800056458),
 (4458, 0.7856589555740356),
 (4089, 0.7860158681869507),
 (3566, 0.7863755822181702),
 (2730, 0.7866988182067871),
 (1757, 0.7869724631309509),
 (817, 0.7871779203414919),
 (266, 0.7872892022132872),
 (53, 0.7873258590698242),
 (11, 0.7873356938362122),
 (1, 0.7873363494873047),
 (0, 0.7873362302780151)]

Maybe you could run the code above but call plot_path on a different fold. This way you will know if the model was simply overfitting. The validation loss (computed on a random subset of the data) indicates that the performance is poor.
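A sketch of that check, reusing the API from the snippet above (the split and the variable names are illustrative):

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from lassonet import LassoNetRegressor, plot_path

# hold out a fold that the path never sees during training
X_fit, X_holdout = train_test_split(X_train, test_size=0.2, random_state=0)

model = LassoNetRegressor(
    lambda_start=5e-1,
    path_multiplier=1.05,
    n_iters=(20, 5),
    verbose=2,
)
path = model.path(X_fit, X_fit)

# if the held-out reconstruction loss degrades much faster than the training
# curve above, the dense model was overfitting
plot_path(model, path, X_holdout, X_holdout)
plt.show()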