EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Home Page: http://epistasislab.github.io/tpot/


Pipeline Produced Before Generations Completed

gokhanonderaksu opened this issue · comments

Hello,

So I am running this code to get a pipeline using TPOT version 0.12.0:

from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

df = pd.read_excel('C:/Users/OneDrive/Desktop/KodSystems/TPOT/abc.xlsx')

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=42)

tpot = TPOTRegressor(generations=10, population_size=50, verbosity=2, random_state=42, n_jobs=-2, cv=10)
...

# perform the search

tpot.fit(X_train, y_train)

# export the best model

tpot.export('abc.py')

extracted_best_model = tpot.fitted_pipeline_.steps[-1][1]
extracted_best_model.fit(X_train, y_train)
print(extracted_best_model.feature_importances_)
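As a side note, the raw importances are easier to read when paired with their column names. A minimal sketch, using the importances printed below and placeholder feature names (in the script above, the real names would come from `X.columns`):

```python
# Pair each feature importance with its column name and rank them.
# The column names here are placeholders standing in for X.columns.
importances = [0.06836239, 0.08344129, 0.18414733, 0.25859585, 0.40545313]
columns = ["feat_1", "feat_2", "feat_3", "feat_4", "feat_5"]

# Sort features from most to least important.
ranked = sorted(zip(columns, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```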

However, it gives me a pipeline before the 10 generations are completed, as follows:

(7478, 5) (7478,)

Best pipeline: RandomForestRegressor(input_matrix, bootstrap=True, max_features=0.7500000000000001, min_samples_leaf=11, min_samples_split=9, n_estimators=100)
[0.06836239 0.08344129 0.18414733 0.25859585 0.40545313]

When I change random_state from 42 to 1, it does give me a pipeline after a full 10-generation run, but the same early stop happens on another dataset with shape (7478, 2061). I ran the same datasets with version 0.11.7 and didn't have any problems. What could be the reason, and what is the solution to this problem?

Thanks in advance!

I have a similar issue: there is an uncaught exception that terminates the training loop. I am not sure why this exception doesn't get raised and show a stack trace by default, but if I specifically extend the "try" block at base.py:813 to also catch other exceptions, I get:

Traceback (most recent call last):
  File "/home/myusername/.cache/pypoetry/virtualenvs/myproject--nQ0R-Yy-py3.11/lib/python3.11/site-packages/tpot/base.py", line 817, in fit
    self._pop, _ = eaMuPlusLambda(
                   ^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/myproject--nQ0R-Yy-py3.11/lib/python3.11/site-packages/tpot/gp_deap.py", line 255, in eaMuPlusLambda
    population[:] = toolbox.select(population + offspring, mu)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/myproject--nQ0R-Yy-py3.11/lib/python3.11/site-packages/deap/tools/emo.py", line 41, in selNSGA2
    assignCrowdingDist(front)
  File "/home/myusername/.cache/pypoetry/virtualenvs/myproject--nQ0R-Yy-py3.11/lib/python3.11/site-packages/deap/tools/emo.py", line 132, in assignCrowdingDist
    crowd.sort(key=lambda element: element[0][i])
  File "/home/myusername/.cache/pypoetry/virtualenvs/myproject--nQ0R-Yy-py3.11/lib/python3.11/site-packages/deap/tools/emo.py", line 132, in <lambda>
    crowd.sort(key=lambda element: element[0][i])
                                   ~~~~~~~~~~^^^
IndexError: tuple index out of range

So ind.fitness.values is a tuple of larger size for the first individual in the population, and a later individual has a smaller tuple, leading to the IndexError. Indeed, the reason is that some elements in the population have ind.fitness.valid == False, with an empty tuple for ind.fitness.values.

Not sure why this is.
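The failure mode can be reproduced without DEAP itself. A minimal sketch of the sort inside assignCrowdingDist (deap/tools/emo.py), which takes the objective count from the first individual and then indexes every fitness tuple by it:

```python
# Minimal reproduction of the failing sort in DEAP's assignCrowdingDist.
# An evaluated individual has a full fitness tuple; one that was never
# evaluated (fitness.valid == False) has an empty tuple.
fitnesses = [(1.0, 2.0), ()]  # second entry mimics an invalid fitness

crowd = [(fit, idx) for idx, fit in enumerate(fitnesses)]
nobj = len(fitnesses[0])  # objective count comes from the FIRST individual only

try:
    for i in range(nobj):
        # Sorting by element[0][i] indexes the empty tuple and fails.
        crowd.sort(key=lambda element: element[0][i])
except IndexError as exc:
    print(exc)  # tuple index out of range
```

This is why the crash only appears once an unevaluated individual survives into the selection step: the empty tuple is indexed by an objective position that only exists for evaluated individuals.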

I believe this may be the same thing happening in #1313

Hello, I've tried a couple of datasets that crashed early under 0.12.0 using version 0.12.1, and so far they run smoothly. Thanks so much for the quick response!