qrdlgit / graph-of-thoughts

Based on the tree of thoughts paper


graph-of-thoughts

(Note that this was published months before the https://github.com/spcl/graph-of-thoughts repo & paper. I don't think they based their work on this repo, but some kind of acknowledgement would have been polite.)

The following is based on a paper that recently hit arXiv: "Tree of Thoughts", https://arxiv.org/abs/2305.10601

The concept is depth-first/breadth-first search over a tree of chains of thought generated by LLMs.

This 'graph of thoughts' approach is a somewhat different take on the paper: here it is used to autonomously improve an ML program.

It creates 3 alternative paths, chooses the best one, and tries to improve that. It loops recursively until Ctrl-C.
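In sketch form, the loop looks roughly like this. This is a minimal sketch with hypothetical helpers: generate_variants and score are stand-ins, not the actual get_best_model.py code.

import re
import subprocess
import sys

def score(path):
    # Run a candidate script and parse the r2_score it prints.
    out = subprocess.run([sys.executable, path], capture_output=True, text=True).stdout
    m = re.search(r"r2_score:\s*(-?\d+\.?\d*)", out)
    return float(m.group(1)) if m else float("-inf")

def generate_variants(path, n=3):
    # Placeholder: the real prompting and response parsing live in get_best_model.py.
    raise NotImplementedError

best_path, best_score = "base.py", score("base.py")
while True:                                          # loop until Ctrl-C
    candidates = generate_variants(best_path, n=3)   # branch: 3 alternative paths
    scored = [(score(p), p) for p in candidates]     # evaluate each candidate's r2
    best_candidate = max(scored)                     # (r2, path) of the best branch
    if best_candidate[0] > best_score:               # select: keep it if it improved...
        best_score, best_path = best_candidate       # ...and recurse on the winner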

It starts with a basic sklearn dataset and code, and then we ask GPT4 to improve its r2_score. The starting point was the following code, base.py in the repo.

data.pkl is the California housing dataset, stored under the generic filename 'data.pkl' so as not to clue GPT4 in, via its training data, as to what the optimal algorithm should be.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd

# Fetch the data                                                                                                                 
data = pd.read_pickle("data.pkl")

# Split into features (X) and target (y)                                                                                         
X, y = data.data, data.target

# Split into training and testing sets                                                                                           
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the model                                                                                                          
model = LinearRegression()

# Train the model                                                                                                                
model.fit(X_train, y_train)

# Make predictions                                                                                                               
predictions = model.predict(X_test)

# Compute and display r^2 score                                                                                                  
print('r2_score:', r2_score(y_test, predictions))
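
For reference, data.pkl could have been produced along these lines. This is an assumption; the preparation step isn't shown in the README.

# Hypothetical preparation step: pickle the California housing Bunch under a
# neutral filename so the prompt doesn't reveal which dataset is in play.
from sklearn.datasets import fetch_california_housing
import pandas as pd

bunch = fetch_california_housing()   # Bunch exposing .data and .target
pd.to_pickle(bunch, "data.pkl")      # base.py reloads it via pd.read_pickle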

get_best_model.py is the code to start the recursive loop generating the graph of thoughts.

Here are the results:

Note: these insights are generated by GPT4 (see the source files). They get extracted and fed into each prompt as they're discovered, so only the last row in the list had all of the insights but one in its prompt.

| Insight | Initial File | New File | Initial Score (r2) | New Score (r2) |
| --- | --- | --- | --- | --- |
| Changing the model from LinearRegression to Ridge with alpha=1.0 and adding StandardScaler | base.py | base_n0.py | 0.575 | 0.576 |
| Changing the model from LinearRegression to Ridge with alpha=1.0, adding StandardScaler, and applying PolynomialFeatures with degree=2 | base.py | base_n1.py | 0.575 | 0.647 |
| Changing the model from LinearRegression to Ridge with alpha=10.0, adding StandardScaler, and applying PolynomialFeatures with degree=3 | base.py | base_n2.py | 0.575 | -14.131 |
| Changing the model from Ridge with alpha=1.0 to Lasso with alpha=0.1 | base_n1.py | base_n1_n0.py | 0.647 | 0.482 |
| Changing the model from Ridge with alpha=1.0 to ElasticNet with alpha=0.1 and l1_ratio=0.5 | base_n1.py | base_n1_n1.py | 0.647 | 0.515 |
| Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection | base_n1.py | base_n1_n2.py | 0.647 | 0.656 |
| Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection and using a pipeline for preprocessing | base_n1.py, base_n1_n2.py | base_n1_n2_n0.py | 0.656 | 0.656 |
| Changing the model from Ridge with alpha=1.0 to RidgeCV with automatic alpha selection, using a pipeline for preprocessing, and increasing the degree of PolynomialFeatures to 3 | base_n1.py, base_n1_n2_n0.py | base_n1_n2_n1.py | 0.656 | -15.415 |
| Changing the degree of PolynomialFeatures from 2 to 3 and using a pipeline for preprocessing | base_n1_n2.py | base_n1_n2_n2.py | 0.656 | -15.415 |
| Changing the model from RidgeCV with automatic alpha selection to LassoCV with automatic alpha selection | base_n1_n2_n0.py | base_n1_n2_n0_n0.py | 0.656 | 0.482 |
| Changing the model from RidgeCV with automatic alpha selection to RandomForestRegressor with 100 estimators | base_n1_n2_n0.py | base_n1_n2_n0_n1.py | 0.656 | 0.799 |
| Changing the model from RidgeCV with automatic alpha selection to GradientBoostingRegressor with n_estimators=200, learning_rate=0.1, and max_depth=2 | base_n1_n2_n0.py | base_n1_n2_n0_n2.py | 0.656 | 0.775 |
| Changing the model from RidgeCV with automatic alpha selection to RandomForestRegressor with GridSearchCV for hyperparameter tuning | base_n1_n2_n0.py | base_n1_n2_n0_n1_n0.py | 0.799 | 0.802 |
| Changing the model from RandomForestRegressor with 100 estimators to GradientBoostingRegressor with n_estimators=300, learning_rate=0.1, and max_depth=3 | base_n1_n2_n0_n1.py | base_n1_n2_n0_n1_n1.py | 0.799 | 0.817 |

You can find the source for these in the repo.
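
As noted above, discovered insights are carried forward into subsequent prompts. A minimal sketch of that bookkeeping, assuming a simple global list (the actual prompt templates are in the repo source):

insights = []

def build_prompt(current_code):
    # Prepend everything learned so far to the improvement request.
    learned = "\n".join(f"- {i}" for i in insights)
    return (
        "Insights discovered so far:\n" + learned + "\n\n"
        "Improve the r2_score of the following program:\n" + current_code
    )

# After each evaluation, record what changed and how the score moved, e.g.:
insights.append("Changing the model from LinearRegression to Ridge with alpha=1.0 "
                "and adding StandardScaler: 0.575 -> 0.576")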

--

There are a lot of optimisations you can do here, limited only by your imagination (and the 8k/32k context window). Some ideas are in the paper linked above, and some you'll find in the various places where this concept is discussed. Basic ideas include dupe checks, pruning, backtracking, and Monte Carlo search; a small sketch follows.
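
For instance, dupe checks and simple pruning can be layered onto the loop with a few lines. This is a sketch under the same assumptions as the loop above; the threshold value is illustrative.

seen = set()                   # dupe check: don't re-evaluate identical code
PRUNE_BELOW = -1.0             # pruning: abandon branches that score catastrophically

def worth_expanding(code, r2):
    digest = hash(code)
    if digest in seen:
        return False           # duplicate thought, skip it
    seen.add(digest)
    return r2 > PRUNE_BELOW    # drop pathological branches early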

Some basic insight tracking was added, as described above, which isn't exactly in the Tree of Thoughts paper. This also isn't strictly graph-like, as the insights carry globally. GPT4 tokens do start to add up after a while.

Another idea is appending a curated set of techniques to the prompt as suggestions for GPT4 to try. Impedance mismatch is not a problem, and these techniques can mostly be reused for any arbitrary ML problem.
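
A sketch of that idea, with an illustrative (not canonical) technique list:

# Illustrative technique library appended to every prompt; largely reusable
# across arbitrary ML problems.
TECHNIQUES = [
    "try tree ensembles such as RandomForestRegressor or GradientBoostingRegressor",
    "add feature scaling and/or PolynomialFeatures",
    "use cross-validated hyperparameter search (GridSearchCV, RandomizedSearchCV)",
]

def with_suggestions(prompt):
    return prompt + "\nTechniques you might try:\n" + "\n".join(f"- {t}" for t in TECHNIQUES)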

--

FAQ

  1. Wouldn't it be cheaper and easier to just do X?

    Sure, but then why not just make X your baseline? If AutoML or Optuna is your choice, you can start there, or feed them in as a library of selected techniques.

  2. Why did it take so long for GPT4 to try something other than linear models?

    I noticed that as well; it's an indication of the limits of GPT4's reasoning capabilities. Better use of the context window, by adding rules of thumb / heuristics, would help.

--

You might encounter some folks lower down in the stack who will call this 'prompt hacking', but for their benefit:

[image]
