Opening a new issue since Train Error #1296 has been closed.

Question

HJH0924 opened this issue 5 months ago · comments

#1296 (comment)
Where is this 20% you mentioned set? Upon inspecting the source code for "train" in api.rs, I discovered a parameter called "test_size" with a default value of 0.25. Is that what you are referring to?

I tried to reproduce your steps, but the outcome was unchanged, even after inserting ten rows in the third step:
3. Insert data into the pgml.commits_build table:
INSERT INTO pgml.commits_build VALUES
('{4,5,6}', false),
('{5,6,7}', true),
('{6,7,8}', false),
('{7,8,9}', true),
('{8,9,10}', false),
('{9,10,11}', true),
('{10,11,12}', false),
('{11,12,13}', true),
('{12,13,14}', false),
('{13,14,15}', true);

The complete replication steps are as follows:
1. sudo docker run --rm -it -v postgresml_data:/var/lib/postgresql -p 5434:5432 -p 8000:8000 ghcr.io/postgresml/postgresml:2.8.1 sudo -u postgresml psql -d postgresml
2. DROP EXTENSION pgml;
3. CREATE EXTENSION pgml;
4. DROP TABLE IF EXISTS pgml.commits_build CASCADE;
   CREATE TABLE pgml.commits_build (
       vector INTEGER[],
       result BOOL
   );
5. INSERT INTO pgml.commits_build VALUES
   ('{4,5,6}', false),
   ('{5,6,7}', true),
   ('{6,7,8}', false),
   ('{7,8,9}', true),
   ('{8,9,10}', false),
   ('{9,10,11}', true),
   ('{10,11,12}', false),
   ('{11,12,13}', true),
   ('{12,13,14}', false),
   ('{13,14,15}', true);
6. SELECT * FROM pgml.commits_build;
7. SELECT * FROM pgml.train(
       'commits:category:build',
       'classification',
       'pgml.commits_build',
       'result'
   );
8. SELECT * FROM pgml.train('commits:category:build', algorithm => 'svm');

Another odd point: if the problem is insufficient training data, as you suggested earlier, then the first "train" call should have raised an error. Instead, the first call completed without errors, and the error only appeared on the next "train" call.

Montana Low · Answer 1 · Fri Jan 26 2024 11:30:56 GMT+0800 (China Standard Time)

The training function, along with test_size, is documented here.
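
To make the role of test_size concrete, the call from step 7 could spell it out. This is only a sketch: it assumes test_size can be passed by name, the same way algorithm is passed in step 8, and it uses the 0.25 default quoted from api.rs, so it should behave like the original call.

```sql
-- Sketch: same call as step 7, with the test split written out explicitly.
-- Assumption: test_size is accepted as a named argument, like algorithm => 'svm'.
SELECT * FROM pgml.train(
    'commits:category:build',
    'classification',
    'pgml.commits_build',
    'result',
    test_size => 0.25  -- default per api.rs; with 10 rows, only a handful are held out for metrics
);
```

With a table this small, raising or lowering test_size only changes how few rows land in the test set; it does not make the dataset statistically meaningful.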

This error is not a training error; it's a deployment error. When the first training run has insufficient test data to compute metrics, it's still deployed, since there is no currently deployed model, but further runs can't be deployed because the deployed model has no statistics. This means your 3rd/4th/5th training runs will continue to hit the same issue. That issue is fixed by #1289, as mentioned in #1296. To clear this error in your local deployment, you'll need to deploy a model with valid statistics computed on a valid training/test data set. Deployments are documented here.
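
A minimal sketch of such a manual deployment, assuming the pgml.deploy function with its strategy and algorithm arguments as described in the deployment documentation. It presumes that a run with valid test metrics has already been trained for this project on an adequate dataset; the project and algorithm names are taken from the steps in the question.

```sql
-- Sketch: explicitly deploy the most recently trained svm model for the project.
-- Assumption: that run was trained on enough data to compute valid test metrics.
SELECT * FROM pgml.deploy(
    'commits:category:build',
    strategy => 'most_recent',
    algorithm => 'svm'
);
```

Whether 'most_recent' selects a suitable model depends on what has been trained so far, so the result of the call is worth checking before retrying pgml.train.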

Additionally, when reporting issues, please consider using a statistically valid dataset provided by pgml.load_dataset(), as the examples do, since the problem in machine learning is often in the data. For example: https://github.com/postgresml/postgresml/blob/master/pgml-extension/examples/binary_classification.sql
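
For orientation, the linked example starts roughly as follows. This is an abridged sketch written from memory of that file, so the dataset name, table name, and label column should be checked against the repository.

```sql
-- Load a bundled dataset into a table in the pgml schema.
SELECT pgml.load_dataset('breast_cancer');

-- Train a classifier on it; 'malignant' is the label column.
SELECT * FROM pgml.train(
    'Breast Cancer Detection',
    'classification',
    'pgml.breast_cancer',
    'malignant'
);
```

A dataset of this size leaves enough rows in the test split for metrics to be computed, which the ten-row table in the question does not.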