Train Error

Question

HJH0924 opened this issue 5 months ago · comments

The steps to reproduce the error are as follows:
1、Run the command to create the environment:
sudo docker run -it -v postgresml_data:/var/lib/postgresql -p 5434:5432 -p 8000:8000 ghcr.io/postgresml/postgresml:2.8.1 sudo -u postgresml psql -d postgresml

2、Execute the following SQL statements:
DROP TABLE IF EXISTS pgml.commits_build CASCADE;
CREATE TABLE pgml.commits_build (
vector Integer[],
result bool
);

3、Insert data into the pgml.commits_build table:
INSERT INTO pgml.commits_build VALUES
('{4,5,6}', false),
('{5,6,7}', true),
('{6,7,8}', false),
('{7,8,9}', true),
('{8,9,10}', false);

4、Run the pgml.train function:
SELECT * FROM pgml.train(
'commits:category:build',
'classification',
'pgml.commits_build',
'result'
);
This runs normally, and the linear algorithm is deployed.

5、Run pgml.train again with a different algorithm, which raises the following error:
-- ERROR: called Option::unwrap() on a None value
SELECT * FROM pgml.train('commits:category:build', algorithm => 'ridge');

Montana Low · Answer 1 · Tue Jan 23 2024 00:18:10 GMT+0800 (China Standard Time)

This is fixed by #1289. With a 5-row dataset, the default 20% split leaves only 1 row for testing, which is not enough to compute statistics against.
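
As a rough illustration of the arithmetic behind that statement, the query below computes how many rows a 20% test split would hold for a few table sizes. It is a sketch in plain PostgreSQL, not pgml's actual splitting code: the column names are hypothetical, and rounding down is an assumption (it makes no difference for these inputs, since each product is a whole number).

```sql
-- Hypothetical illustration only: rows left for testing under a fractional split.
-- Assumes the test set size is row count * fraction, rounded down.
SELECT
    n_rows,
    test_fraction,
    floor(n_rows * test_fraction)          AS test_rows,
    n_rows - floor(n_rows * test_fraction) AS train_rows
FROM (VALUES
    (5,  0.20),   -- the 5-row table from the original report
    (10, 0.20),
    (50, 0.20)
) AS t(n_rows, test_fraction);
```

For the 5-row table this gives 1 test row and 4 training rows, which matches the explanation above.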

HJH0924 · Answer 2 · Thu Jan 25 2024 18:18:30 GMT+0800 (China Standard Time)

Where is this 20% you mentioned set? Upon inspecting the source code for "train" in api.rs, I discovered a parameter called "test_size" with a default value of 0.25. Is that what you are referring to?
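
For reference, a minimal sketch of setting the split size explicitly rather than relying on the default. It assumes test_size can be passed as a named argument in the same way as algorithm in the failing call; the value 0.5 is purely illustrative. Whether this avoids the unwrap error on 2.8.1 is not confirmed anywhere in this thread.

```sql
-- Sketch only: pass the test split explicitly instead of using the default.
-- Assumption: test_size is accepted as a named argument, like algorithm.
-- 0.5 is an arbitrary illustrative value, not a recommendation.
SELECT * FROM pgml.train(
    'commits:category:build',
    algorithm => 'ridge',
    test_size => 0.5
);
```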

I repeated the steps with ten rows inserted in step 3 instead of five, but the outcome was unchanged.
3、Insert data into the pgml.commits_build table:
INSERT INTO pgml.commits_build VALUES
('{4,5,6}', false),
('{5,6,7}', true),
('{6,7,8}', false),
('{7,8,9}', true),
('{8,9,10}', false),
('{9,10,11}', true),
('{10,11,12}', false),
('{11,12,13}', true),
('{12,13,14}', false),
('{13,14,15}', true);

The complete replication steps are as follows:
1、sudo docker run --rm -it -v postgresml_data:/var/lib/postgresql -p 5434:5432 -p 8000:8000 ghcr.io/postgresml/postgresml:2.8.1 sudo -u postgresml psql -d postgresml
2、drop extension pgml;
3、create extension pgml;
4、DROP TABLE IF EXISTS pgml.commits_build CASCADE;
CREATE TABLE pgml.commits_build (
vector Integer[],
result bool
);
5、INSERT INTO pgml.commits_build VALUES
('{4,5,6}', false),
('{5,6,7}', true),
('{6,7,8}', false),
('{7,8,9}', true),
('{8,9,10}', false),
('{9,10,11}', true),
('{10,11,12}', false),
('{11,12,13}', true),
('{12,13,14}', false),
('{13,14,15}', true);
6、select * from pgml.commits_build;
7、SELECT * FROM pgml.train(
'commits:category:build',
'classification',
'pgml.commits_build',
'result'
);
8、SELECT * FROM pgml.train('commits:category:build', algorithm => 'svm');

Another peculiar point: if the issue were caused by insufficient training data, as you suggested, the first "train" call should have raised an error. Instead, the first call completed without errors, and the error appeared only on the next "train" call.