donglixp / coarse2fine


To train a model on a different SQL table

Sandy26 opened this issue

Hi Mr. Li,

I have downloaded the entire code package. I would like to train a model for my own SQL table. From what I understand, I need to:

  1. Annotate my table using annotate.py.
  2. Then use preprocess.py, train.py, and evaluate.py.

But I am confused about the input data to annotate.py. In what format should I give my SQL table to annotate.py? And once that part runs through, should the rest of the scripts technically run OK? Or do I need to make any other format changes?

Any guidance is greatly appreciated! Thank you!

Hi @Sandy26,

The WikiSQL data format can be found at https://github.com/salesforce/WikiSQL .
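Roughly, the two kinds of files look like this, written here as Python dicts (the real files are JSON lines, one object per line, and the header/row values below are made up for illustration):

```python
# One entry from a tables file (e.g. data/train.tables.jsonl):
table = {
    "id": "1-10753917-1",
    "header": ["Season", "Driver", "Team", "Wins"],   # illustrative columns
    "types": ["real", "text", "text", "real"],
    "rows": [[1950, "Nino Farina", "Alfa Romeo", 3]],
}

# One entry from a question file (e.g. data/train.jsonl). "sel" and the
# first slot of each condition are indices into "header"; "agg" indexes
# ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG'] and the middle slot of a
# condition indexes ['=', '>', '<', 'OP']:
example = {
    "phase": 1,
    "table_id": "1-10753917-1",
    "question": "How many wins did the team have?",
    "sql": {"sel": 3, "conds": [[2, 0, "Alfa Romeo"]], "agg": 3},
}
```

Thanks!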

@Sandy26 I've a notebook that runs through the steps of adding and annotating a new table and query in https://github.com/stprior/coarse2fine/blob/predict/wikisql/Exploration.ipynb . Note that I'm just getting familiar with this, so it may well have errors.

@stprior Solved in #4.

@stprior Thank you so very much. This is very helpful! I will let you know how it goes :) because I believe I need to add one more step and actually train a model on my SQL tables. Thank you once again for providing a solid first step in that direction!
@donglixp Thank you for the piece of code mentioned above.

@stprior: Just want to make sure I understand this correctly. In your notebook (https://github.com/stprior/coarse2fine/blob/predict/wikisql/Exploration.ipynb):

"Set up a question. The SQL field describes the expected query when training or testing. In this notebook it is not used, but it should still make sense for the table (e.g. conds should not specify a column number which is not in the table)."

```python
question = {"phase": 1,
            "table_id": "1-10753917-1",
            "question": "How many wins did the ferrari team have after 1950 and before 1960?",
            "sql": {"sel": 1, "conds": [[2, 0, "Williams"], [8, 0, "2"]], "agg": 3}}
```

So ideally, in a "test" question I should not need the "sql" part, correct? But I should give some value (with any columns in the table) just so that the code doesn't break?
Thank you,
Sandy

@Sandy26 yes, if you are just using the pretrained model to generate SQL; I've had good results for unseen tables and queries just doing that. If your new tables or queries are substantially different from those in the WikiSQL data set, you would get better results by providing training data and training a new model or modifying the existing one. Also note there was an error in the notebook, which I've just fixed, and I intend to use the code from #4, which should improve results too.

Hi Stephen,
Your Python notebook was very helpful. I was able to follow the steps to add my own table and question.
But the code works or crashes depending on my question. For some questions it works, that is, it gives me a query. But for some questions it crashes with this error:

```
Traceback (most recent call last):
  File "test_mytable.py", line 99, in <module>
    result_list = translator.translate(batch)
  File "/home/sp52650/stprior_code/coarse2fine/wikisql/table/Translator.py", line 89, in translate
    op_batch_list, self.fields['cond_op'].vocab.stoi[table.IO.PAD_WORD]).t())
  File "/home/sp52650/stprior_code/coarse2fine/wikisql/table/Utils.py", line 46, in add_pad
    return torch.LongTensor(r_list).cuda()
RuntimeError: given sequence has an invalid size of dimension 2: 0
```

And even for the questions that do produce a query, it has incorrect column numbers. I think that is telling me that the current model is not working for me and I need to train a new one with my own table. Does that sound correct? Also, any thoughts on why I might get the above error for some questions and not for others? As of now, I have failed to find a pattern in the failing and succeeding questions.

Thank you once again,
Sandy

Hi @Sandy26,

The exception "RuntimeError: given sequence has an invalid size of dimension 2: 0" is caused by that all the queries in the current batch do not have WHERE clauses. So the "r_list" contains empty lists. A quick fix would be changing the https://github.com/donglixp/coarse2fine/blob/master/wikisql/table/Utils.py#L41 into:

```python
max_len = max(1, max((len(b) for b in b_list)))
```

which would add padding tokens to "r_list".
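
For context, here is a sketch of what the patched helper would look like (reconstructed from the traceback, not copied from the repo, so treat the exact body as an assumption):

```python
import torch

def add_pad(b_list, pad_index):
    # max(1, ...) is the fix: if every example in the batch has an empty
    # condition list, max_len would otherwise be 0 and torch.LongTensor
    # would raise "given sequence has an invalid size of dimension 2: 0".
    max_len = max(1, max((len(b) for b in b_list)))
    # Pad every per-example list to the same length, then batch.
    r_list = [b + [pad_index] * (max_len - len(b)) for b in b_list]
    return torch.LongTensor(r_list).cuda()
```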

Thanks!

Hi Li,
Thank you for that suggestion. It works, in that at least the code now runs to the end. What I realised is that if I ask a grammatically incorrect/incomplete question (a question that might be asked by someone for whom English is not the first language), my current model cannot find the WHERE clause, hence the error.
For example, if the question is:
How many low priority items? (no verb phrase) => the model is unable to find "where priority=low"
But if the question is:
How many items have low priority? => I get where col0=low

The last remaining problem is that col0 is not the "priority" column. But I think that should get better once I train the model on my own data, correct?

Just wanted to let you know about my findings.

Thank you,
Shruti

Hi @Sandy26,

It would be better to train the model on your dataset if the questions follow different patterns. Thanks!

Hi Li,
Since I will be feeding the code a new table, I tried to find where the column names of the table are mapped to numerical values like 0, 1, 2, ... Or does the training data take care of it, i.e. when we give the "sql" part in the train.jsonl file?

Thank you,
Shruti

Hi Li, Stephen,

Any thoughts on how difficult it would be to add "DISTINCT" and "LIMIT" functionality to this model? Say the question is:
"Tell me any three types of fruits in stock."
Then ideally my query would be:
SELECT DISTINCT Fruits FROM Table WHERE In_stock=1 LIMIT 3

Just curious!
Thank you,
Shruti

Hi Shruti,

Adding new clauses like DISTINCT and LIMIT would be possible, but difficult. It might be possible to treat them as a new aggregate category and layout, but the WikiSQL lib code would need to be changed to include the new operations, and the coarse2fine code would need to be rewritten too. You would also need plenty of training examples, and you would need to train the model from scratch because the pre-trained model would not be usable.
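
To give a sense of the scope, WikiSQL's query representation hard-codes the operator vocabularies; as far as I recall they look like the lists below (verify against lib/query.py in salesforce/WikiSQL before relying on this):

```python
# Operator vocabularies as defined in WikiSQL's lib/query.py (from memory):
agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
cond_ops = ['=', '>', '<', 'OP']

# Treating DISTINCT (or LIMIT) as one more aggregate-like category would
# mean extending a list like agg_ops, and then changing the annotation
# code, every model head that classifies over it, and the query-to-string
# logic to match:
agg_ops_extended = agg_ops + ['DISTINCT']
```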

Training to target the existing supported SQL syntax would be easier: training could start from the existing pre-trained model and less training data would be required. The column numbers just come from the order in which the column descriptions appear in the header JSON entries of the table files.
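
A tiny illustration, with a hypothetical table entry:

```python
# Column numbers are just positions in the table's "header" list.
table = {"id": "1-1-1",                       # hypothetical table
         "header": ["Item", "Priority", "In_stock"],
         "types": ["text", "text", "real"],
         "rows": [["widget", "low", 1]]}

# So a condition [1, 0, "low"] reads: column 1 ("Priority") = "low".
col_of = {name: i for i, name in enumerate(table["header"])}
print(col_of["Priority"])  # -> 1
```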
Regards,
Stephen

Hi Stephen,
Thank you for sharing your thoughts. I don't think I really understood when you said "Training to target the existing supported SQL syntax would be easier: training could start from the existing pre-trained model and less training data would be required".
Do you mean that I can use the existing pretrained model and somehow train it to include some examples from my table? So should I just append my annotated questions to train.jsonl and train the model? Or is there a better way to do it?
Thank you,
Shruti

Hi Stephen,
Just appending to my previous question: to start with the pretrained model and add my own training data to it, is it about line 177 in train.py?

"Load checkpoint if we resume from a previous training."?

But I don't know how to use it (I have the pretrained model and my new annotated questions and table, but I am not sure how to proceed).

many thanks,
Shruti

I haven't actually tried training the model yet, but I plan to in the next week or so; I'll put up some notes when I do. If you add data to the existing dev, train, and test data files and run annotate.py on them, you shouldn't need to make code changes; you should be able to follow the top-level run.sh script. I don't know how many training examples you would need to make a difference to the model, though. Alternatively, you could train using mostly or only your own examples, but the quality of the model for more general queries would probably drop then.
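
For reference, the train.py comment you quote points at the usual PyTorch checkpoint-resume pattern; a minimal self-contained sketch is below (the checkpoint keys here are assumptions, not necessarily the repo's exact format):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # stand-in for the real model
optim = torch.optim.SGD(model.parameters(), lr=0.1)

# Save once so the example is self-contained.
torch.save({"model": model.state_dict(),
            "optim": optim.state_dict(),
            "epoch": 3}, "checkpoint.pt")

# Resume: restore weights and optimizer state, then continue training
# (on the combined old + new data) from the next epoch.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optim.load_state_dict(ckpt["optim"])
start_epoch = ckpt["epoch"] + 1
```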

Hi @Sandy26,

Supporting these queries needs some code modifications and new annotated examples. Then the new model can be trained from scratch.

Hi Stephen, Li,

I tried to train my model using my ~200 queries + 300 from WikiSQL. I tried to keep them evenly matched so that my data gets a bigger say. As expected, the accuracy is very low, but it at least does better than the pretrained model as-is on my queries. While annotating queries, though, I realised that for training queries there always has to be a WHERE clause? As in, conds cannot be like conds: [[, , ""]]. So how can one train very simple queries like:

What is the total number of transactions?
SQL: SELECT COUNT(transactions) FROM table X

Or can I tweak annotate.py so that it gives conds: [] a harmless value?

Thank you,
Shruti

Hi Shruti, the WikiSQL training data includes examples like this which have "conds": [].
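
For example, an entry along these lines (values illustrative) is already valid training data:

```python
# A training entry with no WHERE clause: "conds" is simply an empty list.
# agg 3 is COUNT in WikiSQL's ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG'].
example = {
    "phase": 1,
    "table_id": "1-1-1",
    "question": "What is the total number of transactions?",
    "sql": {"sel": 0, "conds": [], "agg": 3},  # SELECT COUNT(col0) FROM table
}
```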

Hi Li, Stephen,

I have created a new model for my own dataset and it does give me reasonable results. Thank you for all your help so far.
Now I really need to add GROUP BY, ORDER BY, and LIMIT functionality. I have been going around in circles about what really needs to be done, but here are a few specific questions:

I plan to create sample data like:

```json
{"phase": 1, "table_id": "1-1-1", "question": "Find baseball tickets by cities?", "sql": {"sel": [1, 2], "conds": [], "agg": [0, 3], "group": [1], "order": [], "limit": []}}
```

so the query should be:

```sql
SELECT City, COUNT(Tickets) FROM table GROUP BY City
```

Another example can be:

```json
{"phase": 1, "table_id": "1-1-1", "question": "Find top 5 baseball tickets by cost?", "sql": {"sel": [3, 2], "conds": [], "agg": [0, 0], "group": [], "order": [1], "limit": [5]}}
```

  1. How do I change sel from an integer to a list, and similarly agg from an integer to a list? With that change, can I still use agg_classifier and sel_match (ModelConstructor.py lines 100-104), or will those need to change?

  2. As per my understanding, the "lay" field includes the list of operands in the conditions, and hence keeps track of the number of conditions. Do I need a separate lay for each of "group", "order" and "limit", or should conditions, group, order and limit all be in one "lay" field?

  3. Should I use something like agg_classifier or matchscorer for modelling "group", "order" and "limit"?

I understand that these are quite involved questions, but any help/ideas about the starting point would be greatly appreciated.

Thank you very much,
Shruti