donglixp / coarse2fine


To train a model on a different SQL table

Sandy26 opened this issue

Hi Mr. Li,

I have downloaded the entire code package. I would like to train a model for my own SQL table. From what I understand, I need to:

  1. Annotate my table using annotate.py.
  2. Then use preprocess.py, train.py, and evaluate.py.

But I am confused about the input data to annotate.py. In what format should I give my SQL table to annotate.py? And once that part runs through, should the rest of the scripts technically run OK? Or do I need to make any other format changes?

Any guidance is greatly appreciated! Thank you!

Hi @Sandy26,

The WikiSQL data format can be found at https://github.com/salesforce/WikiSQL .
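Roughly, the two kinds of files look like this, written here as Python dicts (the real files are JSON lines, one object per line, and the header/row values below are made up for illustration):

```python
# One entry from a tables file (e.g. data/train.tables.jsonl):
table = {
    "id": "1-10753917-1",
    "header": ["Season", "Driver", "Team", "Wins"],   # illustrative columns
    "types": ["real", "text", "text", "real"],
    "rows": [[1950, "Nino Farina", "Alfa Romeo", 3]],
}

# One entry from a question file (e.g. data/train.jsonl). "sel" and the
# first slot of each condition are indices into "header"; "agg" indexes
# ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG'] and the middle slot of a
# condition indexes ['=', '>', '<', 'OP']:
example = {
    "phase": 1,
    "table_id": "1-10753917-1",
    "question": "How many wins did the team have?",
    "sql": {"sel": 3, "conds": [[2, 0, "Alfa Romeo"]], "agg": 3},
}
```

Thanks!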

@Sandy26 I've a notebook that runs through the steps of adding and annotating a new table and query in https://github.com/stprior/coarse2fine/blob/predict/wikisql/Exploration.ipynb . Note that I'm just getting familiar with this, so it may well have errors.

@stprior Solved in #4.

@stprior Thank you so very much. This is very helpful! I will let you know how it goes :) because I believe I need to add one more step and actually train a model on my SQL tables. Thank you once again for providing a solid first step in that direction!
@donglixp Thank you for the piece of code mentioned above.

@stprior: Just want to make sure I understand this correctly. In your notebook (https://github.com/stprior/coarse2fine/blob/predict/wikisql/Exploration.ipynb):

"Set up a question. The SQL field describes the expected query when training or testing. In this notebook it is not used, but it should still make sense for the table (e.g. conds should not specify a column number which is not in the table)."

```python
question = {"phase": 1,
            "table_id": "1-10753917-1",
            "question": "How many wins did the ferrari team have after 1950 and before 1960?",
            "sql": {"sel": 1, "conds": [[2, 0, "Williams"], [8, 0, "2"]], "agg": 3}}
```

So ideally, in a "test" question I should not need the "sql" part, correct? But I should give some value (with any columns in the table) just so that the code doesn't break?
Thank you,
Sandy

@Sandy26 yes, if you are just using the pretrained model to generate SQL; I've had good results for unseen tables and queries just doing that. If your new tables or queries are substantially different from those in the WikiSQL data set, you would get better results by providing training data and training a new model or modifying the existing one. Also note there was an error in the notebook, which I've just fixed, and I intend to use the code from #4, which should improve results too.

Hi Stephen,
Your Python notebook was very helpful. I was able to follow the steps to add my own table and question.
But the code works or crashes depending on my question. For some questions it works, that is, it gives me a query. But for some questions it crashes with this error:

```
Traceback (most recent call last):
  File "test_mytable.py", line 99, in <module>
    result_list = translator.translate(batch)
  File "/home/sp52650/stprior_code/coarse2fine/wikisql/table/Translator.py", line 89, in translate
    op_batch_list, self.fields['cond_op'].vocab.stoi[table.IO.PAD_WORD]).t())
  File "/home/sp52650/stprior_code/coarse2fine/wikisql/table/Utils.py", line 46, in add_pad
    return torch.LongTensor(r_list).cuda()
RuntimeError: given sequence has an invalid size of dimension 2: 0
```

And even for the questions that do produce a query, it has incorrect column numbers. I think that is telling me that the current model is not working for me and I need to train a new one with my own table. Does that sound correct? Also, any thoughts on why I might get the above error for some questions and not for others? As of now, I have failed to find a pattern in the failing and succeeding questions.

Thank you once again,
Sandy

Hi @Sandy26,

The exception "RuntimeError: given sequence has an invalid size of dimension 2: 0" is caused by that all the queries in the current batch do not have WHERE clauses. So the "r_list" contains empty lists. A quick fix would be changing the https://github.com/donglixp/coarse2fine/blob/master/wikisql/table/Utils.py#L41 into:

```python
max_len = max(1, max((len(b) for b in b_list)))
```

which would add padding tokens to "r_list".
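
For context, here is a sketch of what the patched helper would look like (reconstructed from the traceback, not copied from the repo, so treat the exact body as an assumption):

```python
import torch

def add_pad(b_list, pad_index):
    # max(1, ...) is the fix: if every example in the batch has an empty
    # condition list, max_len would otherwise be 0 and torch.LongTensor
    # would raise "given sequence has an invalid size of dimension 2: 0".
    max_len = max(1, max((len(b) for b in b_list)))
    # Pad every per-example list to the same length, then batch.
    r_list = [b + [pad_index] * (max_len - len(b)) for b in b_list]
    return torch.LongTensor(r_list).cuda()
```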

Thanks!

Hi Li,
Thank you for that suggestion. It works, in that at least the code now runs to the end. What I realised is that if I ask a grammatically incorrect/incomplete question (a question that might be asked by someone for whom English is not the first language), my current model cannot find the WHERE clause, hence the error.
For example, if the question is:
How many low priority items? (no verb phrase) => the model is unable to find "where priority=low"
But if the question is:
How many items have low priority? => I get where col0=low

The last remaining problem is that col0 is not the "priority" column. But I think that should get better once I train the model on my own data, correct?

Just wanted to let you know about my findings.

Thank you,
Shruti

Hi @Sandy26,

It would be better to train the model on your dataset if the questions follow different patterns. Thanks!

Hi Li,
Since I will be feeding the code a new table, I tried to find where the column names of the table are mapped to numerical values like 0, 1, 2, ... Or does the training data take care of it, i.e. when we give the "sql" part in the train.jsonl file?

Thank you,
Shruti

Hi Li, Stephen,

Any thoughts on how difficult it would be to add "DISTINCT" and "LIMIT" functionality to this model? Say the question is:
"Tell me any three types of fruits in stock."
Then ideally my query would be:
SELECT DISTINCT Fruits FROM Table WHERE In_stock=1 LIMIT 3

Just curious!
Thank you,
Shruti

Hi Shruti,

Adding new clauses like DISTINCT and LIMIT would be possible, but difficult. It might be possible to treat them as a new aggregate category and layout, but the WikiSQL lib code would need to be changed to include the new operations, and the coarse2fine code would need to be rewritten too. You would also need plenty of training examples, and you would need to train the model from scratch because the pre-trained model would not be usable.
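
To give a sense of the scope, WikiSQL's query representation hard-codes the operator vocabularies; as far as I recall they look like the lists below (verify against lib/query.py in salesforce/WikiSQL before relying on this):

```python
# Operator vocabularies as defined in WikiSQL's lib/query.py (from memory):
agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
cond_ops = ['=', '>', '<', 'OP']

# Treating DISTINCT (or LIMIT) as one more aggregate-like category would
# mean extending a list like agg_ops, and then changing the annotation
# code, every model head that classifies over it, and the query-to-string
# logic to match:
agg_ops_extended = agg_ops + ['DISTINCT']
```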

Training to target the existing supported SQL syntax would be easier: training could start from the existing pre-trained model and less training data would be required. The column numbers just come from the order in which the column descriptions appear in the header JSON entries of the table files.
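
A tiny illustration, with a hypothetical table entry:

```python
# Column numbers are just positions in the table's "header" list.
table = {"id": "1-1-1",                       # hypothetical table
         "header": ["Item", "Priority", "In_stock"],
         "types": ["text", "text", "real"],
         "rows": [["widget", "low", 1]]}

# So a condition [1, 0, "low"] reads: column 1 ("Priority") = "low".
col_of = {name: i for i, name in enumerate(table["header"])}
print(col_of["Priority"])  # -> 1
```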
Regards,
Stephen

Hi Stephen,
Thank you for sharing your thoughts. I don't think I really understood when you said "Training to target the existing supported SQL syntax would be easier: training could start from the existing pre-trained model and less training data would be required".
Do you mean that I can use the existing pretrained model and somehow train it to include some examples from my table? So should I just append my annotated questions to train.jsonl and train the model? Or is there a better way to do it?
Thank you,
Shruti

Hi Stephen,
Just appending to my previous question: to start with the pretrained model and add my own training data to it, is it about line 177 in train.py?

"Load checkpoint if we resume from a previous training."?

But I don't know how to use it (I have the pretrained model and my new annotated questions and table, but I am not sure how to proceed).

many thanks,
Shruti

I haven't actually tried training the model yet, but I plan to in the next week or so; I'll put up some notes when I do. If you add data to the existing dev, train, and test data files and run annotate.py on them, you shouldn't need to make code changes; you should be able to follow the top-level run.sh script. I don't know how many training examples you would need to make a difference to the model, though. Alternatively, you could train using mostly or only your own examples, but the quality of the model for more general queries would probably drop then.
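
For reference, the train.py comment you quote points at the usual PyTorch checkpoint-resume pattern; a minimal self-contained sketch is below (the checkpoint keys here are assumptions, not necessarily the repo's exact format):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # stand-in for the real model
optim = torch.optim.SGD(model.parameters(), lr=0.1)

# Save once so the example is self-contained.
torch.save({"model": model.state_dict(),
            "optim": optim.state_dict(),
            "epoch": 3}, "checkpoint.pt")

# Resume: restore weights and optimizer state, then continue training
# (on the combined old + new data) from the next epoch.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optim.load_state_dict(ckpt["optim"])
start_epoch = ckpt["epoch"] + 1
```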

Hi @Sandy26,

Supporting these queries needs some code modifications and new annotated examples. Then the new model can be trained from scratch.

Hi Stephen, Li,

I tried to train my model using my ~200 queries + 300 from WikiSQL. I tried to keep them evenly matched so that my data gets a bigger say. As expected, the accuracy is very low, but it at least does better than the pretrained model as-is on my queries. While annotating queries, though, I realised that for training queries there always has to be a WHERE clause? As in, conds cannot be like conds: [[, , ""]]. So how can one train very simple queries like:

What is the total number of transactions?
SQL: SELECT COUNT(transactions) FROM table X

Or can I tweak annotate.py so that it gives conds: [] a harmless value?

Thank you,
Shruti

Hi Shruti, the WikiSQL training data includes examples like this which have "conds": [].
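
For example, an entry along these lines (values illustrative) is already valid training data:

```python
# A training entry with no WHERE clause: "conds" is simply an empty list.
# agg 3 is COUNT in WikiSQL's ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG'].
example = {
    "phase": 1,
    "table_id": "1-1-1",
    "question": "What is the total number of transactions?",
    "sql": {"sel": 0, "conds": [], "agg": 3},  # SELECT COUNT(col0) FROM table
}
```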

Hi Li, Stephen,

I have created a new model for my own dataset and it does give me reasonable results. Thank you for all your help so far.
Now I really need to add GROUP BY, ORDER BY, and LIMIT functionality. I have been going around in circles about what really needs to be done, but here are a few specific questions:

I plan to create sample data like:

```json
{"phase": 1, "table_id": "1-1-1", "question": "Find baseball tickets by cities?", "sql": {"sel": [1, 2], "conds": [], "agg": [0, 3], "group": [1], "order": [], "limit": []}}
```

so the query should be:

```sql
SELECT City, COUNT(Tickets) FROM table GROUP BY City
```

Another example can be:

```json
{"phase": 1, "table_id": "1-1-1", "question": "Find top 5 baseball tickets by cost?", "sql": {"sel": [3, 2], "conds": [], "agg": [0, 0], "group": [], "order": [1], "limit": [5]}}
```

  1. How do I change sel from an integer to a list, and similarly agg from an integer to a list? With that change, can I still use agg_classifier and sel_match (ModelConstructor.py lines 100-104), or will those need to change?

  2. As per my understanding, the "lay" field includes the list of operands in the conditions, and hence keeps track of the number of conditions. Do I need a separate lay for each of "group", "order" and "limit", or should conditions, group, order and limit all be in one "lay" field?

  3. Should I use something like agg_classifier or matchscorer for modelling "group", "order" and "limit"?

I understand that these are quite involved questions, but any help/ideas about the starting point would be greatly appreciated.

Thank you very much,
Shruti