Training data format for generating scenario-based MCQs

Question

shrey10926 opened this issue 2 years ago · comments

I am following your repo to fine-tune the model to generate scenario-based MCQs from text extracted from PDFs. The extracted text is unstructured and stored in a .txt file. I am unsure what the training data format should look like and would appreciate some guidance. This is the expected output format:

Scenario: A driver checks "Yes" to "Neck or back problems" and "Fainting or passing out" on the Driver Health History. He indicated he sustained a back injury 2 years ago. He takes duloxetine 40 mg/day and 12 over-the-counter ibuprofen each day for lumbar degenerative disc disease.
Question: What should a medical examiner be most concerned about?
Options:
a) Nerve root compression on lumbar MRI or myelography.
b) Nystagmus on Hallpike vestibular provocative tests.
c) Orthostatic hypotension and a positive hemoccult test.
d) The renal side effects of both medications

https://6b.eleuther.ai/ hosts a GPT-J playground: if you give it 2-3 prompts in the format above, it generates a new scenario-based MCQ. I want to fine-tune the model so that, given a couple of prompts like the ones above, it generates a new scenario-based MCQ.

Any help/guidance is appreciated!
Thank you

samyakai · Answer 1 · Sat Jun 04 2022 11:54:39 GMT+0800 (China Standard Time)

The training data format should match the input/prompt format you give the model at inference time. So if you want the output to look like the example above, each data point in the CSV file should use that same format: a Scenario, a Question, and Options.
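As a rough illustration of that advice, here is a minimal sketch that renders each data point in the Scenario/Question/Options layout and writes one data point per CSV row. The record field names, the helper functions, and the single `text` column are assumptions made for illustration; the thread does not say which column name the repo's training script expects, so check that before using it. `build_prompt` shows the same layout reused as a few-shot prompt that ends with an open "Scenario:" for the model to complete.

```python
import csv
import io

# Hypothetical record layout; these field names are illustrative only.
EXAMPLES = [
    {
        "scenario": (
            'A driver checks "Yes" to "Neck or back problems" and '
            '"Fainting or passing out" on the Driver Health History. '
            "He indicated he sustained a back injury 2 years ago. "
            "He takes duloxetine 40 mg/day and 12 over-the-counter ibuprofen "
            "each day for lumbar degenerative disc disease."
        ),
        "question": "What should a medical examiner be most concerned about?",
        "options": [
            "Nerve root compression on lumbar MRI or myelography.",
            "Nystagmus on Hallpike vestibular provocative tests.",
            "Orthostatic hypotension and a positive hemoccult test.",
            "The renal side effects of both medications",
        ],
    },
]


def format_example(record):
    """Render one record in the Scenario/Question/Options layout from the post."""
    lines = [
        f"Scenario: {record['scenario']}",
        f"Question: {record['question']}",
        "Options:",
    ]
    for letter, option in zip("abcd", record["options"]):
        lines.append(f"{letter}) {option}")
    return "\n".join(lines)


def write_training_csv(records, handle, column="text"):
    """Write one formatted example per row; the column name is an assumption."""
    writer = csv.writer(handle)
    writer.writerow([column])
    for record in records:
        writer.writerow([format_example(record)])


def build_prompt(records):
    """Join formatted examples and end with an open 'Scenario:' line."""
    shots = "\n\n".join(format_example(r) for r in records)
    return shots + "\n\nScenario:"


# Write to an in-memory buffer here; pass an open file handle to write to disk.
buffer = io.StringIO()
write_training_csv(EXAMPLES, buffer)
csv_text = buffer.getvalue()
prompt = build_prompt(EXAMPLES)
```

To write an actual file, open it with `newline=""` (as the `csv` module documentation recommends) and pass the handle to `write_training_csv`. The `csv` writer quotes the multi-line fields, so each MCQ stays in a single cell.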

Shrey Jain · Answer 2 · Sun Jun 05 2022 19:40:22 GMT+0800 (China Standard Time)

Yeah this works. Thanks a lot bud!