What are the requirements for the input text?

indrajithi opened this issue 6 years ago · comments

Indrajith Indraprastham commented 6 years ago

To generate new questions, the following command is used, where config-trans specifies the input text:

$> th translate.lua -model model/<model file name> -config config-trans

1) What are the requirements for the input text (regarding stop words, or anything else)?

For the sentence-level model:
2) What qualifies as a good sentence for generating a question?

For the paragraph-level model:
3) On what basis should I split the paragraph into sentences?

For the sentence-level model, I tried to generate questions by splitting the input text into sentences on "." and giving the path of the resulting text file in the src field of config-trans.

For the paragraph-level model, I did the same "."-based splitting for the src file, and for the par file I repeated the whole paragraph once for each sentence in src.

Input text:
One of the most basic techniques of molecular biology to study protein function is molecular cloning. In this technique, DNA coding for a protein of interest is cloned using polymerase chain reaction (PCR), and/or restriction enzymes into a plasmid ( expression vector). A vector has 3 distinctive features: an origin of replication, a multiple cloning site (MCS), and a selective marker usually antibiotic resistance. Located upstream of the multiple cloning site are the promoter regions and the transcription start site which regulate the expression of cloned gene.

src
One of the most basic techniques of molecular biology to study protein function is molecular cloning.
In this technique, DNA coding for a protein of interest is cloned using polymerase chain reaction (PCR), and/or restriction enzymes into a plasmid ( expression vector).
A vector has 3 distinctive features: an origin of replication, a multiple cloning site (MCS), and a selective marker usually antibiotic resistance.
Located upstream of the multiple cloning site are the promoter regions and the transcription start site which regulate the expression of cloned gene.
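
The preparation described above can be sketched in Python as follows. This is only a sketch of the steps as described in this post, not the repository's own preprocessing: the function names split_sentences and write_inputs and the output file names are hypothetical, and the split is a naive one after every period that is followed by whitespace.

```python
import re
from pathlib import Path


def split_sentences(paragraph):
    """Naively split a paragraph after each period that is followed by whitespace."""
    parts = re.split(r"(?<=\.)\s+", paragraph.strip())
    return [p for p in parts if p]


def write_inputs(paragraph, src_path, par_path):
    """Write one sentence per line to src_path and repeat the whole paragraph
    on the matching line of par_path (layout assumed from the description above)."""
    sentences = split_sentences(paragraph)
    flat_paragraph = " ".join(paragraph.split())  # keep the paragraph on one line
    Path(src_path).write_text("\n".join(sentences) + "\n", encoding="utf-8")
    Path(par_path).write_text(
        "\n".join(flat_paragraph for _ in sentences) + "\n", encoding="utf-8"
    )
    return sentences
```

Calling write_inputs(text, "src.txt", "par.txt") would give a src file with one sentence per line and a par file with the same number of lines. Note that this rule also splits after abbreviations such as "e.g.", so it is not a proper sentence tokenizer.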

Where am I going wrong?

Note: from the paper:
DirectIn is an intuitive yet meaningful baseline in which the longest sub-sentence of the sentence is directly taken as the predicted question. To split the sentence into sub-sentences, we use a set of splitters, i.e., {“?”, “!”, “,”, “.”, “;”}.
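
As a rough illustration of the DirectIn baseline quoted above, here is a minimal Python sketch. The function name direct_in is hypothetical, and measuring "longest" in whitespace-separated tokens is an assumption, since the quoted passage does not say how length is measured.

```python
import re

SPLITTERS = "?!,.;"  # the splitter set quoted from the paper


def direct_in(sentence):
    """Return the longest sub-sentence after splitting on the splitter characters."""
    pieces = re.split("[" + re.escape(SPLITTERS) + "]", sentence)
    pieces = [p.strip() for p in pieces if p.strip()]
    # Length is counted in whitespace-separated tokens (an assumption);
    # on a tie, the earliest sub-sentence is returned.
    return max(pieces, key=lambda p: len(p.split()), default="")
```

So this baseline splits within a sentence, which is a different operation from splitting a paragraph into sentences on "." only.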

Roshan Sridhar · Answer 1 · Thu Dec 13 2018 07:52:41 GMT+0800 (China Standard Time)

Hi, were you able to get around this?
I was wondering whether we could find out how they converted the 'nqg/raw/' files to 'nqg/processed'; then we would know how the input should be prepared.

Sundeep Pidugu · Answer 2 · Fri Apr 12 2019 15:10:24 GMT+0800 (China Standard Time)

What's the format of the text file that needs to be substituted in the paragraph/preprocess_embedding.sh file?

Can I use this file? glove.840B.300d.txt

Or is there a way to generate an embedding text file on my own?

Indrajith Indraprastham · Answer 3 · Fri Apr 12 2019 15:32:02 GMT+0800 (China Standard Time)

What's the format of the text file that needs to be substituted in the paragraph/preprocess_embedding.sh file?

Can I use this file? glove.840B.300d.txt

Or is there a way to generate an embedding text file on my own?

The path in --embedding ../../archive/embeddings/glove.840B.300d.txt needs to be replaced with the location of your copy of the pre-trained word-vector file glove.840B.300d.txt, which can be downloaded from here. You can train your own word vectors, but the pre-trained GloVe vectors were trained on the Common Crawl dataset, which I think is pretty good.
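
On the format question: a GloVe text file has one token per line, followed by its space-separated vector components (300 of them for glove.840B.300d.txt). The following Python sketch checks that a file of your own follows that layout before you point the script at it. The helper names parse_embedding_line and check_embedding_file are hypothetical, and that the preprocessing script accepts any file in this layout is an assumption based on it being pointed at the GloVe file.

```python
def parse_embedding_line(line, dim=300):
    """Parse one line of a GloVe-style text file into (word, vector)."""
    fields = line.rstrip("\n").split(" ")
    # Take the last `dim` fields as the vector, so that a token which itself
    # contains spaces is still read correctly.
    word = " ".join(fields[:-dim])
    vector = [float(x) for x in fields[-dim:]]
    return word, vector


def check_embedding_file(handle, dim=300, max_lines=1000):
    """Return the number of lines checked; raise ValueError on a malformed line."""
    checked = 0
    for line in handle:
        if checked >= max_lines:
            break
        word, vector = parse_embedding_line(line, dim)
        if not word or len(vector) != dim:
            raise ValueError(f"malformed line {checked + 1}")
        checked += 1
    return checked
```

For example, with open("glove.840B.300d.txt", encoding="utf-8") as f: check_embedding_file(f) would check the first 1000 lines. A self-trained embedding exported in the same one-token-per-line text layout should pass the same check.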

suresh96458 · Answer 4 · Mon Feb 03 2020 19:55:20 GMT+0800 (China Standard Time)

@indrajithi
Were you able to give your own custom data as input?

suresh96458 · Answer 5 · Thu Mar 19 2020 14:40:09 GMT+0800 (China Standard Time)

@xinyadu can you help with how to give custom input data for predictions? We have no idea how to convert the raw data into the processed folder.