
wellington

########################GENERAL####################################
This project performs binary text classification with a TF-IDF model and a word-embedding model. It needs training data and testing data as input; the data to be classified can either be collected from MongoDB or read directly from a local file.

########################COMMAND#####################################
The project is run from the command line. There are three kinds of command patterns.
(1) Connect to MongoDB

command pattern->   
		A. -c ipaddr portnum username password dbname collection txtid txt limitation skip  

-c means connect to MongoDB. In the command, the arguments refer to:

		 1. ipaddr -> IP address of the MongoDB server to connect to;  
		 2. portnum -> port number used to connect to the database;  
		 3. username -> username with access rights to the database;  
		 4. password -> password matching the username above;  
		 5. dbname -> database storing the data to be predicted;  
		 6. collection -> collection (table) storing the detailed data in the database;  
		 7. txtid -> key in the collection; the item id of the data content, uniquely 
		           identifying each document;  
		 8. txt -> key in the collection; refers to the detailed data content;  
		 9. limitation -> maximum number of data items collected from the database;  
		 10. skip -> start position for collecting data items from the database.  

The data collected from the database is stored in a file named "test.xlsx", which is then
used as the testFile in the model selection part below.
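
A rough sketch of what the -c step does, assuming the standard MongoDB Java driver: connect with the given credentials, read documents from the collection with the given skip and limit, and pick out the txtid and txt fields. The class name, URI, and field names below are only illustrative placeholders, not the project's actual code.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class MongoCollectSketch {
        public static void main(String[] args) {
            // Placeholders standing in for: ipaddr, portnum, username, password, dbname.
            String uri = "mongodb://username:password@ipaddr:27017/dbname";

            try (MongoClient client = MongoClients.create(uri)) {
                MongoCollection<Document> coll =
                        client.getDatabase("dbname").getCollection("collection");

                // skip and limit mirror the 'skip' and 'limitation' arguments.
                for (Document doc : coll.find().skip(0).limit(50000)) {
                    Object id = doc.get("_id");                      // txtid key
                    String text = doc.getString("post_plaintext");   // txt key
                    System.out.println(id + "\t" + text);
                    // The real project writes these id/text rows into test.xlsx instead.
                }
            }
        }
    }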

(2) Select classification model

command pattern->  
                   A. -t trainFile testFile  
                   B. -e trainFile testFile dictionary  
                   C. -b trainFile testFile dictionary  

-t means use only the TF-IDF model for classification  
-e means use only the embedding model for classification  
-b means use both models for classification  
In the command, the arguments refer to:

         1. trainFile -> xlsx file name, stores the training data;  
         2. testFile -> xlsx file name, stores the data to be classified;  
         3. dictionary -> file holding the word embedding vectors; if set to -1, 
                          the project trains its own word vector dictionary from 
                          the training file and the testing file (see the sketch 
                          after this list).  
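
When dictionary is set to -1, the self-training of word vectors could look like the following sketch, which uses deeplearning4j's word2vec (one of the resources listed under RESOURCE SUPPORT). This is only an assumed illustration, not the project's own code; the file names follow the intermediate files described in the DESCRIPTION section and the parameters are placeholders.

    import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
    import org.deeplearning4j.models.word2vec.Word2Vec;
    import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
    import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

    public class TrainWordVectorsSketch {
        public static void main(String[] args) throws Exception {
            // One segmented sentence per line (cf. sentences.txt in DESCRIPTION).
            SentenceIterator iter = new BasicLineIterator("sentences.txt");
            TokenizerFactory tokenizer = new DefaultTokenizerFactory();

            Word2Vec vec = new Word2Vec.Builder()
                    .minWordFrequency(2)   // ignore very rare words
                    .layerSize(100)        // dimension of each word vector
                    .windowSize(5)
                    .iterate(iter)
                    .tokenizerFactory(tokenizer)
                    .build();
            vec.fit();

            // Save the self-trained dictionary (cf. vec.txt in DESCRIPTION).
            WordVectorSerializer.writeWordVectors(vec, "vec.txt");
        }
    }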

In this part, the training xlsx file and the test file should use the following

			cell pattern:  
			id     plain_text   classification  

where id refers to the unique identifier of one data item,
plain_text refers to the detailed content of the data item to be classified, and
classification refers to the classification result (label) of the data item.
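
For illustration, the sketch below reads an xlsx file in this cell pattern with Apache POI. It is only meant to show the expected layout; the class name is made up and the project's actual reader may differ.

    import java.io.File;
    import org.apache.poi.ss.usermodel.*;

    public class XlsxReaderSketch {
        public static void main(String[] args) throws Exception {
            // Each row: column 0 = id, column 1 = plain_text, column 2 = classification.
            try (Workbook wb = WorkbookFactory.create(new File("train.xlsx"))) {
                Sheet sheet = wb.getSheetAt(0);
                for (Row row : sheet) {
                    // Note: getCell may return null for empty cells in real data.
                    String id = row.getCell(0).toString();
                    String plainText = row.getCell(1).toString();
                    String classification = row.getCell(2).toString();
                    System.out.println(id + "\t" + plainText + "\t" + classification);
                }
            }
        }
    }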

(3) Combination

command pattern->  
			A. -n oldFile newFile    

-n means combine oldFile and newFile into a single file named after oldFile (see the sketch after the argument list below).
In the command, the arguments refer to:

					1. oldFile -> xlsx file name, stores the old training data  
					2. newFile -> xlsx file name, stores the new data that can be 
								  used as training data  
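
A rough sketch of the combination step, again assuming Apache POI: the rows of newFile are appended after the rows of oldFile and the result is written back under the old file's name. The class name and file handling are simplified illustrations, not the project's actual code.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.poi.ss.usermodel.*;

    public class CombineXlsxSketch {
        public static void main(String[] args) throws Exception {
            // Load both workbooks into memory so old.xlsx can be overwritten later.
            try (Workbook oldWb = WorkbookFactory.create(new FileInputStream("old.xlsx"));
                 Workbook newWb = WorkbookFactory.create(new FileInputStream("new.xlsx"))) {

                Sheet oldSheet = oldWb.getSheetAt(0);
                Sheet newSheet = newWb.getSheetAt(0);
                int next = oldSheet.getLastRowNum() + 1;

                // Append every row of new.xlsx after the last row of old.xlsx.
                for (Row srcRow : newSheet) {
                    Row dstRow = oldSheet.createRow(next++);
                    for (Cell srcCell : srcRow) {
                        dstRow.createCell(srcCell.getColumnIndex())
                              .setCellValue(srcCell.toString());
                    }
                }

                // The combined result is stored under the old file's name.
                try (FileOutputStream out = new FileOutputStream("old.xlsx")) {
                    oldWb.write(out);
                }
            }
        }
    }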

Output file names of the model selection commands in part (2):

					A-> tt_result.xlsx(positive result)+nt_result(false positive result)  
					B-> te_result.xlsx(positive result)+ne_result(false positive result)  
					C-> t_result.xlsx(positive result)+n_result(false positive result)  

Running C will also run A and B, which means the results of A and B will also appear when running C.

#############################DESCRIPTION##################################
During the classification process, several intermediate files are produced:

			1.  segments.txt-> in tf-idf model, segmentation result of each item with  
								simple word frequency  
			2.  label.txt-> in tf-idf model, each word with its tf-idf value  
			3.  labelF.txt-> in tf-idf model, selected words as attributes with high tf-idf values  
			4.  train1.txt-> in tf-idf model, training data matching libsvm input pattern 
								(see the sketch after this list)  
			5.  train2.txt-> in tf-idf model, testing data matching libsvm input pattern  
			6.  model_r.txt-> in tf-idf model, model result of training with data on libsvm  
			7.  out_r.txt-> in tf-idf model, prediction results of testing data based on the training model  

			8.  sentences.txt-> in embedding model, simple segmentation result of each item  
			9.  vec.txt-> in embedding model, word vector dictionary after self-training with own material  
			10. trainE1.txt-> in embedding model, training data matching libsvm input pattern  
			11. trainE2.txt-> in embedding model, testing data matching libsvm input pattern  
			12. modelE_r.txt-> in embedding model, model result of training with data on libsvm  
			13. outE_r.txt-> in embedding model, prediction results of testing data based on the training model  
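
The files described as "matching libsvm input pattern" use libsvm's sparse text format: one item per line, a class label followed by index:value pairs. The sketch below builds a tiny problem in memory and trains/predicts with the libsvm Java package (linked under RESOURCE SUPPORT); all values and parameters are placeholders, not the project's actual settings.

    import libsvm.*;

    public class LibsvmSketch {
        public static void main(String[] args) {
            // libsvm text format, e.g. one line of train1.txt / trainE1.txt:
            //   1 3:0.42 17:0.08 95:0.31
            // (class label, then feature_index:feature_value pairs)

            // A tiny training problem with two items and two features.
            svm_problem prob = new svm_problem();
            prob.l = 2;
            prob.y = new double[] {1.0, 0.0};
            prob.x = new svm_node[][] {
                    {node(1, 0.42), node(2, 0.08)},
                    {node(1, 0.05), node(2, 0.91)}
            };

            svm_parameter param = new svm_parameter();
            param.svm_type = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;
            param.C = 1;
            param.gamma = 0.5;
            param.cache_size = 100;
            param.eps = 0.001;

            svm_model model = svm.svm_train(prob, param);          // cf. model_r.txt / modelE_r.txt
            double predicted = svm.svm_predict(model, prob.x[0]);  // cf. out_r.txt / outE_r.txt
            System.out.println("predicted label: " + predicted);
        }

        private static svm_node node(int index, double value) {
            svm_node n = new svm_node();
            n.index = index;
            n.value = value;
            return n;
        }
    }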

########################EXAMPLE##################################

1. -c  115.146.92.242 27017 healthvis 2018healthvis healthvis wap_posts_data _id post_plaintext 50000 0  

which means connect to MongoDB at 115.146.92.242:27017 with username healthvis and password 2018healthvis,
then collect data from the database healthvis, collection wap_posts_data, selecting the values of the keys
_id and post_plaintext, with at most 50000 items, starting from the beginning of the collection.

2. -b train.xlsx test.xlsx sgns.weibo.word  

which means use both the embedding model and the tf-idf model to train on the data in train.xlsx and classify
the data in test.xlsx; the vector dictionary used by the embedding model is sgns.weibo.word.

3. -n old.xlsx new.xlsx  

which means combine the data in old.xlsx and new.xlsx; the result will be stored in old.xlsx.

Be careful: when running in an IDE, all input files should be in the project root directory.
When running with the jar file, the command should start with java -jar and all input files
should be in the directory where the jar file is, like

java -jar <jarfile> -b train.xlsx test.xlsx sgns.weibo.word

######################RESOURCE SUPPORT####################################
https://deeplearning4j.org/cn/archieved/zh-word2vec
https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html
https://nlp.stanford.edu/software/
https://www.jiqizhixin.com/articles/2018-05-15-10
