LYHTHU / nlp_hw_07

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

NLP Homework 07

Yunhao Li, NetID:

Best dev score with and without the embeddings

  • Without embeddings: F1 = 82.93

  • With embeddings(fixed threshold): F1 = 80.29

  • With additional methods and embeddings:
    • With binarization:
      F1 = 80.01

    • With cluster:
      N_cluster = 20
      F1 = 84.94
      Note: the result of cluster is not fixed every time. Because the K-means has some stochastic process.

Base

  • Implemented the binarization with a fixed threshold.
    I use a single threshold to binarilize all 50 dimensions.
    And the threshold I find best is:
    Threshold = 0.18
    F1 = 80.29

The additional methods:

  • Implemented the binarization described in Revisiting Embedding Features for Simple Semi-supervised Learning. For each dimension, I calculate the mean of all the positive values and the mean of the negative values. Then I use these two means (trshd_pos[0..49], trshd_neg[0..49]) as the threshold of the value. If the value vec[i] that is greater than trshd_pos[i], then use +1 as the binarized result; or if it is less than trshd_neg[i], then use -1 as the result; or the result is 0. And finally I put the 50 dimensions of the word as 50 features and add them into the feature file to train the model or get tagged.

+ Implemented the cluster method.
I used the K-means method provided by `sklearn`. And running the kmeans on the word embedding vectors of glove.6B.50d.txt. After training I use the trained model to predict the class for each word, and add its class as a feature. I tried several `N_cluster` params and finally choose `N_cluster = 20`.

Required library

  • sklearn
  • numpy

Running

Create a folder named ./glove.6B, and move the glove.6B.50d.txt into it. Put the corpus files and MEtag.java and MEtrain.java and and the required jar file in ./. Set the inmode in the WordEmbedding.py to set the embedding methods.

  • "bin" -> Binariazation method with fixed threshold
  • "bin_mean" -> Additional binariazation method with mean-value threshold
  • "cluster" -> Cluster methods with K-means.

Then run the WordEmbedding.py.

About


Languages

Language:Python 84.0%Language:Java 16.0%