nooralahzadeh / feat2vec

Code of NAACL paper "Unsupervised Multi-Domain Adaptation with Feature Embeddings"

Feature Embedding

Author: Yi Yang


Basic Description

Python code for


  • Install gensim by
    • pip install --upgrade gensim
  • If you want a faster version of this tool, you may also want to
    • install Cython by
      • pip install cython
    • compile the code by running
      • python build_ext --inplace


A demo for saving feature embeddings to a txt/bin file is available (python -h).

Given a feature file (data/twitter_feat.txt) in which each line corresponds to features of one instance, save feature embeddings to a txt file (data/twitter_embeddings.txt):

  1. If the features use a bag-of-words (BoW) representation (no feature templates involved):
  • python --bow 1 --dim 25 data/twitter_feat.txt data/twitter_embeddings.txt
  2. If the features use a structured representation (features extracted by feature templates) and you have the feature-template mapping file (data/twitter_feat_template.txt):
  • python --feature_template_file data/twitter_feat_template.txt --dim 25 data/twitter_feat.txt data/twitter_embeddings.txt
  3. If the features use a structured representation (features extracted by feature templates) and you have the template prefix file (data/twitter_template_prefix.txt):
  • python --template_prefix_file data/twitter_template_prefix.txt --dim 25 data/twitter_feat.txt data/twitter_embeddings.txt
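Once data/twitter_embeddings.txt has been written, the vectors can be read back for inspection. The sketch below assumes the txt output follows the word2vec text layout that gensim commonly writes (an optional "count dim" header, then one "feature v1 ... vd" row per line); this README does not confirm the layout, so check it against your own output file. The function name load_embeddings and the sample feature names are hypothetical.

```python
import io


def load_embeddings(lines):
    """Parse word2vec-style text rows into a {feature_name: [floats]} dict.

    Assumption: space-separated rows, with an optional 'count dim' header.
    """
    embeddings = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Skip the optional header line, e.g. "2 3" (vocabulary size, dimension).
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        # Skip blank or malformed rows.
        if len(parts) < 2:
            continue
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Illustrative in-memory sample; replace with open("data/twitter_embeddings.txt").
sample = io.StringIO("2 3\nword=the 0.1 0.2 0.3\nsuffix=ing -0.4 0.5 0.6\n")
vectors = load_embeddings(sample)
print(len(vectors), len(vectors["word=the"]))
```

If the file really is in word2vec format, gensim's KeyedVectors.load_word2vec_format should be able to read it as well.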

See save_features method of for how to generate data/twitter_feat.txt and data/twitter_feat_template.txt files given files in CONLL POS format.
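As a rough illustration of what such a conversion involves, the sketch below reads tab-separated "token TAB tag" lines with a blank line between sentences and emits one space-separated feature line per token. It is not the repository's save_features method: the function names, the feature names (word=, prev=, next=, suffix3=), the space delimiter, and the sample tags are all assumptions made for the example.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, tag); a blank line ends a sentence."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tag = line.split("\t")[:2]
        sentence.append((token, tag))
    if sentence:
        yield sentence


def token_features(sentence, i):
    """Build hypothetical template-style features for the i-th token."""
    token = sentence[i][0]
    prev_tok = sentence[i - 1][0] if i > 0 else "<s>"
    next_tok = sentence[i + 1][0] if i + 1 < len(sentence) else "</s>"
    return [
        "word=" + token.lower(),
        "prev=" + prev_tok.lower(),
        "next=" + next_tok.lower(),
        "suffix3=" + token[-3:].lower(),
    ]


# Illustrative input: two short sentences with made-up tags.
conll = ["I\tO", "love\tV", "tweets\tN", "", "ok\t!"]
feature_lines = [
    " ".join(token_features(sent, i))
    for sent in read_conll(conll)
    for i in range(len(sent))
]
for line in feature_lines:
    print(line)
```

Each printed line corresponds to one instance (here, one token), which matches the "one instance per line" shape of the feature file described above.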

Domain Adaptation for Twitter POS tagging

A lightweight demo for part-of-speech tagging of tweets is also provided, using data from the CMU Twitter NLP project.

The oct27 dataset is used as source data and the daily547 dataset as target data. We also randomly sample some unlabeled tweets (see the data/twitter folder).

Run the demo:

  1. Prepare the data (extract features, select pivots, etc.) by running
  • python
  2. Obtain the baseline (no adaptation) SVM tagging results by running
  • python none
  3. Obtain the marginalized Denoising Autoencoders adaptation results by running
  • python mldae
  4. Obtain the feature embedding adaptation results by running
  • python feat2vec

The first step creates the file data/dataset_twitter.pkl. I got results of 0.8839, 0.8889 and 0.8924 for steps 2, 3 and 4, respectively. The feat2vec results may vary a little because of the negative sampling technique. You should obtain even better results with feat2vec by using more unlabeled data.
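To put the three reported numbers in perspective, the short calculation below derives each method's absolute gain over the no-adaptation baseline and the corresponding relative error reduction. It uses only the figures quoted above; the variable names are illustrative.

```python
# Results reported above for steps 2, 3 and 4.
results = {"none": 0.8839, "mldae": 0.8889, "feat2vec": 0.8924}

baseline = results["none"]
summary = {}
for name, acc in results.items():
    gain = acc - baseline                      # absolute improvement
    error_reduction = gain / (1.0 - baseline)  # share of baseline errors removed
    summary[name] = (gain, error_reduction)
    print(f"{name:9s} acc={acc:.4f} gain={gain:+.4f} "
          f"error_reduction={error_reduction:.1%}")
```

Because feat2vec results vary slightly between runs, a gap of this size is best confirmed by repeating the run a few times.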

