ataki / deep-learning-gender

CS 224D Final Project

blog-gender-dataset

Maintains dataset generation procedure for our deep-learning project.

Authors: Jim Zheng, Aric Bartle

Reduced Vocab

  • download word-frequency data
  • prune the data to keep the top N%
  • output the word-vector => word mapping
  • files: wordvector.txt, vocab.txt, vocab.pdb
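
The pruning step above might look roughly like the following. This is a minimal sketch, not the project's actual script: the function name `prune_vocab`, the in-memory `freqs` dictionary, and the reading of "top N%" as a percentage of distinct words (rather than of total token mass) are all assumptions made for illustration.

```python
def prune_vocab(freqs, top_percent):
    """Return the most frequent words, keeping `top_percent` percent of them.

    `freqs` maps word -> count. ASSUMPTION: "top N%" is taken to mean
    N percent of the distinct words, ranked by count.
    """
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    # Always keep at least one word so the vocabulary is never empty.
    keep = max(1, int(len(ranked) * top_percent / 100))
    return [word for word, _ in ranked[:keep]]


# Hypothetical frequency table, for illustration only.
example_freqs = {"the": 100, "a": 50, "cat": 5, "zyzzyva": 1}
reduced = prune_vocab(example_freqs, 50)
```

The resulting word list could then be used to filter a word-vector table down to the reduced vocabulary before writing the output files.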

Blog Cleanup

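  • go through each blog
    • remove Unicode characters
    • extract words, dropping punctuation
    • convert everything to lowercase
    • replace numbers with the token DG
    • replace out-of-vocabulary words with the token UUNNGG
    • parameter k sets the maximum number of sentences per example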
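
The cleanup steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the repository's code: the names `clean_sentence` and `clean_blog` are hypothetical, "remove unicode" is read as dropping non-ASCII characters, sentences are split naively on `.`, `!` and `?`, and k is read as the size of the sentence groups that form each example.

```python
import re

NUM_TOKEN = "DG"       # placeholder for numbers, as named in the list above
UNK_TOKEN = "UUNNGG"   # placeholder for out-of-vocabulary words


def clean_sentence(sentence, vocab):
    """Turn one raw sentence into a list of normalized tokens."""
    # ASSUMPTION: "remove unicode" means dropping non-ASCII characters.
    ascii_text = sentence.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then keep only alphanumeric runs (this drops punctuation).
    tokens = re.findall(r"[a-z0-9]+", ascii_text.lower())
    cleaned = []
    for tok in tokens:
        if tok.isdigit():
            cleaned.append(NUM_TOKEN)
        elif tok in vocab:
            cleaned.append(tok)
        else:
            cleaned.append(UNK_TOKEN)
    return cleaned


def clean_blog(text, vocab, k):
    """Split a blog into examples of at most k cleaned sentences each."""
    # ASSUMPTION: naive sentence splitting on ., ! and ?.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    cleaned = [clean_sentence(s, vocab) for s in sentences]
    cleaned = [c for c in cleaned if c]  # skip sentences with no tokens
    return [cleaned[i:i + k] for i in range(0, len(cleaned), k)]


# Hypothetical vocabulary and blog text, for illustration only.
example_vocab = {"i", "have", "cats", "them", "we"}
examples = clean_blog("I have 3 cats. I löve them! We agree.", example_vocab, 2)
```

With k = 2, the three sentences in the sample text are grouped into two examples: the first holds two sentences and the second holds the remaining one.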
