Creates a program that takes a file like WSJ_02-21.pos-chunk as input and produces a file consisting of feature value pairs for use with the maxent trainer and classifier.
- Features of the word itself: POS, bio, word, capitalization
- Implemented a check to see whether word is a name or not
- Used a dictionary of words that may indicate a name (Mr., Mrs, Dr., etc)
- Words that followed these terms and were captilized were termed names
- Words that followed/preceded names that were capitilized were also deemed names as they are most likely part of the same name
- Implemented a check to see whether word is an organization or not
- Used a dictionary of words that may indicate an org (Inc., Incorporated, Co.)
- Words that preceded/follwed words with organization tag and that were capitilized also were tagged as organization
- Applied above features to previous/next 6 words
- 31390 out of 32853 tags correct
- accuracy: 95.55
- 8378 groups in key
- 8870 groups in response
- 7624 correct groups
- precision: 85.95
- recall: 91.00
- F1: 88.40