phucng/geolocator-3.0

Read Me

Input: Grammatical text or Twitter text. Input the full JSON file for best results in geocoding.

Functionality (geoparsing): Finds toponyms and the names of streets, buildings, and businesses. One version also outputs unnamed locations (such as "garage" or "parking lot").

Functionality (geocoding): Attaches latitude and longitude to toponyms (whether mined from documents or tweets). Does not assign latitude and longitude to streets or buildings.

The machine-learning-based NER for locations uses either Conditional Random Fields or the Averaged Perceptron algorithm, and there are two rule-based models in addition to that. If you want to re-train the geolocator, we suggest training directly with a conditional random fields implementation that you are familiar with, rather than working through the geolocator package to get the model trained. You can then apply the additional rule-based algorithms alongside the model you trained (the rule-based parsers are English only, and cover buildings and streets only).

/////////////// Introduction ///////////////

Tagging the command line input

The output format for the command line and batch file: each recognized location is one of these types: TP, tp, ST, st, BD, bd, AB, ab. The uppercase types (TP, ST, BD, AB) are output by the Named Entity Recognizer. The lowercase types (tp, st, bd, ab) are output by the rule-based and toponym-lookup parsers.

The major change is that another fine-grained NER module has been added, which helps extract streets, businesses, and unnamed locations. Its output tags are TOPONYM, UNNAMEDLOCATION, BUSINESS, and STREET. A B- or I- prefix may also be attached to a tag: for example, B-STREET marks the beginning of a street and I-STREET marks an internal part of the street. There is no tag for the end of a location string, only B (beginning) and I (internal).
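As an illustration of how the B and I prefixes can be read, here is a minimal Python sketch that groups tagged tokens into whole location strings. It is not part of the geolocator package. The function name group_locations, the (token, tag) pair layout, and the "O" tag for non-location tokens are all assumptions made for this example; the actual output layout of the tool may differ.

```python
from typing import List, Tuple


def group_locations(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (location_text, tag) spans.

    Hypothetical helper: the (token, tag) input layout and the "O" tag
    for non-location tokens are assumptions, not the package's format.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def flush() -> None:
        # Close the span being built, if any, and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in tagged:
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label:
            # A B- tag always starts a new location.
            flush()
            current_tokens, current_type = [token], label
        elif prefix == "I" and label and label == current_type:
            # An I- tag continues the open location of the same type.
            current_tokens.append(token)
        else:
            # Anything else ends the open location. There is no explicit
            # end tag, so the span closes at the first non-matching token.
            flush()
    flush()
    return spans


example = [
    ("Meet", "O"), ("at", "O"),
    ("Forbes", "B-STREET"), ("Avenue", "I-STREET"),
    ("in", "O"), ("Pittsburgh", "B-TOPONYM"),
]
print(group_locations(example))
```

In this sketch an I- tag that does not follow a matching B- tag is dropped; that is a design choice of the example, not documented behavior of the tagger.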

The geocoding result can output all the information stored in the GeoNames gazetteer for a location, such as country, state info, latitude and longitude, and geographical feature type (city, country, state, mountain, airport, and so on). The meaning of each specific type can be looked up at GeoNames.org.

Note that each geoparsing and geocoding output has a confidence value.

The geoparsing confidence value is an aggregate of the confidence values from the individual parsers: the NER parser has confidence 0.85, the toponym parser 0.65, and the stbd parser 0.85. If the same location is recognized by more than one parser, the confidence values are combined with the formula a+0.1b+0.1c, where a, b, and c are the confidence values from each parser.
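The a+0.1b+0.1c formula can be sketched in a few lines of Python. This is an illustration only, not the package's code. The text does not say which parser supplies a, so the sketch assumes a is the largest confidence and that a parser which did not recognize the location contributes 0. The names PARSER_CONFIDENCE and combine_confidence are hypothetical.

```python
from typing import Sequence

# Per-parser confidence values as stated in this Read Me.
PARSER_CONFIDENCE = {"ner": 0.85, "toponym": 0.65, "stbd": 0.85}


def combine_confidence(confidences: Sequence[float]) -> float:
    """Combine up to three parser confidences as a + 0.1*b + 0.1*c.

    Assumptions (not confirmed by the source): a is the largest value,
    and missing parsers count as 0.
    """
    if not 1 <= len(confidences) <= 3:
        raise ValueError("expected between one and three confidence values")
    # Sort so the strongest parser plays the role of a, then pad with zeros.
    ranked = sorted(confidences, reverse=True)
    a, b, c = (ranked + [0.0, 0.0])[:3]
    return a + 0.1 * b + 0.1 * c


# A location found by both the NER parser and the toponym parser.
score = combine_confidence(
    [PARSER_CONFIDENCE["ner"], PARSER_CONFIDENCE["toponym"]]
)
print(round(score, 3))
```

Under these assumptions a location found by a single parser keeps that parser's confidence, and each additional parser that agrees adds one tenth of its own confidence.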

The geocoding confidence is generated in a similar way, using the same a+0.1b+0.1c formula. Here, however, a, b, and c are the confidence values of the MLGeocoder and minimalityGeoCoder results.

/////////////// How to Install: ///////////////

The algorithm can run on Windows, Mac, or Linux/Unix platforms.

1. Check out the project. In Eclipse, try Import -> Projects from Git.

After checking out the project into your Eclipse workspace, open a terminal (on Linux or Mac OS X) or Cygwin (on Windows), cd to the geo-locator folder, and run install.sh to install the software. This is a long process because it has to download jar files and resources from GeoNames, and the most time-consuming step is indexing the GeoNames data. The estimated time is about 1 hour, but it varies with your machine. To run the fuzzy match algorithm in edu.cmu.geoparser.nlp.spelling, please see the instructions in the FuzzyGeoMatch project.

Please email wei.zhang@cs.cmu.edu or gelern@cs.cmu.edu if you find any bugs or have any questions or suggestions.

Thank you.
