rjagerman / nlp

Natural Language Processing

This repository contains the source code for the entity linking project, part of the Natural Language Processing course at ETHZ. It is built with Julia v0.3.6.

Installation

  • Make sure Julia is installed and that julia can be called from the command line.
  • Clone this repository to a location on your computer and change into that directory.
  • Create a cache/ folder in the repository folder.
  • Download the data.zip file (1.5 GB zipped, 6 GB unzipped) and unzip it into the repository folder.
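As a rough command-line sketch of these steps (the repository URL is inferred from the repository name, and data.zip must be obtained from the download link in the original README before the last step):

    git clone https://github.com/rjagerman/nlp.git
    cd nlp
    mkdir cache
    # place the downloaded data.zip in this folder, then:
    unzip data.zip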

Dependencies

This Julia project depends on several Julia packages:

  • DataStructures: Priority queue
  • Distances: Cosine distance metric
  • Gumbo: HTML parsing
  • GZip: Streaming gzip files
  • Iterators: Iterative mapping
  • JSON: JSON parsing
  • LightXML: Reading/writing XML files
  • Match: Scala-like match/case statements
  • PyCall: Calling python libraries from julia
  • Requests: HTTP requests

You can install these dependencies by running julia Dependencies.jl in the repository folder.
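The contents of Dependencies.jl are not reproduced here, but a minimal sketch of such a script, assuming it simply adds the packages listed above via the Julia v0.3 package manager, could look like this:

    # Install the packages this project depends on (Julia v0.3 package manager).
    packages = ["DataStructures", "Distances", "Gumbo", "GZip", "Iterators",
                "JSON", "LightXML", "Match", "PyCall", "Requests"]
    for pkg in packages
        Pkg.add(pkg)
    end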

If you wish to train the LDA model yourself, you will also need vw (Vowpal Wabbit), Python, and the nltk Python package. Instructions can be found in the scripts/vw-wikipedia.jl file. You can download the Wikipedia corpus in JSON format, on which we trained our LDA model, here (5GB zipped).

Run

To run the application:

julia Main.jl <algorithm> <query-file> [output-file]

Here <algorithm> is one of tagme, naive, or lda. The <query-file> parameter is the path to the XML file containing the queries. Optionally, you can specify an output file to which the annotator writes its predictions in XML format.
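For example, a hypothetical invocation that runs the tagme annotator on a query file and writes its predictions to an output file (the file names are placeholders):

    julia Main.jl tagme queries.xml predictions.xml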
