Natural Language Processing
This is the source code for the entity linking project, which is part of the Natural Language Processing course at ETHZ. It is built using Julia v0.3.6.
Installation
Make sure you have Julia installed and can call julia from the command line. Clone this repository to a location on your computer and browse to that location. Create a cache/ folder in the repository folder. Download the data.zip file (1.5GB zipped, 6GB unzipped) and unzip it into the repository folder.
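If you prefer to script this setup from the Julia REPL, a rough sketch (assuming data.zip has already been downloaded into the repository folder and that an unzip command is available on your system) looks like this; the path below is a placeholder for wherever you cloned the repository:

    cd("path/to/repository")           # hypothetical path to your clone
    isdir("cache") || mkdir("cache")   # create the cache/ folder if it is missing
    run(`unzip data.zip`)              # unpack the data into the repository folder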
Dependencies
This Julia project depends on several Julia packages:
DataStructures: Priority queue
Distances: Cosine distance metric
Gumbo: HTML parsing
GZip: Streaming gzip files
Iterators: Iterative mapping
JSON: JSON parsing
LightXML: Reading/writing XML files
Match: Scala-like match/case statements
PyCall: Calling Python libraries from Julia
Requests: HTTP requests
You can install these dependencies by running julia Dependencies.jl in the repository folder.
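If you would rather add the packages by hand, the script presumably amounts to Pkg.add calls such as the following sketch (based on the package list above, not on the actual contents of Dependencies.jl):

    # Add each required package through Julia's package manager.
    for pkg in ["DataStructures", "Distances", "Gumbo", "GZip", "Iterators",
                "JSON", "LightXML", "Match", "PyCall", "Requests"]
        Pkg.add(pkg)
    end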
If you wish to train the LDA model yourself, you will also need vw, python, and the nltk python package. Instructions can be found in the scripts/vw-wikipedia.jl file. You can download the Wikipedia corpus in JSON format, on which we trained our LDA model, here (5GB zipped).
Run
To run the application:
julia Main.jl <algorithm> <query-file> [output-file]
where algorithm is one of tagme, naive, or lda. The query-file parameter should be the path to the XML file containing the queries. Optionally, you can specify an output file to which the annotator will write its predictions in XML format.
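For example, to annotate a hypothetical queries.xml with the tagme annotator and write the predictions to predictions.xml:
julia Main.jl tagme queries.xml predictions.xml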