sgrimes / nytimes-solr-indexer

Solr Indexer for the New York Times Annotated Corpus

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Solr Indexer for The New York Times Annotated Corpus
---
This script will untar and index the New York Times Corpus NITF xml dump provided by the Linguistic Data Consortium (LDC).

It also contains an example solr config files to help you get started. These are just examples and you should change these to fit your system!

To use the solr schemas, copy schema.xml and solrconfig.xml to your solr/config folder.

To run the indexer script:
1. install jruby from http://jruby.org/ 
2. install rsolr ruby gem: jgem install rsolr rsolr-direct
3. Copy timestools.jar from the "nytimes data/build/timestools.jar" directory
4. jruby nytimes_solr_indexer.rb "path/to/nytimes data/data" "/path/to/solr/home"

*You can also post to a solr server by specifying a solr server in the second parameter:
 jruby nytimes_solr_indexer.rb "path/to/nytimes data/data" http://localhost:8983/solr

NOTES:
Tested on Mac OSX 10.6 with JRuby 1.4.0. Should work on any linux systems.
I use the tar unix app so Windows users may need to download cygwin or msys. Alternatively, you can comment the line and untar manually.

The included solr schema.xml file is all-inclusive by default. It is set to index and store all fields.
If you do not intend to use all fields, you can notice a improvement by not indexing the unused fields or not storing the fields.
In schema.xml, just comment out a field
<!-- <field name="biographicalCategories" type="string" indexed="true" stored="true" multiValued="true"/> -->
or remove stored="true" to not store the field.

Refer to the Solr documentation for more information: http://wiki.apache.org/solr/SchemaXml

--
May 13 2010:
README update.

May 5 2010:
Added missing onlineLeadParagraph. There are now 50 fields in the indexer and 50 in the provided solr schema. 
Commented out data import handler. There is an example data-config.xml file provided with solr if you wish to use it.
Thanks to Aditi S Muralidharan.

May 4 2010:
Added require 'date' and uncommented untar command.
Thanks to Aditi S Muralidharan.


Questions, bugs or need some help?
Contact the author Tommy Chheng
@tommychheng
http://tommy.chheng.com

About

Solr Indexer for the New York Times Annotated Corpus