USCDataScience / parser-indexer-py

Python tools for parsing documents and building an inverted index with enriched metadata. A Java version with slightly different features is available at https://github.com/USCDataScience/parser-indexer

Parser-Indexer

This project contains tools for parsing files and indexing them into Solr. It also includes tools for information extraction.

For instructions on training a custom Named Entity Recognition model with Stanford CoreNLP, see src/corenlp

Requirements

  1. Solr
  2. Parser Server

1. Setting up Solr

Download Solr

mkdir workspace && cd workspace
wget http://archive.apache.org/dist/lucene/solr/6.1.0/solr-6.1.0.tgz
tar xvzf solr-6.1.0.tgz
cd solr-6.1.0

Start Solr and Create a Core

PORT=8983
bin/solr start -p $PORT
bin/solr create_core -c docs -d $YOUR_PATH/conf/solr/docs -p $PORT
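
To confirm that the core was created, one option is to query Solr's CoreAdmin STATUS endpoint. The following is a minimal sketch, not part of this repository: the function names `core_status_url` and `core_exists` are hypothetical, and the host, port, and core name are assumed to match the commands above (`localhost`, `8983`, `docs`).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def core_status_url(host="localhost", port=8983, core="docs"):
    """Build the CoreAdmin STATUS URL for one core (hypothetical helper)."""
    query = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return f"http://{host}:{port}/solr/admin/cores?{query}"


def core_exists(host="localhost", port=8983, core="docs", timeout=5):
    """Return True if the running Solr instance reports the given core.

    Assumes Solr is already started; raises URLError if it is unreachable.
    """
    with urlopen(core_status_url(host, port, core), timeout=timeout) as resp:
        payload = json.load(resp)
    # For an unknown core, Solr returns an empty entry under "status".
    return bool(payload.get("status", {}).get(core))


# Example usage (requires a running Solr instance):
#   print(core_exists())
```

If the check returns False, the `create_core` step probably did not complete; the Solr logs are the first place to look.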

2. Parser Server

Refer to the README in the parser-server subdirectory.

Examples

Check out the docs folder.

  • To parse and index journals: docs/parser-index-journals.md


License: Apache License 2.0

