arvindv17 / Lucene-Document-Search

This is a simple Java project to perform a word search from a directory of documents. It can handle multiple Document types, from PDF to txt to XML.

Lucene-Document-Search

This project handles indexing and query search over a directory of documents. The user enters a query and gets back a list of all documents that contain it. The search tool handles PDF, XML, HTML and TXT files.

Description

This keyword-based document search is built on Lucene. Lucene is a Java library that adds text indexing and text searching to an application; these are its two main services. It is a full-text search engine written in Java. It is not a complete application but a library with an API that can be integrated to build applications.

Indexing

Any search engine, crawler or other application that searches and retrieves data needs to index that data. Databases use indexing for the same reason. An index is a cross-reference lookup structure that allows faster retrieval of results. In a database, the index keeps a reference to where each row of data is stored, so a lookup can go straight to the row instead of scanning every row. The process of building this lookup structure is called indexing, and the structure itself is called an index.

Inverted Index

An inverted index maps content, such as words, to the locations where that content occurs, rather than mapping locations to their content. This makes retrieval much faster for text-based search. A familiar example is the index at the back of a book: to find a topic, you look it up in the index and go to the listed pages rather than reading through the entire book. Lucene is built on the concept of an inverted index, which makes it well suited to building text search applications.
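The book-index idea can be sketched in a few lines of plain Java. This is a toy illustration, not the project's code and not how Lucene stores its index internally; the class name `InvertedIndexSketch`, its methods and the sample file names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document names containing it.
// Illustration only; Lucene's real index structures are far more elaborate.
class InvertedIndexSketch {
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Split text into lowercase terms on any run of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Record that every term of the text occurs in the named document.
    void addDocument(String docName, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
        }
    }

    // Look up one term; returns an empty set when no document contains it.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("notes.txt", "Lucene builds an inverted index");
        index.addDocument("report.pdf", "The index maps terms to documents");
        System.out.println(index.search("index"));
        System.out.println(index.search("lucene"));
    }
}
```

A query for a term is a single map lookup, no matter how many documents were added. That is the property that makes an inverted index attractive for text search.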

Code Links

  1. Controller
  • Different File parsing logic
  • Indexing Logic
  • Home Controller for an MVC application logic
  2. Highlighter
  • Highlighter for Lucene to display in HTML page
  3. Quality Check
  • Precision and Recall logic
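For readers unfamiliar with the Quality Check terms: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below shows those standard definitions in plain Java. It is not the repository's implementation; the class name `QualityCheckSketch`, its methods and the sample file names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Standard precision/recall over sets of document names (illustrative sketch only).
class QualityCheckSketch {
    // Number of retrieved documents that are also relevant.
    static int truePositives(Set<String> retrieved, Set<String> relevant) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return hits.size();
    }

    // Precision = relevant retrieved / all retrieved; defined here as 0 when nothing was retrieved.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(retrieved, relevant) / retrieved.size();
    }

    // Recall = relevant retrieved / all relevant; defined here as 0 when nothing is relevant.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<>(Arrays.asList("a.pdf", "b.txt", "c.xml", "d.html"));
        Set<String> relevant = new HashSet<>(Arrays.asList("a.pdf", "b.txt", "e.txt"));
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```

In the sample data, two of the four retrieved documents are relevant and two of the three relevant documents were retrieved. Returning 0 for an empty set is a convention chosen for this sketch; other conventions exist.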

Results

Once the search is complete, the results are displayed as an HTML document. Matched text is marked up with an HTML B tag. This markup is generated by the Highlighter class that is available with Lucene, which highlights the key terms of the query in the search results. The fragment extracted for display is the one that contains the most occurrences of the query terms.
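To show what B-tag markup looks like, here is a minimal plain-Java sketch that wraps whole-word, case-insensitive matches of a single term in B tags. It deliberately does not use Lucene's Highlighter and is not the project's code; the class `HighlightSketch` and its methods are hypothetical, and the fragment selection described above is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex-based stand-in for a highlighter (illustration only, not Lucene's Highlighter).
class HighlightSketch {
    // Escape the characters that are significant in HTML text.
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrap every whole-word, case-insensitive occurrence of term in <B>...</B>.
    static String highlight(String text, String term) {
        if (term.isEmpty()) {
            return escapeHtml(text); // nothing to highlight
        }
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the unmatched text before this hit, then the hit wrapped in a B tag.
            out.append(escapeHtml(text.substring(last, matcher.start())));
            out.append("<B>").append(escapeHtml(matcher.group())).append("</B>");
            last = matcher.end();
        }
        out.append(escapeHtml(text.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene is a search library; lucene indexes text.", "lucene"));
    }
}
```

The surrounding text is HTML-escaped before the tags are added, so characters such as `<` in a document cannot break the result page.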


Languages

Java 69.1%, HTML 30.9%