J0hnG4lt / AdeIndexer

A command line tool that uses Lucene to build an inverted index on a folder with .txt files and allows for the execution of efficient searches on it.

AdeIndexer

A command-line tool that builds an inverted index over a folder of .txt files and runs efficient searches against it.

Implementations

Custom in-memory solution (this indexer is used by default)

The custom in-memory index is built from two Scala collections:

  • A mutable indexed sequence whose values are file paths and whose indices serve as document IDs. This collection was chosen for its constant-time lookup by index.
  • A mutable HashMap for efficient lookups and updates. Its keys are words and its values are HashSets of document IDs. Document IDs are stored instead of paths to reduce memory use.
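
The two collections described above can be sketched as follows. This is a minimal illustration, not the code in CustomIndexer.scala: the class name, the method names and the tokenization rule (lowercase, split on anything that is not a letter or digit) are all assumptions made for the example.

```scala
import scala.collection.mutable

// Hypothetical sketch; names are illustrative and not taken from the AdeIndexer sources.
final class InMemoryIndexSketch {
  // The position in this buffer is the document ID; the value is the file path.
  private val paths = mutable.ArrayBuffer.empty[String]
  // Word -> set of IDs of the documents that contain the word.
  private val postings = mutable.HashMap.empty[String, mutable.HashSet[Int]]

  // Assumed tokenization: lowercase, then split on runs of non-letter, non-digit characters.
  private def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSeq

  // Registers a document and returns its newly assigned document ID.
  def add(path: String, text: String): Int = {
    val docId = paths.length
    paths += path
    tokenize(text).foreach { word =>
      postings.getOrElseUpdate(word, mutable.HashSet.empty[Int]) += docId
    }
    docId
  }

  // Returns the paths of all documents that contain the given word.
  def lookup(word: String): Set[String] =
    postings.get(word.toLowerCase) match {
      case Some(ids) => ids.iterator.map(id => paths(id)).toSet
      case None      => Set.empty[String]
    }
}
```

Because each posting set holds only integers, a path string is stored once in the sequence no matter how many distinct words its document contains.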

With Lucene

This solution builds the inverted index on the file system rather than keeping it in memory. To use this indexer, pass the option -n Lucene. If no -i option is specified, an index directory is created in the current directory.

Layout and conventions

src/
├── main
│   ├── resources
│   │   └── logging.properties
│   └── scala
│       └── AdeIndexer
│           ├── cli
│           │   └── ArgParser.scala
│           ├── config
│           │   ├── ArgParser.scala
│           │   └── Indexer.scala
│           ├── exceptions
│           │   └── CustomExceptions.scala
│           ├── indexer
│           │   ├── custom
│           │   │   ├── CustomIndexer.scala
│           │   │   └── CustomSearcher.scala
│           │   ├── lucene
│           │   │   ├── CountSimilarity.scala
│           │   │   ├── LuceneIndexer.scala
│           │   │   └── LuceneSearcher.scala
│           │   ├── SearcherBase.scala
│           │   └── SearcherFactory.scala
│           ├── logging
│           │   └── LoggerUtils.scala
│           ├── Main.scala
│           ├── postprocessing
│           │   └── Scaler.scala
│           └── repl
│               └── IndexingRepl.scala
└── test
    ├── resources
    │   ├── names2.txt
    │   ├── names3.txt
    │   ├── names.txt
    │   └── something_else.yaml
    └── scala
        └── AdeIndexer
            └── indexer
                ├── custom
                │   ├── CustomIndexerSuite.scala
                │   └── CustomSearcherSuite.scala
                └── lucene
                    ├── CountSimilaritySuite.scala
                    ├── LuceneIndexSuite.scala
                    └── LuceneSearcherSuite.scala


Requirements

  • openjdk version "17"
  • Maven 3.8.1
  • Developed with IntelliJ

Usage

Logging

Pass -Djava.util.logging.config.file=src/main/resources/logging.properties to java to enable the FINE logging level.

Get help

java -Djava.util.logging.config.file=src/main/resources/logging.properties -jar ./target/adeindexer-0.0.2-SNAPSHOT.jar --help

The following should appear:

AdeIndexer 0.0.2
Usage: AdeIndexer [options]

  -d, --directory <value>  d is the path to a directory with files that will be indexed.
  -i, --index-directory <value>
                           i is the path to a directory where the index will be stored
  -n, --name-indexer <value>
                           n is the name of the indexer that will be used. Options: Lucene, Custom
  -q, --query <value>      q is the query
  --help                   prints this usage text

Build an inverted index and execute a single search:

java -Djava.util.logging.config.file=src/main/resources/logging.properties \
  -jar ./target/adeindexer-0.0.2-SNAPSHOT.jar \
  -d src/test/resources/ \
  -q "Georvic Victoria"

Something like the following should appear:

Map(/home/georvic/repos/infra/AdeIndexer/src/test/resources/names.txt -> 100.0, /home/georvic/repos/infra/AdeIndexer/src/test/resources/names2.txt -> 33.333336, /home/georvic/repos/infra/AdeIndexer/src/test/resources/names3.txt -> 0.0)

Build an inverted index and wait for user input:

java -Djava.util.logging.config.file=src/main/resources/logging.properties \
  -jar ./target/adeindexer-0.0.2-SNAPSHOT.jar \
  -d src/test/resources/

Something like the following should appear:

 Hello, welcome to the AdeIndexer!


 To search for ocurrences of any set of words, just enter them separated by a space.
 For Example:
 search> Deutschland Frankreich
 You can use the following commands: :quit, :help

search> 

Type your query at the prompt and press Enter:

search> Georvic Victoria

Something like the following should appear:

Map(/home/georvic/repos/infra/AdeIndexer/src/test/resources/names.txt -> 100.0, /home/georvic/repos/infra/AdeIndexer/src/test/resources/names2.txt -> 33.333336, /home/georvic/repos/infra/AdeIndexer/src/test/resources/names3.txt -> 0.0)
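
A prompt like the one shown above could be structured roughly as follows. This is a sketch only and is not taken from IndexingRepl.scala: the object name, the Command types and the search callback are hypothetical, and only the prompt text and the :quit and :help commands come from the session output.

```scala
import scala.annotation.tailrec
import scala.io.StdIn

// Hypothetical sketch of a search prompt; not the code in IndexingRepl.scala.
object ReplSketch {
  sealed trait Command
  case object Quit extends Command
  case object Help extends Command
  final case class Search(words: Seq[String]) extends Command

  // Classifies one line of input: ":quit", ":help", or space-separated query words.
  def parse(line: String): Command = line.trim match {
    case ":quit" => Quit
    case ":help" => Help
    case query   => Search(query.split("\\s+").filter(_.nonEmpty).toSeq)
  }

  // Reads lines until ":quit" or end of input; `search` is supplied by the caller.
  @tailrec
  def loop(search: Seq[String] => String): Unit = {
    val line = StdIn.readLine("search> ")
    if (line != null) {
      parse(line) match {
        case Quit => ()
        case Help =>
          println("Enter words separated by a space. Commands: :quit, :help")
          loop(search)
        case Search(words) =>
          println(search(words))
          loop(search)
      }
    }
  }
}
```

Keeping parse separate from the read loop lets the command handling be tested without a terminal.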
