JupiterSearch is an easy-to-set-up distributed text search database designed for finding unique pieces of information or keywords, such as serial numbers, email addresses, and domain names, in huge amounts of unstructured data like websites, documents, and emails.
What JupiterSearch offers you:
- Easy to set up
- Suitable for unstructured data like emails/documents/web pages
- Can handle terabytes of data
- Client library
- Trivial horizontal scaling
What JupiterSearch is NOT good for:
- Relational data
- Extremely sensitive data (because HTTPS is not enabled by default and keys are stored in plaintext in the config files)
Todo (in chronological order from oldest to newest):
- Custom tokenization
- HTTPS
- Multiple queries
- Docker
- Make repo public
- GitHub wiki
New improvement ideas are welcome :)
- Go (preferably the latest version)
- Linux-based system or any other OS with Docker
- At least 2GB of disk space
Download JupiterSearch either by using git clone or by downloading and unpacking the zip file on this page.
git clone https://github.com/R00tendo/JupiterSearch
Run make install as root to automatically download the dependencies, compile the programs, and install JupiterServer, JupiterNode, and JupiterClient on your system (/usr/local/bin).
sudo make install
You can also run JupiterSearch with Docker. To do so, begin by creating a network.
docker network create --subnet 172.18.0.0/16 JupiterSearch
By default, the IP range for this newly created network will be 172.18.0.0/16
Before building and running the images, adjust the settings in configs/ to your liking (leave the data directory setting unchanged if you do not plan to use -v).
Now build the image(s)
# JupiterNode:
docker build -t jupiternode -f JupiterNode-Dockerfile .
# JupiterServer:
docker build -t jupiterserver -f JupiterServer-Dockerfile .
Run the image(s)
# JupiterServer:
docker run --net JupiterSearch --ip 172.18.0.50 jupiterserver
# JupiterNode:
docker run --net JupiterSearch -p 9190:9190 --ip 172.18.0.51 jupiternode #Change 9190:9190 to the correct ports if you changed the defaults
If you want persistent storage, use the -v flag to mount a directory from your host system to the container's data directory.
docker run --net JupiterSearch -p 9190:9190 --ip 172.18.0.51 -v pathfromhostsystem:/JupiterSearch/data jupiternode
- JupiterNode only
  - datadir: Path where the database will be stored
  - max_concurrent_ingests: Number of concurrent store requests that are allowed
  - name: The name shown as the source of results when you query something
- JupiterServer only
  - client_key (IMPORTANT): This is essentially the password for the whole system. Clients authenticate using this.
  - nodes (IMPORTANT): List of nodes separated by a space, like this: nodes=http://127.0.0.1:9192 http://127.0.0.1:9193
- JupiterNode and JupiterServer
  - api_listen: The host and port the REST API will be bound to
  - node_key: A key that the master server uses to authenticate itself to the node
  - tls_cert: Path to a public certificate (for encrypted REST API traffic)
  - tls_private: Path to a private key (used with tls_cert)
Open /etc/JupiterSearch/JupiterNode.conf with your favorite text editor on the machine you want to use as a node.
When you open the file, you will be greeted with these default settings:
name=main_node
datadir=data
api_listen=127.0.0.1:9192
node_key=JupiterKey
max_concurrent_ingests=5
Most of these can be left at their defaults, but I highly recommend changing node_key: if you don't and you bind JupiterNode to all interfaces, anyone on the network could get access to your node.
Unless you plan to run JupiterSearch on a single machine that hosts both JupiterServer and JupiterNode, change api_listen to bind to all interfaces or to your specific network adapter:
api_listen=0.0.0.0:9192
Open /etc/JupiterSearch/JupiterServer.conf with your favorite text editor on the machine you want to use as the master server (the one clients use to store and query data).
These are the default settings:
api_listen=127.0.0.1:9190
node_key=JupiterKey
client_key=changeme
nodes=http://127.0.0.1:9192
Change the client_key to something strong and random. Think of it as an API key: a client that has it can do everything.
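One way to get a strong, random value is to read bytes from the operating system's random source and hex-encode them. The sketch below is only an illustration: the helper name newKey is made up for this example, and it assumes that a plain hexadecimal string is an acceptable key value, which matches the alphanumeric keys shown elsewhere in this README.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newKey returns n random bytes from the OS random source,
// hex-encoded, so the resulting string has 2*n characters.
// (Hypothetical helper, not part of JupiterSearch.)
func newKey(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	key, err := newKey(32)
	if err != nil {
		fmt.Println("could not generate key:", err)
		return
	}
	// Paste the printed line into JupiterServer.conf.
	fmt.Printf("client_key=%s\n", key)
}
```

The same helper can be used to produce a fresh node_key.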
If you changed node_key from the default in the node configs, set the same key as the value of node_key in the server config as well.
Add your nodes to the nodes variable, separated by a space character.
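To make the file format concrete, here is a minimal sketch of how a key=value config like the one above could be read, and how the space-separated nodes value splits into individual URLs. This is not JupiterSearch's own parser; parseConf and nodeURLs are hypothetical names used only for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseConf reads key=value lines into a map.
// Blank lines and lines without "=" are skipped.
func parseConf(text string) map[string]string {
	conf := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, value, ok := strings.Cut(strings.TrimSpace(sc.Text()), "=")
		if !ok {
			continue
		}
		conf[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return conf
}

// nodeURLs splits the space-separated "nodes" value into URLs.
func nodeURLs(conf map[string]string) []string {
	return strings.Fields(conf["nodes"])
}

func main() {
	sample := `api_listen=127.0.0.1:9190
node_key=JupiterKey
client_key=changeme
nodes=http://127.0.0.1:9192 http://127.0.0.1:9193`

	conf := parseConf(sample)
	for i, url := range nodeURLs(conf) {
		fmt.Printf("node %d: %s\n", i+1, url)
	}
}
```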
By default, JupiterSearch extracts all the words and other information by running this regex against the data: [\w+.+_+@]{4,}
However, you can customize this by editing the regex found in /etc/JupiterSearch/tokenization_regex.
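Before changing the regex, it can help to see what the default one matches. The following sketch runs the default pattern through Go's regexp package on a made-up sample string; the function name tokenize is an assumption for this example and says nothing about how JupiterSearch structures its own code.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultTokenRegex is the default tokenization regex quoted above.
var defaultTokenRegex = regexp.MustCompile(`[\w+.+_+@]{4,}`)

// tokenize returns every match of the regex in data, in order.
func tokenize(data string) []string {
	return defaultTokenRegex.FindAllString(data, -1)
}

func main() {
	sample := "Contact admin@example.com about SN-12345 now"
	for _, tok := range tokenize(sample) {
		fmt.Println(tok)
	}
}
```

With this pattern, runs shorter than four characters (such as "now") are dropped, and characters outside the class (such as "-") split a token, so "SN-12345" yields only "12345". Keep that in mind if your serial numbers contain hyphens.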
There are two ways you can run JupiterNode and JupiterServer.
- As a service
- Command line
I recommend first running both on the command line with the --debug flag to make sure everything is working; after that, it is easier to run them as services.
JupiterServer:
JupiterServer --start --debug
JupiterNode:
JupiterNode --start --debug
JupiterServer:
systemctl start JupiterServer
JupiterNode:
systemctl start JupiterNode
Remember to start JupiterNode first: JupiterServer tries to connect to all the nodes listed in its config file, and it ignores any node it cannot reach.
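If you start the services from a script, a small pre-flight check can confirm that each node accepts connections before JupiterServer is launched. The sketch below assumes the node's REST API is reachable over plain TCP at its api_listen address; nodeReachable and waitForNodes are hypothetical helpers, not part of JupiterSearch.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// nodeReachable reports whether a TCP connection to addr
// can be opened within the given timeout.
func nodeReachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// waitForNodes polls each address up to `attempts` times, sleeping
// `delay` between tries, and fails on the first unreachable node.
func waitForNodes(addrs []string, attempts int, delay time.Duration) error {
	for _, addr := range addrs {
		ok := false
		for i := 0; i < attempts; i++ {
			if nodeReachable(addr, time.Second) {
				ok = true
				break
			}
			time.Sleep(delay)
		}
		if !ok {
			return fmt.Errorf("node %s not reachable after %d attempts", addr, attempts)
		}
	}
	return nil
}

func main() {
	// 127.0.0.1:9192 is the default api_listen value of JupiterNode.
	nodes := []string{"127.0.0.1:9192"}
	if err := waitForNodes(nodes, 3, 500*time.Millisecond); err != nil {
		fmt.Println("not ready:", err)
		return
	}
	fmt.Println("all nodes reachable; JupiterServer can be started")
}
```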
Unless you want to code a client yourself, using JupiterClient is a solid option for manually operating JupiterSearch.
JupiterClient syntax:
JupiterClient --server <master server url> --key <client_key> <arguments>
Example:
JupiterClient --server http://127.0.0.1:9190 --key 3ms9dk2lfhs83bf9s20 --upload movies.json
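If you only need to script the official client from a Go program, one option is to invoke the JupiterClient binary with the same flags as in the example above. This is a rough sketch that assumes JupiterClient is on your PATH; uploadCommand is a hypothetical helper, and only the --server, --key, and --upload flags shown above are used.

```go
package main

import (
	"fmt"
	"os/exec"
)

// uploadCommand builds (but does not run) a JupiterClient invocation
// that uploads the file at path to the given master server.
func uploadCommand(server, key, path string) *exec.Cmd {
	return exec.Command("JupiterClient",
		"--server", server,
		"--key", key,
		"--upload", path)
}

func main() {
	cmd := uploadCommand("http://127.0.0.1:9190", "changeme", "movies.json")
	// CombinedOutput runs the command and captures stdout and stderr.
	out, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Println("JupiterClient failed:", err)
	}
	fmt.Print(string(out))
}
```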
If you don't want to use JupiterClient, or you want to integrate JupiterSearch into your Go projects, you may be interested in creating your own client. Fortunately, this is very easy with the help of the JupiterSearch client library.
Read more here: https://github.com/R00tendo/JupiterSearch/wiki/Client-library-usage
JupiterSearch consists of three parts:
- The client
- The master server
- The node(s)
By client, I refer to any program that wants to store or query data from JupiterSearch. This could be the official client (JupiterClient) or another one that someone built using the client library.
The master server is the service clients interact with. It keeps track of all the nodes, removes inactive ones, and makes sure that the data is evenly spread out among all the nodes.
The node is the service that actually holds the data and can query it. It receives commands/requests from the master server and responds to them appropriately.
JupiterSearch uses Badger as its underlying database.
When the master server receives a document to be stored, this is what happens in the backend:
- Master server: Looks at the database sizes of all the nodes, picks the node with the smallest database, and forwards the request to it.
- Node: Stores the full document in the database with a unique ID.
- Node: Converts the document to lowercase, tokenizes it (gets all words from it), and removes duplicates.
- Node: Loops through all the words/usernames/emails and stores them with the ID of the full document.
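The steps above can be sketched in a few lines of Go. This is a simplified illustration, not the actual implementation: it uses in-memory maps where JupiterSearch uses Badger, a running byte count as the "database size", and sequential integers as document IDs. The node type and the pickSmallest and store functions are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenRegex is the default tokenization regex from this README.
var tokenRegex = regexp.MustCompile(`[\w+.+_+@]{4,}`)

// node is an in-memory stand-in for a JupiterNode.
type node struct {
	name  string
	size  int              // bytes stored; stands in for the database size
	docs  map[int]string   // document ID -> full document
	index map[string][]int // token -> IDs of documents containing it
}

func newNode(name string) *node {
	return &node{name: name, docs: map[int]string{}, index: map[string][]int{}}
}

// pickSmallest returns the node with the smallest database
// (the master server step), or nil if there are no nodes.
func pickSmallest(nodes []*node) *node {
	if len(nodes) == 0 {
		return nil
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.size < best.size {
			best = n
		}
	}
	return best
}

// store saves the full document under a new ID, then lowercases it,
// tokenizes it, removes duplicates, and maps each token to the ID.
func (n *node) store(doc string) int {
	id := len(n.docs) + 1
	n.docs[id] = doc
	n.size += len(doc)

	seen := map[string]bool{}
	for _, tok := range tokenRegex.FindAllString(strings.ToLower(doc), -1) {
		if seen[tok] {
			continue // duplicate token within this document
		}
		seen[tok] = true
		n.index[tok] = append(n.index[tok], id)
	}
	return id
}

func main() {
	nodes := []*node{newNode("node_a"), newNode("node_b")}
	docs := []string{"Hello hello admin@example.com", "Serial 12345 shipped"}
	for _, doc := range docs {
		target := pickSmallest(nodes)
		id := target.store(doc)
		fmt.Printf("stored document %d on %s\n", id, target.name)
	}
}
```

Because each token maps directly to document IDs, a lookup for an exact keyword is a single key read followed by fetching the matching documents.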