Resin is a remote
- vector database where the key is a 64-bit vector that may or may not translate into a string (it's up to you) and the payload is a list of Int64's. What the payload translates into is also your choice. At query time the key is either a fixed-length Int64 or a variable-length query expression. Each node in such an expression tree carries a fixed-length key and also defines an AND, OR or NOT set operation.
- VectorNode. With it you can define and then traverse a 64-bit wide vector space containing anything that is willing to be constrained by it. The payload of each node is a list of Int64's.
One application of such an architecture is a language model framework. Another is a string database. A third is a kind of search engine that lets you talk to your data using natural language or structured queries. Resin is at least those things.
You may install Resin in the cloud distributed onto many machines, each one carrying collections of collections and indices for each (analyzed) key in each collection, while running one central postings server. Or you can run it on your laptop.
Here is a non-exhaustive list of features.
- Create, append to and query document collections of any format (JSON format included out-of-the-box)
- Query in natural language or structured
- Create 1-n relationships, e.g. one utterance to many documents
- Create n-1 relationships, e.g. many utterances to one intent
- Create intent-based applications
- Create embeddings/language models from collections
- Build custom models in new vector spaces, based on previous models
- Plug in your own reader/writer filters
- Build digital conversationalists (e.g. chat bots, search engines, digital assistants)
All features are embeddable (by using Resin as a library) but also distributable (by talking to Resin over HTTP).
What you can do locally you can usually also do remotely.
Resin includes
- query collections of documents naturally or structured
- create new collections from slices of existing collections, slices that are defined by queries
- create new document collections
- query naturally/structured over HTTP with content type negotiation
Implement IReader or IWriter to run your own logic before/after a read/write.
To find documents where title is "Rambo" or "First Blood" but only if the genre isn't "books" you may use natural language or structured:
+(title:rambo title:first blood) -(genre:books)
Resin creates a vector space of words embedded as bags-of-characters. Variable length strings are encoded into a fixed-length Int64 vector space. Even though such a space can be computationally heavy, this type of embedding was chosen for its encoding speed and low CPU pressure at querying time. Read on to learn why it's fast.
Fast to encode, fast to query.
Supports fuzzy queries since it considers "the" to be the same word as "hte".
Supports wildcard queries since it considers "te" to be similar to the word "the".
It considers "the" to be the same word as "hte", i.e. it does not encode the order of characters, only their frequency.
Operations such as dot product and cosine similarity on vectors in this model are O(n), where n is the number of significant component pairs.
The word "pineapple" is represented as a sparse array that can carry as many components as there are Int64 values, each component the size of an Int32:
var pineapple = new SortedList<long, int>
{
    {(long)'p', 3}, // key: character code, value: frequency
    {(long)'i', 1},
    {(long)'n', 1},
    {(long)'e', 2},
    {(long)'a', 1},
    {(long)'l', 1}
};
"pen" is represented as:
var pen = new SortedList<long, int>
{
    {(long)'p', 1},
    {(long)'e', 1},
    {(long)'n', 1}
};
To compare them using linear algebra we need both vectors to be of the same width, and they are: both live in a space that is as wide as the Int64 key range. Thus we can compute on two vectors by pairing components by key.
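Pairing components by key can be sketched as follows. This is a minimal illustration, not Resin's actual implementation: the names `Embed`, `Dot` and `CosAngle` are hypothetical, and the merge-style walk over the sorted keys is one possible way to visit each component pair only once.

```csharp
using System;
using System.Collections.Generic;

// Embed a word as a bag-of-characters: key = character code, value = frequency.
// (Hypothetical helper, not part of Resin's API.)
SortedList<long, int> Embed(string word)
{
    var vector = new SortedList<long, int>();
    foreach (char c in word)
    {
        vector.TryGetValue(c, out int count); // count stays 0 when the key is new
        vector[c] = count + 1;
    }
    return vector;
}

// Dot product that pairs components by key. Both key lists are sorted,
// so a single merge-style pass visits each component at most once.
double Dot(SortedList<long, int> a, SortedList<long, int> b)
{
    double sum = 0;
    int i = 0, j = 0;
    while (i < a.Count && j < b.Count)
    {
        long keyA = a.Keys[i], keyB = b.Keys[j];
        if (keyA == keyB) { sum += (double)a.Values[i] * b.Values[j]; i++; j++; }
        else if (keyA < keyB) i++;
        else j++;
    }
    return sum;
}

// Cosine of the angle between two vectors; 0 when either vector is empty.
double CosAngle(SortedList<long, int> a, SortedList<long, int> b)
{
    double denominator = Math.Sqrt(Dot(a, a) * Dot(b, b));
    return denominator == 0 ? 0 : Dot(a, b) / denominator;
}

double sameBag = CosAngle(Embed("the"), Embed("hte"));      // identical bags of characters
double similar = CosAngle(Embed("te"), Embed("the"));       // overlapping, not identical
double words = CosAngle(Embed("pineapple"), Embed("pen"));

Console.WriteLine(sameBag);
Console.WriteLine(similar);
Console.WriteLine(words);
```

Because character order is not encoded, "the" and "hte" produce the same vector, which is why their cosine is 1.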
Consider a six-dimensional vector space.
"pineapple" has six significant components: [3][1][1][2][1][1]
"pen" has three significant components: [1][0][1][1][0][0]
"pineapple" - "pen" = "iapple" because [3][1][1][2][1][1] - [1][0][1][1][0][0] = [2][1][0][1][1][1]
"pineapple" + "pen" = "pineapplepen" because [3][1][1][2][1][1] + [1][0][1][1][0][0] = [4][1][2][3][1][1]
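The arithmetic above can be reproduced on the sparse representation. The sketch below is an assumption about how such an operation could look, not Resin's own code: `Combine` and `Describe` are hypothetical helpers, and dropping components that reach zero is a choice made here to keep the vector sparse.

```csharp
using System;
using System.Collections.Generic;

// Combine two sparse vectors component by component.
// sign is +1 for addition and -1 for subtraction. (Hypothetical helper.)
SortedList<long, int> Combine(SortedList<long, int> a, SortedList<long, int> b, int sign)
{
    var result = new SortedList<long, int>(a);
    foreach (var pair in b)
    {
        result.TryGetValue(pair.Key, out int current); // 0 when the key is absent
        int value = current + sign * pair.Value;
        if (value == 0) result.Remove(pair.Key);       // keep the vector sparse
        else result[pair.Key] = value;
    }
    return result;
}

// Render a vector as "char:frequency" pairs, ordered by key.
string Describe(SortedList<long, int> vector)
{
    var parts = new List<string>();
    foreach (var pair in vector) parts.Add($"{(char)pair.Key}:{pair.Value}");
    return string.Join(" ", parts);
}

var pineapple = new SortedList<long, int>
{
    {(long)'p', 3}, {(long)'i', 1}, {(long)'n', 1},
    {(long)'e', 2}, {(long)'a', 1}, {(long)'l', 1}
};
var pen = new SortedList<long, int>
{
    {(long)'p', 1}, {(long)'e', 1}, {(long)'n', 1}
};

string difference = Describe(Combine(pineapple, pen, -1));
string sum = Describe(Combine(pineapple, pen, +1));

Console.WriteLine(difference); // a:1 e:1 i:1 l:1 p:2  (the letters of "iapple")
Console.WriteLine(sum);        // a:1 e:3 i:1 l:1 n:2 p:4
```

The output is ordered by character code rather than by the column order used in the example, since SortedList keeps its keys sorted.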
With all embeddings aggregated as a VectorNode graph you have a model that forms clusters of similar words, each cluster carrying a payload that is a list of document IDs.
Natural language queries are parsed into expression trees with nodes carrying words and AND, OR or NOT set operations. The expression is serialized and executed (reduced) on a remote server, producing a set of IDs of documents that came from as many clusters as there are (distinct) additive terms in the query.
That set is sorted by score and a window defined by skip and take parameters is returned to the orchestrating server, which materializes the list of document IDs, i.e. reads and returns to the client a document stream formatted according to the HTTP client's "Accept" header.
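The reduce, sort and window steps can be sketched like this. It is a simplified illustration under several assumptions: the expression is flattened into a left-to-right list of clauses instead of a tree, the postings and scores are made-up sample data, and `Reduce` is a hypothetical name. The operator symbols follow the query language ('+' for AND, ' ' for OR, '-' for NOT).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reduce a flat list of clauses to one set of document IDs.
// The first clause seeds the set; each following clause is applied with its set operation.
HashSet<long> Reduce(IReadOnlyList<(char Op, long[] Postings)> clauses)
{
    var result = new HashSet<long>(clauses[0].Postings);
    for (int i = 1; i < clauses.Count; i++)
    {
        var (op, postings) = clauses[i];
        if (op == '+') result.IntersectWith(postings);   // AND
        else if (op == '-') result.ExceptWith(postings); // NOT
        else result.UnionWith(postings);                 // OR
    }
    return result;
}

// Sample postings (made up): title:rambo OR title:first blood, NOT genre:books.
var clauses = new List<(char Op, long[] Postings)>
{
    (' ', new long[] { 1, 2, 3 }), // title:rambo
    (' ', new long[] { 3, 4 }),    // title:first blood
    ('-', new long[] { 2, 4 }),    // genre:books
};

// Made-up scores for the documents that survive the reduction.
var scores = new Dictionary<long, double> { [1] = 0.4, [3] = 0.9 };

int skip = 0, take = 10;
long[] window = Reduce(clauses)
    .OrderByDescending(id => scores[id]) // highest score first
    .Skip(skip)
    .Take(take)
    .ToArray();

Console.WriteLine(string.Join(", ", window)); // 3, 1
```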
The model is a graph of documents embedded as bags-of-words. Documents cluster around topics.
Natural language queries are parsed into a tree of document-sized vectors. A cluster of documents is located by reducing the clause vectors to a single document vector using vector addition/subtraction, and then navigating the index graph by evaluating the cosine of the angle between the query and the clusters. The end result of the scan is a cluster ID that also corresponds to a postings list ID. If the topic is a big one, the result set will be large. If you've managed to pinpoint a shallow cluster your result set will be smaller.
Consider a five-dimensional vector space.
"I have a pineapple" has four significant components: [1][1][1][1][0]
"pineapple pen" has two significant components: [0][0][0][1][1]
"I have a pineapple" - "pineapple pen" = "I have a" or [1][1][1][0][0] (plain subtraction would leave -1 in the "pen" component; here it is treated as zero)
"I have a pineapple" + "pineapple pen" = "I have a pineapple pineapple pen" or [1][1][1][2][1]
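A bag-of-words lookup can be sketched in the same spirit. The code below makes several simplifying assumptions that are not taken from Resin: vectors are keyed by the word itself rather than by an Int64, each cluster is represented by a single sample document, and the closest cluster is found with a linear scan instead of a graph traversal. `Embed`, `Dot` and `CosAngle` are hypothetical helpers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Embed a text as a bag-of-words: key = lower-cased word, value = frequency.
Dictionary<string, int> Embed(string text)
{
    var vector = new Dictionary<string, int>();
    foreach (var word in text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        vector.TryGetValue(word, out int count); // 0 when the word is new
        vector[word] = count + 1;
    }
    return vector;
}

// Dot product over the words the two vectors share.
double Dot(Dictionary<string, int> a, Dictionary<string, int> b)
{
    double sum = 0;
    foreach (var pair in a)
        if (b.TryGetValue(pair.Key, out int other)) sum += (double)pair.Value * other;
    return sum;
}

double CosAngle(Dictionary<string, int> a, Dictionary<string, int> b)
{
    double denominator = Math.Sqrt(Dot(a, a) * Dot(b, b));
    return denominator == 0 ? 0 : Dot(a, b) / denominator;
}

// Made-up clusters: postings list ID -> vector of a representative document.
var clusters = new Dictionary<long, Dictionary<string, int>>
{
    { 1, Embed("I have a pineapple") },
    { 2, Embed("pineapple pen") },
};

// Pick the cluster whose vector has the smallest angle to the query.
var query = Embed("a pen");
long best = clusters.OrderByDescending(c => CosAngle(query, c.Value)).First().Key;

Console.WriteLine(best); // 2
```

The returned cluster ID is what would then be used to fetch the postings list, i.e. the document IDs connected to that topic.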
We want a document vector space because we want to represent each document once and only once, so that with a single scan of the index we can find a list of document IDs connected to a topic.
Download a clone of this repository, launch the solution in Visual Studio to build and publish it. Then create an IIS site that points to [path_of_repository]/src/publish. Make sure the app pool type is "unmanaged".
Read below how to create document collections. Use your favorite HTTP client to create a collection from an array of JSON documents. Read on to learn about querying your data, how to slice and then re-model it.
Come back to this page and register any bugs or issues you might find.
To create collections from your favorite data you may host one of these servers yourself, privately or publicly, or you can use a free search cloud.
HTTPS POST didyougogo.com/io/[collection_name]
Content-Type:application/json
[
{
"field1":"value1"
},
{
"field1":"value2"
}
]
[
1,
2
]
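From C#, the write request above could be prepared as in the following sketch. The collection name `mycollection` is a placeholder, and treating the `[1, 2]` array as the response body is an assumption based on its position after the request. Sending is left commented out so the example runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Two documents, mirroring the JSON array in the request above.
var documents = new[]
{
    new Dictionary<string, string> { ["field1"] = "value1" },
    new Dictionary<string, string> { ["field1"] = "value2" },
};
string json = JsonSerializer.Serialize(documents);

// "mycollection" is a hypothetical collection name.
var request = new HttpRequestMessage(HttpMethod.Post, "https://didyougogo.com/io/mycollection")
{
    Content = new StringContent(json, Encoding.UTF8, "application/json")
};

Console.WriteLine($"{request.Method} {request.RequestUri}");
Console.WriteLine(json);

// To actually send the request:
// using var client = new HttpClient();
// var response = await client.SendAsync(request);
// string body = await response.Content.ReadAsStringAsync();

// Assumption: the server answers with a JSON array of numbers such as the one shown above.
var ids = JsonSerializer.Deserialize<long[]>("[1, 2]");
Console.WriteLine(string.Join(", ", ids));
```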
HTTPS GET didyougogo.com/io/[collection_name]?id=[document_id]
Accept:application/json
HTTPS GET didyougogo.com/io/[collection_name]?q=[phrase-or-term]&fields=title&skip=0&take=10
Accept:application/json
HTTPS GET didyougogo.com/io/[collection_name]?qf=[structured_query]&skip=0&take=10
Accept:application/json
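Query strings need URL encoding, in particular structured queries, which contain characters such as '+', ':' and spaces. The sketch below builds both kinds of read request. `BuildQueryUrl` is a hypothetical helper and `movies` is a placeholder collection name; only the parameter names (`q`, `qf`, `fields`, `skip`, `take`) come from the requests above.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

// Build a read URL. parameter is "q" for natural language or "qf" for a structured query.
string BuildQueryUrl(string host, string collection, string parameter, string query,
                     int skip, int take, string fields = "")
{
    string url = $"https://{host}/io/{Uri.EscapeDataString(collection)}" +
                 $"?{parameter}={Uri.EscapeDataString(query)}";
    if (fields.Length > 0) url += $"&fields={Uri.EscapeDataString(fields)}";
    return url + $"&skip={skip}&take={take}";
}

string natural = BuildQueryUrl("didyougogo.com", "movies", "q", "first blood", 0, 10, "title");
string structured = BuildQueryUrl("didyougogo.com", "movies", "qf",
    "+(title:rambo title:first blood) -(genre:books)", 0, 10);

// Ask for JSON through the Accept header, as in the requests above.
var request = new HttpRequestMessage(HttpMethod.Get, natural);
request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));

Console.WriteLine(natural); // https://didyougogo.com/io/movies?q=first%20blood&fields=title&skip=0&take=10
Console.WriteLine(structured);
```

Escaping matters most for the leading '+' of a structured query, which would otherwise be decoded as a space by many servers.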
HTTPS GET didyougogo.com/?q=[phrase-or-term-query]&fields=title&skip=0&take=10&collection=[collection_name]
HTTPS GET didyougogo.com/queryparser/?q=[phrase-or-term-query]&qf=[structured_query]&fields=title&skip=0&take=10&collection=[collection_name]
HTTPS POST [hostname]/io/[collection_name]
Content-Type:[IWriter.ContentType]
Custom data payload
HTTPS GET [hostname]/io/[collection_name]?[Custom query payload]
Content-Type:[IReader.ContentType]
or when you have a larger query payload
HTTPS PUT [hostname]/io/[collection_name]
Content-Type:[IReader.ContentType]
Custom query payload
Plug in your custom read and write filters here.
https://github.com/kreeben/resin/tree/master/src/Sir.HttpServer
Execute AND, OR and NOT set operations over local lists of Int64's.
https://github.com/kreeben/resin/tree/master/src/Sir.Postings
Database and search index. Orchestrates remote set operations.
https://github.com/kreeben/resin/tree/master/src/Sir.Store
- v0.1a - bag-of-characters term vector space language model
- v0.2a - HTTP API comprised of distributable search microservices
- v0.3a - boolean query language with support for AND ('+'), OR (' '), NOT ('-') and scope ('(', ')').
- v0.4b - bag-of-words document vector space language model
- v0.5 - semantic language model
- v0.6 - local join between collections
- v0.7 - private online collections
- v0.8 - join (orchestrate) over private/public collections
- v0.9 - add support for voice models
- v1.0 - add support for image models
- v2.0 - implement text/image-model-to-voice
- v2.1 - implement text/voice-model-to-image
- v2.2 - implement image/voice-model-to-text
- v3.0 - AI