openacid / slim

Surprisingly space-efficient trie in Golang (11 bits/key; 100 ns/get).

Home Page: https://openacid.github.io/


Organising the Key Offsets

vasundhara785 opened this issue · comments

Can the Trie Lookup happen without organizing the key offsets?

I want an implementation similar to a Redis cache: load all the keys into the trie once, and then lookups should be possible from any other application as well.

In the existing implementation, looking up a key requires first loading the key offsets.

Example:
App1 loads the keys and stores them in a trie object.
App2 should only look up a key, and should not have to load all the key offsets again.

I do not quite understand the term key offset.

You can just store an array of keys with slim, without values. Just pass nil as the values argument:

func NewSlimTrie(e encode.Encoder, keys []string, values interface{}, opts ...Opt) (*SlimTrie, error) {

The feature I need from the trie is:

Application 1 should load all the keys with their offsets along with data.
st, err := index.NewSlimIndex(keyOffsets, data)

Application 2 should just look up a key without actually knowing all keys and values, but initialising the trie object requires all keys and values: st, err := index.NewSlimIndex(keyOffsets, data)

Is it possible to initialise the trie object without actually loading all the keys and data?

We are going to store 500 million key-value pairs, so having every application load all the keys and values to initialise the trie object is an overhead. Creating the whole trie every time, for every version, keeps doubling memory usage and hurts space consumption badly.

Thanks for the explanation!

Is it possible to initialise the trie object without actually loading all the keys and data?

If what you want is to build a sparse index, e.g. to create an index for every 5 items, slimtrie provides a Range mode:

  • Choose 1 key/value for every 5 key/values in your dataset. Build a slice.
  • Build slimtrie from this slice with the option Complete: Bool(true). This eliminates false positives for range-get queries, e.g. searching for an item whose key falls in the range [1000, 1200).

st, err := NewSlimTrie(encode.I32{}, keys, values, Opt{Complete: Bool(true)})
ta.NoError(err)
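The sampling step above can be sketched in plain Go. `sampleEveryN` is a hypothetical helper, not part of slim:

```go
package main

import "fmt"

// sampleEveryN keeps the first key out of every group of n sorted keys,
// producing the range-start keys for a sparse index.
func sampleEveryN(keys []string, n int) []string {
	sampled := make([]string, 0, len(keys)/n+1)
	for i := 0; i < len(keys); i += n {
		sampled = append(sampled, keys[i])
	}
	return sampled
}

func main() {
	keys := []string{"aa", "ab", "ac", "ad", "ae", "ba", "bb"}
	fmt.Println(sampleEveryN(keys, 5)) // [aa ba]
}
```

The sampled keys then go to NewSlimTrie together with one value per range (e.g. a block offset into your full dataset) and Opt{Complete: Bool(true)}, as in the snippet above.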

Reference:

// RangeGet look for a range that contains a key in SlimTrie.
//
// A range that contains a key means range-start <= key <= range-end.
//
// It returns the value the range maps to, and a bool indicate if a range is
// found.
//
// A positive return value does not mean the range absolutely exists, which in
// this case, is a "false positive".
//
// Since 0.4.3
func (st *SlimTrie) RangeGet(key string) (interface{}, bool) {

E.g., with a slimtrie built from the following keys:

{
"aa": 1,
"az": 2,
"cd": 3,
}
  • RangeGet("a") returns nil, false
  • RangeGet("ab") returns 1, true
  • RangeGet("az") returns 2, true
  • RangeGet("azz") returns 2, true
  • RangeGet("b") returns 2, true
  • RangeGet("d") returns 3, true

@drmingdrmer What would be the CPU, memory, and time cost of loading and initialising the trie object with 1 lakh (100,000) unique key-value pairs, up to this line: st, err := index.NewSlimIndex(keyOffsets, data)?

I'm not very sure about the memory cost.

For cpu cost, BenchmarkNewSlimTrie should tell you.

There is a benchmark with a similar setup to your case:

BenchmarkNewSlimTrie/200kweb2-4          1000000              1002 ns/op

Building a slimtrie from 2 lakh words collected from the web takes about 1 µs per key, i.e., about 100 milliseconds for 1 lakh words.

But the performance varies with different key sets. You may want to benchmark it yourself. :D

But this wasn't the case for me. It was taking around 30 seconds for 1 lakh words. Please advise.

Attaching the sample code and key value.
ild.go.txt
kv.csv.txt

May I have your complete csv file for a test?

I mean, a test with at least 1 Lakh of lines.

kv.csv.txt

Here is the test csv file

This is quite small. Did you mean that creating a slimtrie from this 142-byte file takes 30 seconds??

I may need your 1 Lakh words file to see what takes so much time.

Yes, creating a slim trie object from the csv file was taking around 30 seconds. The file contains only the key (a 12-digit number) and the value (2 to 3 characters) used to create the slim trie object. You can refer to the attached sample program as well.

With the file you provided it takes only 0.6 seconds.
The file kv.csv.txt is quite small, I do not know why you said it takes 30 seconds 🤔

time go run ild.go
Status  false 88.153µs

real    0m0.667s
user    0m0.524s
sys     0m0.258s

My system has a 2-core CPU and 1 GB RAM; I am testing under tight resource limits. What are your processor specs?

Is the trie implementation persistent storage?

I mainly test slimtrie on my iMac, 3.8Ghz core i5 4 cores.

No. Creation is a purely in-memory operation.

Maybe you could profile a benchmark run on your machine to see what costs most of the time: go test ./... -cpuprofile prof.cpu -memprofile prof.mem -bench=. -run=none
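After the benchmark writes the profiles, the standard pprof tool can summarize where the time and memory go (paths assume you run it from the package directory):

```shell
# Top functions by CPU time
go tool pprof -top prof.cpu

# Top allocation sites
go tool pprof -top prof.mem
```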

And I still can not believe that a 7 lines input file takes that much time.

wc kv.csv.txt
       7       8     142 kv.csv.txt