Organising the Key Offsets
vasundhara785 opened this issue · comments
Can the Trie Lookup happen without organizing the key offsets?
I want an implementation similar to a Redis cache: load all the keys into a trie once, and then let any other application perform lookups against it. In the existing implementation, a lookup of a key needs to load the key offsets first.
Example:
App1 needs to load the keys and store them in a trie object.
App2 should only look up a key, and should not have to load all the key offsets again.
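The App1/App2 split described above can be sketched with only the standard library: one process builds the index and serializes it, the other deserializes and looks keys up without rebuilding anything. Here `encoding/gob` and a plain map stand in for a serialized slimtrie index; the names `KeyOffsets`, `app1Build`, and `app2Lookup` are illustrative, not part of slim's API.

```go
// Sketch of the desired two-application workflow, using encoding/gob
// as a stand-in for a serialized slimtrie index.
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// KeyOffsets is a hypothetical index: key -> byte offset into the data file.
type KeyOffsets map[string]int64

// app1Build builds the index once and serializes it for other apps.
func app1Build() []byte {
	idx := KeyOffsets{"key1": 0, "key2": 128, "key3": 512}
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(idx); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// app2Lookup deserializes the prebuilt index and looks up a key
// without re-reading all key offsets from the data source.
func app2Lookup(serialized []byte, key string) (int64, bool) {
	var idx KeyOffsets
	if err := gob.NewDecoder(bytes.NewReader(serialized)).Decode(&idx); err != nil {
		panic(err)
	}
	off, ok := idx[key]
	return off, ok
}

func main() {
	blob := app1Build() // App1: build once, then share the bytes
	off, ok := app2Lookup(blob, "key2")
	fmt.Println(off, ok)
}
```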
I do not quite understand the term key offset.
You can just store an array of keys with slim, without values. Just pass a nil to the value argument:
Line 192 in d27f7e9
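A keys-only index as suggested above amounts to a membership test. As a standard-library analogue (not slim's implementation), a sorted slice plus binary search gives the same behaviour:

```go
// Stdlib analogue of a keys-only index: a sorted key slice and a
// binary-search membership test.
package main

import (
	"fmt"
	"sort"
)

// hasKey reports whether k is present in sortedKeys (which must be sorted).
func hasKey(sortedKeys []string, k string) bool {
	i := sort.SearchStrings(sortedKeys, k)
	return i < len(sortedKeys) && sortedKeys[i] == k
}

func main() {
	keys := []string{"aa", "az", "cd"} // must be sorted
	fmt.Println(hasKey(keys, "az")) // true
	fmt.Println(hasKey(keys, "ab")) // false
}
```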
The feature I need from the trie is:
Application 1 should load all the keys with their offsets, along with the data.
st, err := index.NewSlimIndex(keyOffsets, data)
Application 2 should just look up a key without actually knowing all the keys and values, but initialising the trie object requires all keys and values: st, err := index.NewSlimIndex(keyOffsets, data)
Is it possible to initialise the trie object without actually loading all the keys and data?
As we are going to store 500 million key-value pairs, having every application load all the keys and values to initialise the trie object is an overhead. Creating the whole trie every time, for every version, keeps doubling memory use and hurts the space complexity very badly.
Thanks for the explanation!
Is it possible to initialise the trie object without actually loading all the keys and data?
If what you want is to build a sparse index, e.g. to create an index entry for every 5 items, slimtrie provides a Range mode:
- Choose 1 key/value for every 5 key/values in your dataset. Build a slice.
- Build slimtrie from this slice with the option Complete: Bool(true). This eliminates the false positives for a range-get query, e.g. searching for an item of a key in range [1000, 1200).
slim/trie/slimtrie_complete_test.go
Lines 28 to 29 in d27f7e9
- Query slimtrie with RangeGet: Line 985 in d27f7e9
Reference:
Lines 64 to 75 in d27f7e9
E.g., with a slimtrie built from the following keys:
{
"aa": 1,
"az": 2,
"cd": 3,
}
RangeGet("a") returns nil, false
RangeGet("ab") returns 1, true
RangeGet("az") returns 2, true
RangeGet("azz") returns 2, true
RangeGet("b") returns 2, true
RangeGet("d") returns 3, true
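The semantics shown above, returning the value of the greatest key that is less than or equal to the query (and a miss when the query sorts before every key), can be modelled with a binary search over the sorted keys. This is a standard-library model of the behaviour, not slimtrie's implementation:

```go
// Model of range-get over sorted keys: return the value of the
// greatest key <= q, or (0, false) if q sorts before every key.
package main

import (
	"fmt"
	"sort"
)

var keys = []string{"aa", "az", "cd"} // sorted
var vals = []int{1, 2, 3}

func rangeGet(q string) (int, bool) {
	i := sort.SearchStrings(keys, q) // first index with keys[i] >= q
	if i < len(keys) && keys[i] == q {
		return vals[i], true // exact hit
	}
	if i == 0 {
		return 0, false // q sorts before every key
	}
	return vals[i-1], true // q falls in the range starting at keys[i-1]
}

func main() {
	for _, q := range []string{"a", "ab", "az", "azz", "b", "d"} {
		v, ok := rangeGet(q)
		fmt.Printf("rangeGet(%q) = %d, %v\n", q, v, ok)
	}
}
```

Running this reproduces the table above: "a" misses, "ab" falls into the range of "aa", and "d" falls into the range of "cd".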
@drmingdrmer What would be the CPU, memory, and time cost of loading and initialising the trie object with 1 lakh (100,000) unique key-value pairs, up to st, err := index.NewSlimIndex(keyOffsets, data)?
I'm not very sure about the memory cost.
For the CPU cost, BenchmarkNewSlimTrie should tell you.
There is a benchmark with a similar setup to your case:
BenchmarkNewSlimTrie/200kweb2-4 1000000 1002 ns/op
Building a slimtrie from 2 lakh words collected from the web takes 1 µs per key, i.e., 100 milliseconds for 1 lakh words.
But the performance varies with different key sets. You may like to benchmark it yourself :DDD
But this wasn't the case for me; it was taking around 30 seconds for 1 lakh words. Please suggest.
Attaching the sample code and key value.
ild.go.txt
kv.csv.txt
May I have your complete csv file for a test?
I mean, a test with at least 1 Lakh of lines.
Here is the test csv file
This is quite small. Did you mean that creating a slimtrie from this 142-byte file takes 30 seconds??
I may need your 1 Lakh words file to see what takes so much time.
Yes, creating a slim trie object from the csv file was taking around 30 seconds. The file contains only the key (a 12-digit number) and the value (2 to 3 characters) used to create the slim trie object. You can refer to the attached sample program as well.
With the file you provided it takes only 0.6 seconds.
The file kv.csv.txt is quite small; I do not know why you said it takes 30 seconds 🤔
time go run ild.go
Status false 88.153µs
real 0m0.667s
user 0m0.524s
sys 0m0.258s
My system has a 2-core processor and 1 GB of RAM, so I am testing under tight system limitations. What are your processors?
Is the trie implementation persistent storage?
I mainly test slimtrie on my iMac, 3.8Ghz core i5 4 cores.
No. Creating it is a purely in-memory operation.
Maybe you could have a profile on your machine to see what costs most of the time with a benchmark: go test ./... -cpuprofile prof.cpu -memprofile prof.mem -bench=. -run=none
And I still cannot believe that a 7-line input file takes that much time.
wc kv.csv.txt
7 8 142 kv.csv.txt