rosedblabs / rosedb

Lightweight, fast and reliable key/value storage engine based on Bitcask.

Home Page: https://rosedblabs.github.io

Iterating over all KVs by their insertion order

amityahav opened this issue · comments

Following this, where I explained that there may be a lot of random I/O, I suggest adding functionality to iterate over all the KVs in the DB more efficiently.
My proposal is to modify the B-tree implementation of the in-memory index so that all KV pairs are chained in a doubly linked list. The list is constructed at insertion time: each newly inserted KV pair is linked to the previously inserted one (also handling the overwrite cases). We also keep two pointers, to the head and the tail of the list, which lets the user iterate in ascending/descending insertion order. With that in place, reads from the WAL files during iteration become much more sequential, which makes better use of the kernel's page cache as well as the WAL's application-level blockCache. A minimal sketch of the idea follows (all type and field names are mine, not rosedb's):

package index

import "github.com/google/btree"

// item is one index entry: the key, its position in the WAL, and the
// insertion-order links. (In rosedb the position would be a
// wal.ChunkPosition; interface{} keeps the sketch self-contained.)
type item struct {
	key        []byte
	pos        interface{}
	prev, next *item
}

// Less keeps the usual lexicographic ordering for the btree.
func (it *item) Less(other btree.Item) bool {
	return string(it.key) < string(other.(*item).key)
}

type memIndex struct {
	tree       *btree.BTree
	head, tail *item // oldest and newest entries by insertion time
}

func newMemIndex() *memIndex {
	return &memIndex{tree: btree.New(32)}
}

// put inserts or overwrites a key and appends it to the tail of the
// insertion-order chain, unlinking the stale entry on overwrite.
func (idx *memIndex) put(it *item) {
	if old := idx.tree.ReplaceOrInsert(it); old != nil {
		idx.unlink(old.(*item))
	}
	it.prev = idx.tail
	if idx.tail != nil {
		idx.tail.next = it
	}
	idx.tail = it
	if idx.head == nil {
		idx.head = it
	}
}

func (idx *memIndex) unlink(it *item) {
	if it.prev != nil {
		it.prev.next = it.next
	} else {
		idx.head = it.next
	}
	if it.next != nil {
		it.next.prev = it.prev
	} else {
		idx.tail = it.prev
	}
}

// descendByInsertion visits entries newest-first. Because WAL records were
// appended in the same order, the resulting reads are close to sequential,
// which is what lets the page cache and blockCache do their job.
func (idx *memIndex) descendByInsertion(fn func(it *item) error) error {
	for it := idx.tail; it != nil; it = it.prev {
		if err := fn(it); err != nil {
			return err
		}
	}
	return nil
}

We are using the google/btree package; I think it would be a little difficult to change the btree code.

#265 only asks for iterating all keys, so we can just call Ascend to get all keys directly; it will be efficient since they are all in memory.
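Something like this, for example (a sketch; it assumes db is an opened *rosedb.DB, and the handler's bool return value controls whether iteration continues):

// Collect all keys via Ascend. Note that Ascend also fetches each value
// from the WAL, which is the cost discussed in the next reply.
var keys [][]byte
db.Ascend(func(k []byte, v []byte) (bool, error) {
	keys = append(keys, k)
	return true, nil // true: keep iterating
})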

Currently the Ascend function also reads the values from the WAL, but if you omit that it is indeed fast.
But for the use case where you actually do want to iterate over all KV pairs, maybe it's worth the effort to optimize its performance. It will probably require an in-house implementation of the B-tree, or at least a fork of Google's. Let me know what you think.

Yes, if we want to iterate all k/v pairs in the db, it will read the WAL to get the data.

Hmm. But I am not sure whether anyone needs it; maybe we can decide based on actual demand.

I think I'll come up with a POC for that anyway, just to prove to myself that it's indeed faster.

@roseduan a use case where I find this feature useful is the merge operation: instead of iterating over all of the WAL files, you can iterate only over the valid values efficiently, which will reduce the merge time when the WAL files are large.

Oh, why? If we iterate the btree to get all values, it will block reads and writes.

Yes, you are right, I totally missed that.

Also, if you merge while writes are happening concurrently, you only merge a snapshot in time, and there is new data that is not part of the merge process and needs to be appended to the new data files. How do you handle that?

The data newly added while merging will be treated as normal WAL records, and it will be loaded from the WAL to rebuild the index.

e.g.
I merge segments 1 2 3 4
and generate hint files 1 2,
and the new segment files written while merging are 5 6.

So when restarting, build the index from hint files 1 2 first,
then load all the data from 5 6 and build the index as normal.
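In other words, something like this (an illustrative sketch; none of these names are rosedb's actual identifiers):

package sketch

type segmentID uint32

type store struct {
	// firstUnmergedSegID is the first segment written after the merge
	// started (5 in the example above); everything before it is covered
	// by the hint file.
	firstUnmergedSegID segmentID
}

// loadIndexFromHintFile replays (key -> WAL position) pairs from the hint
// file; no values are read, so this is fast.
func (s *store) loadIndexFromHintFile() error { return nil }

// loadIndexFromWAL replays ordinary records starting at the given segment.
// Newer records overwrite any stale index entries from the hint file.
func (s *store) loadIndexFromWAL(from segmentID) error { return nil }

func (s *store) rebuildIndex() error {
	// 1. Merged data: rebuild from the hint file only.
	if err := s.loadIndexFromHintFile(); err != nil {
		return err
	}
	// 2. Segments written while merging (5, 6): normal WAL replay.
	return s.loadIndexFromWAL(s.firstUnmergedSegID)
}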

So I've implemented a working B-tree that respects insertion order for arbitrary keys, done by adding a small amount of code to the wrapper type rather than to Google's implementation. I then created a simple benchmark comparing the existing Ascend performance against the new DescendByInsertion with 600k KV pairs in the DB. In most of my tests DescendByInsertion was at least 5x faster with the blockCache enabled; with it disabled, the performance is about the same.
So let me know if you would be interested in such a feature. @roseduan

package benchmark

import (
	"fmt"
	"testing"
	"time"

	"github.com/rosedblabs/rosedb/v2"
	"github.com/rosedblabs/rosedb/v2/utils"
)

func Benchmark_Descend(b *testing.B) {
	options := rosedb.DefaultOptions
	options.BlockCache = 32 * 1024 * 10
	options.DirPath = "./tmp/rosedb"

	db, err := rosedb.Open(options)
	if err != nil {
		panic(err)
	}

	// Seed the DB once with 600k random pairs, then keep this commented
	// out so later runs reuse the same data files.
	//for i := 0; i < 600000; i++ {
	//	key := utils.RandomKey(10)
	//	value := utils.RandomValue(1024)
	//	_ = db.Put(key, value)
	//}

	// CacheHits/CacheMiss are counters added to utils for this experiment.
	utils.CacheHits = 0
	utils.CacheMiss = 0
	b.Run("descend", func(b *testing.B) {
		b.ResetTimer()
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			utils.CacheHits = 0
			utils.CacheMiss = 0
			t := time.Now()
			_ = db.DescendByInsertion(func(k []byte, v []byte) error {
				return nil
			})
			fmt.Printf("completed in %d ms\n", time.Since(t).Milliseconds())
			fmt.Printf("cache hits: %d\ncache misses: %d\n", utils.CacheHits, utils.CacheMiss)
		}
	})
	db.Close()

	//Benchmark_Descend/descend
	//completed in 1095 ms
	//cache hits: 598998
	//cache misses: 20211
	//Benchmark_Descend/descend-10    1	1095058791 ns/op	2131705808 B/op	 3059916 allocs/op

	// Reopen so the ascend run starts with a cold block cache.
	db, err = rosedb.Open(options)
	if err != nil {
		panic(err)
	}
	defer db.Close()
	utils.CacheHits = 0
	utils.CacheMiss = 0
	b.Run("ascend", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			t := time.Now()
			db.Ascend(func(k []byte, v []byte) (bool, error) {
				return true, nil
			})
			fmt.Printf("completed in %d ms\n", time.Since(t).Milliseconds())
			fmt.Printf("cache hits: %d\ncache misses: %d\n\n", utils.CacheHits, utils.CacheMiss)
			utils.CacheHits = 0
			utils.CacheMiss = 0
		}
	})

	//Benchmark_Descend/ascend
	//completed in 5139 ms
	//cache hits: 295
	//cache misses: 618914

	//Benchmark_Descend/ascend-10      1	5139481083 ns/op	21792695688 B/op	 4271704 allocs/op
}

DescendByInsertion means the keys are sorted in insertion order, not lexicographically?
If so, I am not sure whether anyone needs the feature.

It's by latest insertion, but it's basically a faster way of iterating all keys and values in general.

Thanks. In my experience, most users probably want to iterate the keys in lexicographical order; we can consider this when someone needs it.

Thanks anyway.