Why not use libpmemobj?

jinhao2 opened this issue 2 years ago · comments

Bug Report

Why not use libpmemobj?
I do not think KVDK can support atomic operations using only the APIs in libpmem.
batchWrite can write multiple KV pairs one by one, but it cannot roll back if a write fails.
DLinked_list cannot update its prev/next pointers in one transaction.
But KVDK advertises "Provide APIs to write multiple key-value pairs in an atomic batch."
Is something wrong?

peifeng si · Answer 1 · Tue Mar 15 2022 16:30:43 GMT+0800 (China Standard Time)

Thank you for your question.

Yes, you're right that using libpmemobj for atomic or transactional operations in KVDK would be a possible and simple approach. For performance reasons, however, we decided to implement them within KVDK itself to reduce the overhead of redo and undo actions.

The current code supports batch write only for the string value type, and it can roll back if the server crashes before all data in a batch has been successfully written.

FYI. We are now working on batch write for all other data types, including DLinked_list, and will implement a pessimistic transaction based on the batch write. The estimated PR date is in May.

jin.hao1 · Answer 2 · Tue Mar 15 2022 19:44:00 GMT+0800 (China Standard Time)

Thank you for your answer.

Beyond batch write, I also cannot understand how SSetImpl() achieves atomicity.
SSetImpl() first allocates PMem space and persists a data entry in it, then persists that entry into the DLink in the skiplist.
There are several persist operations in total. If the server crashes between them, how can it roll back to a clean state?

peifeng si · Answer 3 · Tue Mar 15 2022 20:09:43 GMT+0800 (China Standard Time)

As I mentioned, in the current code we only support batch write for string type (you may check out the BatchWrite() API for details).

Regarding SSetImpl(), it is used for inserting a key-value pair into a sorted collection (internally a skiplist). We will add batch write for it soon, but development is still ongoing.

We plan to have full atomic and transaction support for all data types in May. You are welcome to try it then and give us your suggestions :-)

jin.hao1 · Answer 4 · Tue Mar 15 2022 20:53:10 GMT+0800 (China Standard Time)

OK, Thanks.

jin.hao1 · Answer 5 · Wed Mar 16 2022 08:47:31 GMT+0800 (China Standard Time)

Hi, KVKIT.
I am interested in your description of implementing full atomicity and transactions in KVDK without libpmemobj.
Only 8 bytes can be written atomically without transaction logs, so how is the batch write implemented?
Can you give me some clues?

peifeng si · Answer 6 · Wed Mar 16 2022 09:24:28 GMT+0800 (China Standard Time)

Hi Jinhao,

In brief, for all KVs in a batch write, we write them to PMEM one by one, and at the same time we record their addresses and the batch status in a dedicated place in PMEM. During recovery, if the status shows that only part of the KVs in a batch was written, we ignore them and reclaim their PMEM space. For more details, please refer to the code that uses struct PendingBatch.

// Used to record batch write stage and related records address, this should be
// persisted on PMem
//
// The stage of a processing batch write will be Processing, the stage of a
// initialized pending batch file or a finished batch write will be Finish
//
// Layout: batch write stage | num_kv in writing | timestamp of this batch write
// | record address
struct PendingBatch {
  enum class Stage {
    Finish = 0,
    Processing = 1,
  };

  PendingBatch(Stage s, uint32_t nkv, TimeStampType ts)
      : stage(s), num_kv(nkv), timestamp(ts) {}

  // Mark batch write as process and record writing offsets.
  // Make sure the struct is on PMem and there is enough space followed the
  // struct to store record
  void PersistProcessing(const std::vector<PMemOffsetType>& record,
                         TimeStampType ts);

  // Mark batch write as finished.
  void PersistFinish();

  bool Unfinished() { return stage == Stage::Processing; }

  Stage stage;
  uint32_t num_kv;
  TimeStampType timestamp;
  PMemOffsetType record_offsets[0];
};
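
To make the recovery idea above concrete, here is a minimal in-memory sketch. It is not KVDK's actual implementation: SimRecord, SimPendingBatch, SimStore and the crash_after parameter are hypothetical names invented for illustration, ordinary vectors stand in for PMem, and no real persist or flush calls are made. It also assumes the offsets and the Processing stage are recorded before any record is written, and it only handles appended records, not overwrites.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one record slot in PMem.
struct SimRecord {
  std::string key;
  std::string value;
  bool valid = false;  // false means the slot is free or reclaimed
};

// Simplified analogue of PendingBatch: a stage flag plus the offsets
// of every record that belongs to the in-flight batch.
struct SimPendingBatch {
  enum class Stage { Finish = 0, Processing = 1 };
  Stage stage = Stage::Finish;
  std::vector<std::size_t> record_offsets;
};

struct SimStore {
  std::vector<SimRecord> records;  // simulated persistent record space
  SimPendingBatch pending;         // simulated persistent batch log

  // Writes a batch. If crash_after < kvs.size(), the function returns early
  // after writing crash_after records, simulating a crash mid-batch.
  bool BatchWrite(
      const std::vector<std::pair<std::string, std::string>>& kvs,
      std::size_t crash_after) {
    // Step 1: reserve slots, log their offsets, mark the batch Processing.
    pending.record_offsets.clear();
    for (std::size_t i = 0; i < kvs.size(); ++i) {
      pending.record_offsets.push_back(records.size());
      records.emplace_back();
    }
    pending.stage = SimPendingBatch::Stage::Processing;

    // Step 2: write the records one by one.
    for (std::size_t i = 0; i < kvs.size(); ++i) {
      if (i == crash_after) return false;  // simulated crash
      SimRecord& r = records[pending.record_offsets[i]];
      r.key = kvs[i].first;
      r.value = kvs[i].second;
      r.valid = true;
    }

    // Step 3: a single small state change commits the whole batch.
    pending.stage = SimPendingBatch::Stage::Finish;
    return true;
  }

  // Run at startup: discard every record of an unfinished batch.
  void Recover() {
    if (pending.stage != SimPendingBatch::Stage::Processing) return;
    for (std::size_t off : pending.record_offsets) {
      assert(off < records.size());
      records[off] = SimRecord{};  // reclaim the slot
    }
    pending.stage = SimPendingBatch::Stage::Finish;
  }

  std::size_t CountValid() const {
    std::size_t n = 0;
    for (const SimRecord& r : records) n += r.valid ? 1 : 0;
    return n;
  }
};
```

In this sketch, calling BatchWrite with crash_after smaller than the batch size and then calling Recover() leaves none of that batch's records valid. The only update that must be atomic is the stage flag, which fits within the 8-byte limit raised in the question.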