askorama / orama

🌌 Fast, dependency-free, full-text and vector search engine with typo tolerance, filters, facets, stemming, and more. Works with any JavaScript runtime, browser, server, service!

Home Page: https://docs.orama.com


Cannot create a string longer than 0x1fffffe8 characters when using data-persistence in server

imertz opened this issue · comments

Describe the bug

When trying to persist a large amount of data using the persistToFile function, Node.js throws an error: Cannot create a string longer than 0x1fffffe8 characters. This is due to the V8 engine's hard limit on string length (0x1fffffe8 = 536,870,888 characters, roughly 512 MiB of one-byte characters).
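For reference, the limit itself can be reproduced without Orama: converting a Buffer of more than 0x1fffffe8 bytes to a string throws the same error. A minimal sketch using only Node built-ins:

```typescript
// Node's V8 engine caps string length at 0x1fffffe8 (~536 million) characters.
const LIMIT = 0x1fffffe8;

// One byte past the limit; 'latin1' maps each byte to one character,
// so decoding this buffer would need a string longer than V8 allows.
const oversized = Buffer.alloc(LIMIT + 1);

let message = '';
try {
  oversized.toString('latin1');
} catch (err) {
  message = (err as Error).message;
}
// message: "Cannot create a string longer than 0x1fffffe8 characters"
```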

To Reproduce

  1. Create a large dataset (larger than the V8 string size limit).
  2. Try to persist this data using the persistToFile function.

Expected behavior

The data should be successfully persisted to the file without any errors.

Environment Info

OS: macOS
Node: 20.7.0
Orama: 2.0.0 beta7

Affected areas

Data Insertion

Additional context

Possible Solution:

Consider implementing a streaming approach to write the data to the file, which would avoid converting the entire Buffer into a single string at once.
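The core of that idea can be sketched with Node built-ins alone: write the serialized Buffer to disk in fixed-size slices, so no single string or full-size copy is ever created (writeChunked is a hypothetical helper name, not part of Orama):

```typescript
import fs from 'fs';

// Hypothetical helper: stream a Buffer to disk in fixed-size slices,
// so the whole payload is never materialized as one string.
function writeChunked(buf: Buffer, path: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const out = fs.createWriteStream(path);
    out.on('error', reject);
    out.on('finish', () => resolve());
    const chunkSize = 64 * 1024; // 64 KiB per write
    for (let i = 0; i < buf.length; i += chunkSize) {
      out.write(buf.subarray(i, i + chunkSize));
    }
    out.end();
  });
}
```

The promise resolves on the stream's finish event, so callers can safely await the write before reading the file back.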

Thanks for opening this. @allevo we should rework the persistence plugin if we can reproduce this.

Hi @imertz! Have you tried a different format? For instance, https://docs.oramasearch.com/open-source/plugins/plugin-data-persistence#persisting-the-database-to-disk-server-usage . Would that fit your case?

I'll try it out and come back to you.

IIRC dpack worked for persisting the file to disk, but if the file is larger than 512 MB, restoreFromFile won't work. This is mainly because all of the restore implementations rely on the toString() method at some point, which means they try to create a string larger than 512 MB. So the file should be read with fs.createReadStream and written with fs.createWriteStream.

Here's a naive streaming implementation for Node.js using @msgpack/msgpack (basically the current binary-format solution with streaming support added):

import type { AnyOrama, RawData } from '@orama/orama';
import { create, load, save } from '@orama/orama';
import fs from 'fs';
import { decode, encode } from '@msgpack/msgpack';

export const persistToFile = async (
  db: AnyOrama,
  outputFile: string,
): Promise<void> => {
  const dbExport = await save(db);
  const msgpack = encode(dbExport);
  // View the encoded bytes without copying the underlying ArrayBuffer.
  const serialized = Buffer.from(
    msgpack.buffer,
    msgpack.byteOffset,
    msgpack.byteLength,
  );

  // Resolve only once the stream has flushed everything to disk.
  return new Promise((resolve, reject) => {
    const writeStream = fs.createWriteStream(outputFile);
    writeStream.on('error', reject);
    writeStream.on('finish', () => resolve());

    // Hex-encode in small slices so no single string approaches the V8 limit.
    const chunkSize = 1024;
    for (let i = 0; i < serialized.length; i += chunkSize) {
      const end = Math.min(i + chunkSize, serialized.length);
      writeStream.write(serialized.subarray(i, end).toString('hex'));
    }
    writeStream.end();
  });
};

const deserialize = async (inputFile: string): Promise<RawData> => {
  return new Promise<RawData>((resolve, reject) => {
    const readStream = fs.createReadStream(inputFile, {
      encoding: 'utf8',
    });
    const chunks: Buffer[] = [];
    // A chunk boundary can split a two-character hex pair, so carry any
    // odd trailing digit over to the next chunk before decoding.
    let pending = '';
    readStream.on('data', (chunk: string) => {
      const hex = pending + chunk;
      const even = hex.length - (hex.length % 2);
      pending = hex.slice(even);
      chunks.push(Buffer.from(hex.slice(0, even), 'hex'));
    });

    readStream.on('end', () => {
      const combinedBuffer = Buffer.concat(chunks);
      resolve(decode(combinedBuffer) as RawData);
    });
    readStream.on('error', (err) => {
      reject(err);
    });
  });
};

export const restoreFromFile = async (inputFile: string) => {
  const deserialized = await deserialize(inputFile);
  // The schema here is only a placeholder; load() replaces it with the
  // schema stored in the snapshot.
  const db = await create({
    schema: {
      __placeholder: 'string',
    },
  });
  await load(db, deserialized);
  return db;
};

Disclaimer: I extracted these functions from a larger codebase, so I haven't actually run this exact piece of code, but hopefully it helps. I'm also not sure the chunking in the persistToFile function is the way to go, but it worked for me.

We also noticed that you can write the msgpack-encoded binary directly to the file instead of converting it to hex first. This makes the file half the size.
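The size difference follows from hex encoding spending two characters per byte, so writing the raw bytes halves the file. A small sketch with Node built-ins (the file names are made up):

```typescript
import fs from 'fs';

const payload = Buffer.from([0x00, 0x7f, 0x80, 0xff]); // 4 raw bytes

// Hex text: two characters per byte, so twice the size on disk.
fs.writeFileSync('/tmp/db-hex.msp', payload.toString('hex'));

// Raw binary: written byte-for-byte.
fs.writeFileSync('/tmp/db-bin.msp', payload);

console.log(fs.statSync('/tmp/db-hex.msp').size); // 8
console.log(fs.statSync('/tmp/db-bin.msp').size); // 4
```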