askorama / orama

🌌 Fast, dependency-free, full-text and vector search engine with typo tolerance, filters, facets, stemming, and more. Works with any JavaScript runtime, browser, server, service!

Home Page: https://docs.orama.com


Cannot create a string longer than 0x1fffffe8 characters when using data-persistence in server

imertz opened this issue · comments

Describe the bug

When trying to persist a large amount of data using the persistToFile function, Node.js throws an error: Cannot create a string longer than 0x1fffffe8 characters. This is due to the V8 engine's hard limit on string length (0x1fffffe8 = 536,870,888 characters, roughly 512 MiB of one-byte characters).
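For reference, the limit itself can be reproduced without Orama: converting a Buffer of more than 0x1fffffe8 bytes to a string throws the same error. A minimal sketch using only Node built-ins:

```typescript
// Node's V8 engine caps string length at 0x1fffffe8 (~536 million) characters.
const LIMIT = 0x1fffffe8;

// One byte past the limit; 'latin1' maps each byte to one character,
// so decoding this buffer would need a string longer than V8 allows.
const oversized = Buffer.alloc(LIMIT + 1);

let message = '';
try {
  oversized.toString('latin1');
} catch (err) {
  message = (err as Error).message;
}
// message: "Cannot create a string longer than 0x1fffffe8 characters"
```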

To Reproduce

  1. Create a large dataset (larger than the V8 string size limit).
  2. Try to persist this data using the persistToFile function.

Expected behavior

The data should be successfully persisted to the file without any errors.

Environment Info

OS: macOS
Node: 20.7.0
Orama: 2.0.0 beta7

Affected areas

Data Insertion

Additional context

Possible Solution:

Consider implementing a streaming approach to write the data to the file, which would avoid converting the entire Buffer into a single string at once.
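The core of that idea can be sketched with Node built-ins alone: write the serialized Buffer to disk in fixed-size slices, so no single string or full-size copy is ever created (writeChunked is a hypothetical helper name, not part of Orama):

```typescript
import fs from 'fs';

// Hypothetical helper: stream a Buffer to disk in fixed-size slices,
// so the whole payload is never materialized as one string.
function writeChunked(buf: Buffer, path: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const out = fs.createWriteStream(path);
    out.on('error', reject);
    out.on('finish', () => resolve());
    const chunkSize = 64 * 1024; // 64 KiB per write
    for (let i = 0; i < buf.length; i += chunkSize) {
      out.write(buf.subarray(i, i + chunkSize));
    }
    out.end();
  });
}
```

The promise resolves on the stream's finish event, so callers can safely await the write before reading the file back.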

Thanks for opening this. @allevo we should rework the persistence plugin if we can reproduce this.

Hi @imertz! Have you tried a different format? For instance, https://docs.oramasearch.com/open-source/plugins/plugin-data-persistence#persisting-the-database-to-disk-server-usage . Would that fit your case?

I'll try it out and come back to you.

IIRC dpack worked for persisting the file to disk, but if the file is larger than 512 MB, restoreFromFile won't work. This is mainly because all of the restore implementations rely on the toString() method at some point, which means they try to create a string larger than 512 MB. So the file should be read with fs.createReadStream and written with fs.createWriteStream.

Here's a naive streaming implementation for Node.js using @msgpack/msgpack (basically the current binary-format solution with streaming support added):

import type { AnyOrama, RawData } from '@orama/orama';
import { create, load, save } from '@orama/orama';
import fs from 'fs';
import { decode, encode } from '@msgpack/msgpack';

export const persistToFile = async (
  db: AnyOrama,
  outputFile: string,
): Promise<void> => {
  const dbExport = await save(db);
  const msgpack = encode(dbExport);
  // View the encoded bytes without copying the underlying ArrayBuffer.
  const serialized = Buffer.from(
    msgpack.buffer,
    msgpack.byteOffset,
    msgpack.byteLength,
  );

  // Resolve only once the stream has flushed everything to disk.
  return new Promise((resolve, reject) => {
    const writeStream = fs.createWriteStream(outputFile);
    writeStream.on('error', reject);
    writeStream.on('finish', () => resolve());

    // Hex-encode in small slices so no single string approaches the V8 limit.
    const chunkSize = 1024;
    for (let i = 0; i < serialized.length; i += chunkSize) {
      const end = Math.min(i + chunkSize, serialized.length);
      writeStream.write(serialized.subarray(i, end).toString('hex'));
    }
    writeStream.end();
  });
};

const deserialize = async (inputFile: string): Promise<RawData> => {
  return new Promise<RawData>((resolve, reject) => {
    const readStream = fs.createReadStream(inputFile, {
      encoding: 'utf8',
    });
    const chunks: Buffer[] = [];
    // A chunk boundary can split a two-character hex pair, so carry any
    // odd trailing digit over to the next chunk before decoding.
    let pending = '';
    readStream.on('data', (chunk: string) => {
      const hex = pending + chunk;
      const even = hex.length - (hex.length % 2);
      pending = hex.slice(even);
      chunks.push(Buffer.from(hex.slice(0, even), 'hex'));
    });

    readStream.on('end', () => {
      const combinedBuffer = Buffer.concat(chunks);
      resolve(decode(combinedBuffer) as RawData);
    });
    readStream.on('error', (err) => {
      reject(err);
    });
  });
};

export const restoreFromFile = async (inputFile: string) => {
  const deserialized = await deserialize(inputFile);
  // The schema here is only a placeholder; load() replaces it with the
  // schema stored in the snapshot.
  const db = await create({
    schema: {
      __placeholder: 'string',
    },
  });
  await load(db, deserialized);
  return db;
};

Disclaimer: I extracted these functions from a larger codebase, so I haven't actually run this exact piece of code, but hopefully it helps. I'm also not sure the chunking in the persistToFile function is the way to go, but it worked for me.

We also noticed that you can write the msgpack-encoded binary directly to the file instead of converting it to hex first. This makes the file half the size.
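The size difference follows from hex encoding spending two characters per byte, so writing the raw bytes halves the file. A small sketch with Node built-ins (the file names are made up):

```typescript
import fs from 'fs';

const payload = Buffer.from([0x00, 0x7f, 0x80, 0xff]); // 4 raw bytes

// Hex text: two characters per byte, so twice the size on disk.
fs.writeFileSync('/tmp/db-hex.msp', payload.toString('hex'));

// Raw binary: written byte-for-byte.
fs.writeFileSync('/tmp/db-bin.msp', payload);

console.log(fs.statSync('/tmp/db-hex.msp').size); // 8
console.log(fs.statSync('/tmp/db-bin.msp').size); // 4
```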