Reduce memory usage or build index size?

Question

H4ad opened this issue 9 months ago

Vinicius Lourenço commented 9 months ago

Current Behavior

After #441 was merged, the index size when saved to .json dropped by ~34%.

However, memory usage increased. To measure it, I rewrote the test to generate truly random data, so the results don't benefit from V8's string cache:

import { faker } from "@faker-js/faker";
import { writeFileSync } from "fs";

// create fake data
const data = Array.from({length: 100000}, () => ({
  id: faker.string.uuid(),
  name: faker.person.firstName() + '_' + Math.random().toString(16).slice(2),
  surname: faker.person.lastName() + '_' + Math.random().toString(16).slice(2),
  fiscalCode: faker.string.alphanumeric({length: 16, casing: "uppercase"}),
  season: faker.number.int({min: 2010, max: 2020}),
}));

writeFileSync('./large-object.json', JSON.stringify(data, null, 2));

Then I measured the memory usage:

import { readFileSync } from "fs";
import { resolve } from "path";
import { fileURLToPath } from "url";
import { create, insertMultiple, search } from "./dist/index.js";

// function to print the used memory
function printUsedMemory() {
  const used = process.memoryUsage().heapUsed / 1024 / 1024;
  console.log(`The script uses approximately ${Math.round(used * 100) / 100} MB`);
}

const __dirname = resolve(fileURLToPath(import.meta.url), '..');

const data = JSON.parse(
  readFileSync(__dirname + '/large-object.json', 'utf8'),
);

printUsedMemory();

// create index and add data
const db = await create({
  schema: {
    id: "string",
    name: "string",
    surname: "string",
    fiscalCode: "string",
    season: "number",
  },
});

await insertMultiple(db, data);
console.log("Index created");

printUsedMemory();

// search the index
const results = await search(db, {
  term: "john",
  properties: "*",
});

printUsedMemory();

With internalId enabled, the results are:

The script uses approximately 44.61 MB
Index created
The script uses approximately 546.78 MB
The script uses approximately 549.3 MB

But if we disable or remove that feature:

The script uses approximately 44.65 MB
Index created
The script uses approximately 502.32 MB
The script uses approximately 504.65 MB

Without the feature, heap usage after indexing is ~8% lower (502.32 MB vs. 546.78 MB).
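
For reference, the ~8% figure can be checked from the heap numbers printed by the two runs. This is only a quick arithmetic sketch; the helper name `reductionPercent` and the variable names are made up for illustration:

```javascript
// Relative reduction, in percent, when going from `before` to `after`.
// `reductionPercent` is an illustrative helper, not part of any library.
function reductionPercent(before, after) {
  return ((before - after) / before) * 100;
}

// Heap usage in MB, copied from the two runs above.
const withInternalId = { afterIndexing: 546.78, afterSearch: 549.3 };
const withoutInternalId = { afterIndexing: 502.32, afterSearch: 504.65 };

const afterIndexing = reductionPercent(
  withInternalId.afterIndexing,
  withoutInternalId.afterIndexing,
);
const afterSearch = reductionPercent(
  withInternalId.afterSearch,
  withoutInternalId.afterSearch,
);

// Both values come out at roughly 8%.
console.log(`After indexing: ${afterIndexing.toFixed(2)}% less heap`);
console.log(`After search: ${afterSearch.toFixed(2)}% less heap`);
```

Note that `heapUsed` is sampled without forcing a garbage collection, so individual runs may vary by a few MB.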

New Behavior

Can we have both: store the IDs as internal IDs and then remap them? Or is that too much work?
Alternatively, can we give the user a flag to choose between a smaller index size and lower memory usage?
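
To make the trade-off concrete, here is a minimal sketch of what an ID remapping layer could look like. It is an assumption about the general technique, not the library's actual implementation: `createIdStore`, `getInternalId`, and `getExternalId` are hypothetical names, and the IDs are made-up sample values. Posting lists hold small integers, which serialize compactly, but the lookup `Map` and the reverse array have to live in memory alongside the index.

```javascript
// Hypothetical sketch of an external <-> internal ID store.
function createIdStore() {
  return {
    idToInternal: new Map(), // external string ID -> internal integer ID
    internalToId: [],        // position (internal ID - 1) -> external string ID
  };
}

// Return the internal ID for an external ID, assigning a new one if needed.
function getInternalId(store, id) {
  const existing = store.idToInternal.get(id);
  if (existing !== undefined) return existing;
  const internalId = store.internalToId.length + 1; // internal IDs start at 1
  store.idToInternal.set(id, internalId);
  store.internalToId.push(id);
  return internalId;
}

// Map an internal ID back to the external ID the user supplied.
function getExternalId(store, internalId) {
  return store.internalToId[internalId - 1];
}

const store = createIdStore();
const externalIds = ["3f2b8c1e-0001", "3f2b8c1e-0002", "3f2b8c1e-0003"];

// The same documents referenced from several terms, as in an inverted index.
const postingsWithStrings = { john: externalIds, doe: externalIds, smith: externalIds };
const toInternal = (ids) => ids.map((id) => getInternalId(store, id));
const postingsWithInternal = {
  john: toInternal(externalIds),
  doe: toInternal(externalIds),
  smith: toInternal(externalIds),
};

// Serialized size: each string ID is written once, plus integer posting lists.
const sizeStrings = JSON.stringify(postingsWithStrings).length;
const sizeInternal = JSON.stringify({
  ids: store.internalToId,
  postings: postingsWithInternal,
}).length;

console.log({ sizeStrings, sizeInternal });
console.log(getExternalId(store, postingsWithInternal.john[0]));
```

Under this assumption, a flag could decide whether the `Map` is kept at runtime or the remapping is applied only when saving and loading the index; whether that fits the existing code is exactly the open question.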