fergiemcdowall / search-index

A persistent, network resilient, full text search library for the browser and Node.js

Building index by calling PUT once for each document does not return all matches in query afterwards

elega9t opened this issue

Hi,

I am building my index by asynchronously loading files and calling PUT for each file once it's loaded.

QUERY on the index then behaves inconsistently. E.g., let's say the number of matching documents for a query is 3; sometimes the query returns 1 match, sometimes 2, but never 3.

I have tried changing the code to collect all documents in an array and do just one PUT. In that case, the query returns all 3 documents.

I do not want to load all documents into memory at once, so I would like to go back to calling PUT for each document as it is loaded.

Any insights on this issue would be very much appreciated.

Thanks.

It sounds like you may have some basic issues with the way you have set search-index up.

If you can post the code here, then we can try to help you.

Hi, here are the relevant extracts of the code:

const { PUT, QUERY } = await si({
  db: memdown('journals')
});

const files: Array<string> = [...]; // List of files

const indexingPromises = [];
_.forEach(files, (file: string) => {
  indexingPromises.push(this.indexFile(file));
});
Promise.all(indexingPromises)
  .then(() => {
    console.log(moment.utc().format() + ': Finished building search index');
  })
  .catch((err) => {
    console.log(moment.utc().format() + ': Failed building search index', err);
  });

and

indexFile(file: string) {
  return new Promise((resolve, reject) => {
    const promises = [];

    const readStream = fs.createReadStream(file).pipe(split(JSON.parse));

    readStream.on('data', (entry) => {
      promises.push(PUT(entry));
    });

    readStream.on('end', () => {
      Promise.all(promises)
        .then(() => {
          resolve();
        });
    });

    readStream.on('error', (err) => {
      reject(err);
    });
  });
}

Hope this explains the setup I am using. Thanks

How about arranging your code like this:

(async function() {
  const fs = require('fs')
  const si = require('search-index')
  const memdown = require('memdown')

  const { PUT, QUERY } = await si({
    db: memdown('journals')
  })

  const batch = fs.readdirSync('journals').map(
    j => JSON.parse(fs.readFileSync('journals/' + j))
  )
  
  await PUT(batch)

  console.log(JSON.stringify(await QUERY({
    OR: ['dag', 'emmey', 'dorie', 'cordell', 'gregoire']
  }), null, 2))
  
}())

I put together an example repo. As you can see the query always returns a consistent result.

Hi, thanks for taking a look. I wanted to use streams and promises because each file can contain multiple JSON entries and the files themselves can be huge (gigabytes), so reading the entire contents into memory at once is not feasible.

I'll create a branch on the example repo and update the code and sample data to resemble what I am working with.

Hi, I have updated the code in the fork https://github.com/elega9t/search-index-issue-537

If you run the code multiple times, you will notice that inconsistent results are returned.

Based on the query, it should return 3 results, but sometimes it returns 1 and sometimes 2.

(async function() {
  const fs = require('fs');
  const split = require('split');
  const si = require('search-index');
  const memdown = require('memdown');

  const { PUT, QUERY } = await si({
    db: memdown('journals')
  });

  function readFile(file) {
    return new Promise((resolve, reject) => {
      let readStream = fs.createReadStream('journals/' + file).pipe(split(JSON.parse, null, { trailing: false }));

      const promises = [];
      readStream.on('data', (entry) => {
        promises.push(PUT([ entry ]));
      });

      readStream.on('end', () => {
        Promise.all(promises)
          .then(() => resolve());
      });

      readStream.on('error', (err) => {
        reject(err)
      });
    });
  }

  const promises = fs.readdirSync('journals').map(file => readFile(file));

  await Promise.all(promises);

  const results = await QUERY(
    { SEARCH: [ 'Chaudron' ] },
    { DOCUMENTS: true, PAGE: { NUMBER: 0, SIZE: 20 } }
  );

  console.log(JSON.stringify(results));

}())

Thanks

The problem here is that you can't PUT documents concurrently, so the line that pushes PUT([ entry ]) calls into the promises array is creating the problem. You should wait for the previous PUT to finish before starting a new PUT. This should probably be better explained in the documentation.

Generally it's best to PUT batches of documents rather than individual documents, so the best approach here would probably still be to PUT all the objects in a file at once rather than PUTing them one at a time.

Iterating over collections of Promises within the context of streams remains a tricky thing to do, so personally I avoid streams for document processing, even though they seem like a reasonable approach at first glance.
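
That said, if you do end up sticking with individual PUTs, something along these lines (just a sketch, assuming newline-delimited JSON files and the same PUT handle from the snippets above, run inside the same async wrapper) waits for each PUT to resolve before starting the next one, without juggling an array of promises:

const fs = require('fs')
const readline = require('readline')

// Sketch only: PUT is assumed to be the handle returned by si() above.
// readline's async iterator yields one line at a time, so awaiting PUT
// here guarantees the previous PUT has finished before the next starts.
const indexFileSequentially = async file => {
  const lines = readline.createInterface({
    input: fs.createReadStream(file),
    crlfDelay: Infinity
  })
  for await (const line of lines) {
    if (!line.trim()) continue        // skip blank lines
    await PUT([ JSON.parse(line) ])   // sequential: one PUT at a time
  }
}

for (const f of fs.readdirSync('journals')) {
  await indexFileSequentially('journals/' + f)
}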

Thanks a lot for looking into this issue. I will write a flatMap utility to queue PUTs for now.

Hi, I have chained the Promises so that they run sequentially rather than concurrently. However, now I consistently get 2 results rather than the expected 3.

function readFile(file) {
  return new Promise((resolve, reject) => {
    let readStream = fs.createReadStream('journals/' + file).pipe(split(JSON.parse, null, { trailing: false }));

    const promise = Promise.resolve();
    readStream.on('data', (entry) => {
      promise.then(() => {
        console.log('Indexing: ' + JSON.stringify(entry));
        return PUT([ entry ]);
      });
    });

    readStream.on('end', () => {
      promise.then(() => resolve());
    });

    readStream.on('error', (err) => {
      reject(err)
    });
  });
}

let promise = Promise.resolve();

fs.readdirSync('journals').forEach(file => {
  promise = promise.then(() => { return readFile(file); });
});

await promise;

const results = await QUERY(
  { SEARCH: [ 'Chaudron' ] },
  { DOCUMENTS: true, PAGE: { NUMBER: 0, SIZE: 20 } }
);

console.log(JSON.stringify(results));

Please can you take a look at this and suggest a workaround for now?

A sensible approach would be to do something like this:

(async function() {
  const fs = require('fs')
  const split = require('split')
  const si = require('search-index')
  const memdown = require('memdown')

  const { PUT, QUERY } = await si({
    db: memdown('journals')
  });

  const readFile = file => new Promise(resolve => {
    let batch = []
    let readStream = fs.createReadStream('journals/' + file).pipe(
      split(JSON.parse, null, { trailing: false })
    ).on(
      'data', d => batch.push(d)
    ).on(
      'end', () => resolve(batch)
    )
  })

  for (const f of fs.readdirSync('journals')) {
    await PUT(await readFile(f))
  }
  
  const results = await QUERY(
    { SEARCH: [ 'Chaudron' ] },
    { DOCUMENTS: true, PAGE: { NUMBER: 0, SIZE: 20 } }
  );

  console.log(JSON.stringify(results))
  
}())

Hi, the bug in my sequenced code was that I wasn't reassigning the promise after each PUT. I really don't want to accumulate all documents in memory, so returning an array and then doing a single PUT is not suitable.

  let promise = Promise.resolve();
  readStream.on('data', (entry) => {
    promise = promise.then(() => {
      console.log('Indexing: ' + JSON.stringify(entry));
      return PUT([ entry ]);
    });
  });

Now I am getting all 3 documents as expected. Thanks for your help, cheers!

Just to reiterate: PUTing batches is computationally less expensive than PUTing those documents individually, so it will always be quicker to use batches. Start off with the batch approach, and only switch to individual PUTs if/when you run out of memory.
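
If memory does become a constraint before a whole file fits into one batch, a middle ground (again just a sketch, assuming newline-delimited JSON files and the same PUT handle as above, inside the same async wrapper) is to flush fixed-size batches while streaming through each file:

const fs = require('fs')
const readline = require('readline')

// Sketch only: PUT is assumed to be the handle returned by si() above.
// Documents are collected into bounded batches, so memory use stays flat
// while PUTs are still batched rather than issued one document at a time.
const indexFileInBatches = async (file, batchSize = 1000) => {
  const lines = readline.createInterface({
    input: fs.createReadStream(file),
    crlfDelay: Infinity
  })
  let batch = []
  for await (const line of lines) {
    if (!line.trim()) continue
    batch.push(JSON.parse(line))
    if (batch.length >= batchSize) {
      await PUT(batch)                // flush a bounded batch
      batch = []
    }
  }
  if (batch.length) await PUT(batch)  // flush whatever is left
}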

Also, if memory use is a pain point, then don't use memdown; just stick with the default leveldown/leveljs.

In the forthcoming search-index@3 it is possible to PUT in parallel.

Anyway, glad to hear that you got it working! 👍🙂