beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.

Home Page: http://beir.ai

AssertionError: Elastic-Search Window too large, Max-Size = 10000

zuliani99 opened this issue · comments

Using BM25 for sparse embeddings on a fairly large dataset (e.g. FiQA), I get the following assertion error:
AssertionError: Elastic-Search Window too large, Max-Size = 10000

The function that calls BM25 is the following:

from beir.retrieval.search.lexical import BM25Search as BM25  # beir's Elasticsearch-backed BM25

def sparse_embeddings_bm25(dataset_name, corpus, queries, qrels, k_primes):
  '''
  PURPOSE: compute the sparse embeddings using the BM25 implementation from beir and Elasticsearch
  ARGUMENTS:
    - dataset_name: string describing the dataset name
    - corpus: sequence of documents
    - queries: sequence of queries
    - qrels: ground truth of query-document relevance
    - k_primes: list of numbers of top-k' documents to return
  RETURN: see embeddings return values
  '''
  hostname = 'localhost'
  index_name = dataset_name
  initialize = True  # Delete any existing index with the same name and reindex all documents

  print(f'{dataset_name} - BM25')
  model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)  # Define the BM25 model
  return embeddings('Sparse', model, corpus, queries, qrels, k_primes)

I've already tried creating the index before running BM25 and setting initialize = False, but then I somehow need to pass the corpus and the queries to the index.

Note that I'm running the whole application in Google Colab Pro; I don't know whether that is relevant.
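For context, the 10,000 cap matches Elasticsearch's default `index.max_result_window` setting, which the assertion appears to guard against. A minimal sketch of a possible workaround, raising that setting on an existing index before querying (the index name, host URL, and chosen window size here are assumptions about your setup, not values from this issue):

```python
# Hypothetical sketch: raise Elasticsearch's per-query result window so that
# requests for more than 10,000 hits are accepted. Assumes a local
# Elasticsearch at http://localhost:9200 and an index named "fiqa".

def window_settings(max_window: int) -> dict:
    """Build the index-settings payload that raises index.max_result_window."""
    return {"index": {"max_result_window": max_window}}

# With the official elasticsearch Python client (assumed installed):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.indices.put_settings(index="fiqa", body=window_settings(50000))
```

Note that raising the window increases memory use per query on the Elasticsearch side, so it is usually set only as high as the largest k' actually needed.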