castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Home Page: http://pyserini.io/

Pyserini.encode: 'texts': batch_info['text'], KeyError: 'text'

ashishakkumar opened this issue

I have a JSONL file containing dictionaries with entries like this:

{'id': 'NCT01740609',
'contents': "A Study To Assess The Safety Of PF-06342674 In Healthy Volunteers@&The purpose of this study is to evaluate the safety, tolerability, pharmacokinetics and immunogenicity of single escalating doses PF-06342674.@&None@&COMPLETED@&['Healthy']@&ALL@&False@&18 Years@&None@&['Phase 1', 'RN168', 'Healthy Volunteers']@&None@&None@&None@&Inclusion Criteria:\n\n* Male subjects and female of non-childbearing potential subjects between the ages of 18 and 55.\n* BMI between 18.5 to 32 kg/m2.\n* Total body weight ≥40 kg and ≤120 kg.\n\nExclusion Criteria:\n\n* Previous treatment with an antibody within 6 months prior to Day 1.\n* Pregnant or nursing females; females of childbearing potential.\n* History of sensitivity to heparin or heparin-induced thrombocytopenia.@&ALL@&None@&None@&None@&None@&2014-06@&COMPLETED"}

  • The delimiter in this case is @&
  • The total number of fields is 20, all separated by the delimiter (see the sketch after this list)
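
A minimal sketch (plain Python, not Pyserini code) of what I expect --delimiter/--fields to do with such a record, i.e. split the contents string into one value per field name:

record = {'id': 'NCT01740609',
          'contents': "A Study To Assess The Safety Of PF-06342674 In Healthy Volunteers@&The purpose of this study ...@&None@&COMPLETED"}
values = record['contents'].split('@&')   # one value per field name, in order
print(len(values), values[0])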

I am trying to encode the documents (JSONL) using the dense encoder:

python -m pyserini.encode \
  input   --corpus transformed_data.jsonl \
          --fields 'brief_title', 'brief_summary', 'detailed_description', 'overall_status', 'condition', 'gender', 'gender_based', 'minimum_age', 'maximum_age', 'keyword', 'mesh_term', 'drugs', 'diseases', 'Eligibility', 'sex', 'organ', 'adverse_events', 'serious_affect', 'country', 'completion_date', 'Status' \
          --delimiter "@&" \
          --shard-id 0 \
          --shard-num 1 \
  output  --embeddings pyserini_embeddings \
          --to-faiss \
  encoder --encoder castorini/tct_colbert-v2-hnp-msmarco \
          --fields 'brief_title', 'brief_summary', 'detailed_description', 'overall_status', 'condition', 'gender', 'gender_based', 'minimum_age', 'maximum_age', 'keyword', 'mesh_term', 'drugs', 'diseases', 'Eligibility', 'sex', 'organ', 'adverse_events', 'serious_affect', 'country', 'completion_date', 'Status' \
          --batch 32 \
          --device cpu

The error after running the above command is:

Output: 481384it [00:11, 41781.69it/s]
0%|          | 0/15044 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Applications/anaconda3/envs/pyserini/lib/python3.8/site-packages/pyserini/encode/main.py", line 138, in <module>
    'texts': batch_info['text'],
KeyError: 'text'

I inspected main.py in pyserini/encode; the parser for the "--fields" argument is:
input_parser.add_argument('--fields', help='fields that contents in jsonl has (in order)', nargs='+', default=['text'], required=False)
After this parsing, the following code runs:

collection_iterator = JsonlCollectionIterator(args.input.corpus, args.input.fields, args.input.docid_field, delimiter)
with embedding_writer:
    for batch_info in collection_iterator(batch_size, args.input.shard_id, args.input.shard_num):
        kwargs = {
            'texts': batch_info['text'],    # the 'text' key is hard-coded here
            'titles': batch_info['title'] if 'title' in args.encoder.fields else None,
            'expands': batch_info['expand'] if 'expand' in args.encoder.fields else None,
            'fp16': args.encoder.fp16,
            'max_length': args.encoder.max_length,
            'add_sep': args.encoder.add_sep,
        }

This means the encoder driver always expects a "text" field and can optionally use a "title" and an "expand" field. Is it possible to extend it to an arbitrary number of fields? Thanks!
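
For reference, a minimal sketch (plain Python, not Pyserini internals, field names taken from the command above) of why the KeyError seems to happen: passing custom --fields replaces the default ['text'], so the batches are keyed by those custom names and batch_info['text'] has nothing to find:

fields = ['brief_title', 'brief_summary']          # custom --fields (subset shown)
batch_info = {name: ['...'] for name in fields}    # batches keyed by the field names
batch_info['text']                                 # -> KeyError: 'text'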

Hi! If your goal is to encode an entire piece of text, there's no need to specify --fields under input or encoder. Simply aggregating everything into the contents and stripping away any delimiters through your preprocessing scripts should suffice.
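
In case a concrete example helps, a minimal preprocessing sketch along those lines (the input filename is taken from the question; the output filename is just an example): replace the @& delimiter, and any newlines for good measure, with plain spaces so each document is a single blob of text under contents:

import json

# Sketch only: flatten each record's contents into one undelimited text blob.
with open('transformed_data.jsonl') as fin, open('transformed_data_clean.jsonl', 'w') as fout:
    for line in fin:
        doc = json.loads(line)
        doc['contents'] = doc['contents'].replace('@&', ' ').replace('\n', ' ')
        fout.write(json.dumps(doc) + '\n')

The encode command can then be pointed at transformed_data_clean.jsonl with both --fields lists and --delimiter dropped.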

The parameter encoder --fields <fields supported by your encoder> is used to direct the encoder on which inputs to consider. For instance, tct-colbert can process texts (the default argument for Pyserini text encoders) as well as titles. Various encoders might accept different types of input fields, provided they have been appropriately trained on such data. For example, unicoil can also process expands.
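
For concreteness, a hypothetical corpus whose JSONL lines carry separate title and text fields could be encoded with something like the following (corpus and output paths are illustrative):

python -m pyserini.encode \
  input   --corpus corpus_with_titles.jsonl \
          --fields title text \
  output  --embeddings pyserini_embeddings \
          --to-faiss \
  encoder --encoder castorini/tct_colbert-v2-hnp-msmarco \
          --fields title text \
          --batch 32 \
          --device cpu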

Got it. Thanks.