castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Home Page: http://pyserini.io/

Error when using four fields in contents, following pyserini/docs/usage-index.md

ywm5 opened this issue

commented

E:\Anaconda\envs\hyde21\Lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()

0it [00:00, ?it/s]
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "E:\Anaconda\envs\hyde21\Lib\site-packages\pyserini\encode\__main__.py", line 140, in <module>
    collection_iterator = JsonlCollectionIterator(args.input.corpus, args.input.fields, args.input.docid_field, delimiter)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Anaconda\envs\hyde21\Lib\site-packages\pyserini\encode\_base.py", line 72, in __init__
    self.all_info = self._load(collection_path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Anaconda\envs\hyde21\Lib\site-packages\pyserini\encode\_base.py", line 145, in _load
    raise ValueError(
ValueError: 4 fields are found at Line#0 in file tests/resources/simple_cacm_corpus.json.1 fields expected.Line content: www.url.com
title
this is the contents.
document expansion

commented

!python -m pyserini.encode input --corpus tests/resources/simple_cacm_corpus.json --fields text --delimiter "\n" --shard-id 0 --shard-num 1 output --embeddings path/to/output/dir encoder --encoder castorini/tct_colbert-v2-hnp-msmarco --fields text --batch 32 --fp16
simple_cacm_corpus.json is as follows:
{ "id": "doc1", "contents": "www.url.com\ntitle\nthis is the contents.\ndocument expansion"}

commented

Oh, I forgot to list all the fields. The input --fields option has to name every delimiter-separated part of contents, not just the one being encoded. The corrected command:
!python -m pyserini.encode input --corpus tests/resources/simple_cacm_corpus.json --fields url title text expand --delimiter "\n" --shard-id 0 --shard-num 1 output --embeddings path/to/output/dir --to-faiss encoder --encoder castorini/tct_colbert-v2-hnp-msmarco --fields text --batch 32 --fp16
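
For anyone hitting the same ValueError, here is a minimal sketch of the kind of check that produces it. This is not Pyserini's actual code: split_contents is a hypothetical helper, and the error text is only modelled on the message in the traceback. It assumes the contents string is split on the delimiter and the number of pieces must equal the number of names passed to input --fields.

```python
import json


def split_contents(contents, fields, delimiter="\n"):
    """Hypothetical helper (not part of Pyserini): split `contents` on
    `delimiter` and pair each piece with the field name at the same position."""
    parts = contents.split(delimiter)
    if len(parts) != len(fields):
        # Modelled on the error in the traceback: piece count != field count.
        raise ValueError(f"{len(parts)} fields found, {len(fields)} expected")
    return dict(zip(fields, parts))


# The single corpus line from the issue, written as raw JSON text.
line = '{ "id": "doc1", "contents": "www.url.com\\ntitle\\nthis is the contents.\\ndocument expansion"}'
doc = json.loads(line)

# One field name for four pieces: the mismatch from the first command.
try:
    split_contents(doc["contents"], ["text"])
    mismatch = None
except ValueError as err:
    mismatch = str(err)

# Four field names for four pieces: the corrected command's field list.
record = split_contents(doc["contents"], ["url", "title", "text", "expand"])

print(mismatch)
print(record["text"])
```

With all four names given, the encoder-side --fields text can then pick out the third piece on its own, which is presumably why the second command works.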