Using a manifest file for indexing, but paper-qa still uses the LLM during indexing
Snikch63200 opened this issue
Hello,
I created a manifest file as follows, before indexing:
```python
import csv
import os

def make_manifest_file(path, name='Manifest.csv', delimiter=','):
    with open(name, 'w', newline='') as file:
        writer = csv.writer(file, delimiter=delimiter)
        field = ["title", "DOI", "title"]
        writer.writerow(field)
        for currentpath, folders, files in os.walk(path):
            for file in files:
                try:
                    writer.writerow([file, None, os.path.join(currentpath, file)])
                except Exception as e:
                    print(e)
```
The tutorial here (https://github.com/Future-House/paper-qa?tab=readme-ov-file#manifest-files) explains that creating a manifest file avoids LLM usage for information like DOI retrieval, but indexing still uses the LLM...
My problem is that DOI, year, authors, etc. cannot be retrieved from my docs, and in any case I don't need them.
I'm just looking for a simple solution to add docs with only a title (in fact, the name of the PDF file) to an index and then run agentic queries.
Is it possible?
Best regards.
To index documents in PaperQA using only the title (or the name of the PDF file) without relying on LLMs for metadata retrieval, you can create a manifest file with just the file_location and title columns. The doi column can be left empty if you don't need it.
Here's how you can adjust your manifest file creation:
1. Ensure your manifest file includes only the `file_location` and `title` columns. The `file_location` should contain the path to the PDF, and the `title` can be the name of the PDF file.

2. Update your `make_manifest_file` function to correctly reflect these columns:
```python
def make_manifest_file(path, name='Manifest.csv', delimiter=','):
    with open(name, 'w', newline='') as file:
        writer = csv.writer(file, delimiter=delimiter)
        fields = ["file_location", "title"]
        writer.writerow(fields)
        for currentpath, folders, files in os.walk(path):
            for file in files:
                try:
                    writer.writerow([os.path.join(currentpath, file), file])
                except Exception as e:
                    print(e)
```

3. Set `agent.index.manifest_file` in your PaperQA configuration to point to the path of your manifest CSV file.
By following these steps, you should be able to index documents using only the title without needing additional metadata like DOI, year, or authors [1][2].
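For illustration, a two-column manifest produced this way could look like the following (file names are hypothetical):

```csv
file_location,title
Papers/smith_2020.pdf,smith_2020.pdf
Papers/subdir/jones_2021.pdf,jones_2021.pdf
```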
It doesn't work.
The script returns this error while indexing:

```
Error reading manifest file Manifest.csv.
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\envs\PaperQA_env\Lib\site-packages\paperqa\agents\search.py", line 430, in maybe_get_manifest
    records = [DocDetails(**r) for r in csv.DictReader(content.splitlines())]
               ^^^^^^^^^^^^^^^
TypeError: keywords must be strings
```
And it still uses the LLM when performing the indexing task...
Any solution?
To resolve the "TypeError: keywords must be strings" error, ensure that your manifest file is correctly formatted with the expected column names and data types. The manifest file should have the following columns: file_location, doi, title, key, docname, and citation. Here's how you can adjust your manifest file creation:
1. Update your `make_manifest_file` function to include the necessary columns:
```python
def make_manifest_file(path, name='Manifest.csv', delimiter=','):
    with open(name, 'w', newline='') as file:
        writer = csv.writer(file, delimiter=delimiter)
        fields = ["file_location", "doi", "title", "key", "docname", "citation"]
        writer.writerow(fields)
        for currentpath, folders, files in os.walk(path):
            for file in files:
                try:
                    file_location = os.path.join(currentpath, file)
                    writer.writerow([file_location, "", file, file, file, "_"])
                except Exception as e:
                    print(e)
```

2. Ensure that the `use_doc_details` setting is set to `False` in `ParsingSettings` to prevent the use of LLMs during the indexing process:
```python
settings = Settings(
    agent=AgentSettings(
        index=IndexSettings(
            manifest_file="Manifest.csv",
            # other settings...
        )
    ),
    parsing=ParsingSettings(
        use_doc_details=False,
        # other settings...
    ),
)

build_index(settings=settings)
```

By ensuring the manifest file is correctly formatted and setting `use_doc_details` to `False`, you should be able to index documents without using LLMs [1].
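For background on the TypeError itself: `csv.DictReader` stores surplus row fields under the key `None` (its default `restkey`), and unpacking such a dict with `**` fails because keyword names must be strings. A standalone repro, not specific to paper-qa:

```python
import csv

# Header declares two columns, but the data row has three values; the
# extra value is stored under the key None (DictReader's default restkey).
content = "file_location,title\npaper.pdf,Paper Title,surplus\n"
row = next(csv.DictReader(content.splitlines()))
print(row)  # {'file_location': 'paper.pdf', 'title': 'Paper Title', None: ['surplus']}

try:
    dict(**row)  # stand-in for DocDetails(**r)
except TypeError as err:
    print(err)  # keywords must be strings
```

A delimiter or column-count mismatch between how the manifest is written and how it is read is one common way to end up in this situation.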
use_doc_details is correctly set to False in the ParsingSettings. That's why I don't understand why indexing still uses the LLM... Is it a bug?
Where can I find the build_index function you mentioned?
The build_index function is defined in the paperqa/agents/__init__.py file. Its role is to build a PaperQA search index, and it can also be triggered automatically when using the ask function [1].
If use_doc_details is set to False and the system still uses LLMs during indexing, it might be due to other settings or conditions in the code that trigger LLM usage. Ensure that all relevant settings are correctly configured and that there are no other parts of the code that might override this setting. If everything seems correct and the issue persists, it could potentially be a bug.
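For reference, a minimal invocation of `build_index` might look like this (a sketch; `settings` is assumed to be an already configured `Settings` object):

```python
from paperqa import Settings
from paperqa.agents import build_index

settings = Settings()  # configure paper_directory, agent.index, etc. as needed
# Synchronous entry point; it wraps the get_directory_index coroutine
index = build_index(settings=settings)
```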
Cannot solve this problem...
I'm trying to index without using the LLM because the LLM blocks on some PDF files (I guess it is a problem with the PDFs themselves). Setting parsing.disable_doc_valid_check to True or False has no effect.
Unfortunately, I can't get the error output back to help understand the problem (I set verbose=2).
Setting use_doc_details to False resolved this problem with the Doc() object's add_document method, preventing LLM usage when adding documents.
Any solution?
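For context, the non-agentic path that avoided LLM calls might look like this minimal sketch, assuming the `Docs.add` signature of paper-qa 5.x (supplying `citation` and `docname` up front so nothing needs to be inferred):

```python
from paperqa import Docs, Settings

settings = Settings()
settings.parsing.use_doc_details = False  # skip LLM-based metadata inference

docs = Docs()
# With citation and docname supplied, no LLM call should be needed
# to register the document.
docs.add("paper.pdf", citation="_", docname="paper.pdf", settings=settings)
```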
Half-solved the problem:
Indexing still uses the LLM, I don't know why...
But the LLM doesn't 'block' anymore after setting agent.index.concurrency from 4 to 1 and answer.max_concurrent_requests from 4 to 1.
I guess my local LLM was overloaded...
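For reference, the two settings changed here:

```python
# Throttle paper-qa to one request at a time so a local Ollama
# server isn't hit with concurrent generations
settings.agent.index.concurrency = 1
settings.answer.max_concurrent_requests = 1
```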
Hi,
I am battling with the same question (using a manifest file for indexing).
I followed @dosu's recommendation, but I got the following error when calling build_index(settings=settings):
```
/xxx/yyy/.venv/lib/python3.12/site-packages/pygments/regexopt.py:26: RuntimeWarning: coroutine 'get_directory_index' was never awaited
  def regex_opt_inner(strings, open_paren):
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
/xxx/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/re/__init__.py:260: RuntimeWarning: coroutine 'get_directory_index' was never awaited
  return pattern.translate(_special_chars_map)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[11], line 2
      1 # index building
----> 2 build_index(settings=settings)

File /xxx/yyy/.venv/lib/python3.12/site-packages/paperqa/agents/__init__.py:141, in build_index(index_name, directory, settings)
    139 if directory:
    140     settings.agent.index.paper_directory = directory
--> 141 return get_loop().run_until_complete(get_directory_index(settings=settings))

File /xxx/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py:662, in BaseEventLoop.run_until_complete(self, future)
    651 """Run until the Future is done.
    652
    653 If the argument is a coroutine, it is wrapped in a Task.
    (...)
    659 Return the Future's result, or raise its exception.
    660 """
    661 self._check_closed()
--> 662 self._check_running()
    664 new_task = not futures.isfuture(future)
    665 future = tasks.ensure_future(future, loop=self)

File /xxx/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py:621, in BaseEventLoop._check_running(self)
    619 def _check_running(self):
    620     if self.is_running():
--> 621         raise RuntimeError('This event loop is already running')
    622     if events._get_running_loop() is not None:
    623         raise RuntimeError(
    624             'Cannot run the event loop while another loop is running')

RuntimeError: This event loop is already running
```
Complete code below:
```python
import os

from paperqa import Settings
from paperqa.agents import build_index
from paperqa.settings import AgentSettings, IndexSettings, ParsingSettings

os.environ['OPENAI_API_KEY'] = "ollama"

model = "ollama/llama3.2"
paper_directory = "data/subsample14"
manifest_file = "data/subsample14/manifest.csv"

local_llm_config = {
    "model_list": [
        {
            "model_name": model,
            "litellm_params": dict(
                model=model,
                api_base="http://localhost:11434",
            ),
        }
    ]
}

agent = AgentSettings(
    agent_llm=model,
    agent_llm_config=local_llm_config,
    index=IndexSettings(
        manifest_file=manifest_file,
    ),
)

settings = Settings(
    temperature=0.5,
    llm=model,
    llm_config=local_llm_config,
    summary_llm=model,
    summary_llm_config=local_llm_config,
    embedding="ollama/mxbai-embed-large",
    agent=agent,
    paper_directory=paper_directory,
    parsing=ParsingSettings(
        use_doc_details=False,
    ),
    # verbosity=3
)

settings.answer.answer_max_sources = 3
settings.answer.evidence_k = 5

# index building
build_index(settings=settings)
```

To address the issues you're encountering:
1. **LLM Usage During Indexing**: If `use_doc_details` is set to `False` and LLMs are still being used, ensure that no other settings or conditions in your configuration are triggering LLM usage. Double-check the manifest file to ensure it is correctly formatted with the necessary columns (`file_location`, `doi`, `title`) and that the `doi` column is not required if you don't need it [1].

2. **Concurrency Settings**: Adjusting `agent.index.concurrency` and `answer.max_concurrent_requests` to 1 seems to have alleviated the blocking issue, likely due to reducing the load on your local LLM. This suggests that the problem might be related to resource limitations when handling multiple concurrent requests.

3. **Event Loop Error**: The "event loop is already running" error typically occurs in environments like Jupyter Notebooks. To resolve this, you can use the `nest_asyncio` library to allow nested event loops. Add the following code before calling `build_index`:

```python
import nest_asyncio
nest_asyncio.apply()
```
This should help you avoid the event loop error by allowing the existing loop to run nested tasks.
By ensuring these configurations and adjustments, you should be able to index documents without unnecessary LLM usage and avoid event loop issues.
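Putting the pieces together for a notebook session (a sketch; `settings` as constructed in your snippet above):

```python
import nest_asyncio

nest_asyncio.apply()  # must run before build_index so run_until_complete can nest

from paperqa.agents import build_index

build_index(settings=settings)
```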
Hi @Snikch63200!
We've been running PaperQA2 on a dataset with similar requirements, and made a tutorial on that.
Also, about this line in your error:
```python
records = [DocDetails(**r) for r in csv.DictReader(content.splitlines())]
```

We don't have that line anymore. I think it could be useful to install the latest paper-qa==5.17.0 version, since we made some related changes when working with that dataset.
Please let me know if this helps.
Hi @JoaquinPolonuer,
Thanks for your answer.
I've updated to 5.17.0.
I created the manifest as you recommended, but it still uses the LLM while indexing...
Here's my complete code:
```python
# -*- coding: utf-8 -*-
from paperqa.settings import Settings, AgentSettings, AnswerSettings
from paperqa.agents import ask, build_index
from paperqa import Docs, HybridEmbeddingModel, SparseEmbeddingModel, LiteLLMEmbeddingModel, SentenceTransformerEmbeddingModel, NumpyVectorStore, embedding_model_factory
from prompts_french import summary_prompt, summary_json_prompt, qa_prompt, select_paper_prompt, citation_prompt, structured_citation_prompt, default_system_prompt, summary_json_system_prompt
import os
import csv
import litellm
import glob
import compress_pickle
from paperqa.agents.main import agent_query
from paperqa.agents.search import get_directory_index
import asyncio
from datetime import datetime
from litellm.caching.caching import Cache

litellm.set_verbose = True
litellm.stream = True
# litellm.cache = Cache()
litellm.request_timeout = 600

os.environ["OPENAI_API_KEY"] = "sk-no-key-required"
os.environ['MISTRAL_API_KEY'] = "sk-no-key-required"

local_llm_config = dict(
    model_list=[
        dict(
            model_name="ollama/dolphin-mixtral:8x7b",
            litellm_params=dict(
                model="ollama/dolphin-mixtral:8x7b",
                api_type="ollama",
                api_base="http://localhost:11434",
                stream=True,
                caching=True,
                cache=True,
                # format='json',
                mirostat=0,
                mirostat_eta=0.1,
                mirostat_tau=5.0,
                num_ctx=8192,  # Default: 2048
                repeat_last_n=64,  # How far back the model looks to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
                repeat_penalty=1.1,
                temperature=0.5,
                seed=0,
                tfs_z=1,
                num_predict=8192,  # Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)
                max_tokens=8192,
                # low_vram=True,
                top_k=20,
                top_p=0.9,
                min_p=0.0,
                timeout=600,
            ),
        ),
    ]
)

settings = Settings(
    llm="ollama/dolphin-mixtral:8x7b",
    llm_config=local_llm_config,
    summary_llm="ollama/dolphin-mixtral:8x7b",
    summary_llm_config=local_llm_config,
    verbosity=0,
    batch_size=1024,
    embedding="ollama/mxbai-embed-large",  # "ollama/granite3-dense:8b",
    index_directory="./indexes",
    paper_directory="./Papers/ToAdd",
    agent=AgentSettings(
        agent_llm_config=local_llm_config,
        agent_llm="ollama/dolphin-mixtral:8x7b",
        agent_type="fake",
        search_count=8,
        should_pre_search=False,
        index_concurrency=1,
        timeout=600.,
    ),
)

settings.prompts.qa = qa_prompt
# settings.prompts.post = "Traduire la réponse ({answer}) en Français si elle n'est pas déjà en Français."
settings.prompts.summary = summary_prompt
settings.prompts.select = select_paper_prompt
settings.prompts.system = default_system_prompt
settings.prompts.summary_json = summary_json_prompt
settings.prompts.use_json = True
# settings.prompts.summary_json_system = summary_json_system_prompt
settings.parsing.citation_prompt = citation_prompt
settings.parsing.structured_citation_prompt = structured_citation_prompt

settings.answer.evidence_summary_length = "Environs 100 mots."
settings.answer.answer_length = "Environs 500 mots, mais peut être plus longue si nécessaire."
settings.answer.answer_max_sources = 3  # 12
settings.answer.evidence_k = 5  # 25
settings.answer.evidence_retrieval = True
settings.answer.evidence_detailed_citations = True
settings.answer.answer_filter_extra_background = False
settings.answer.max_concurrent_requests = 1

settings.parsing.use_doc_details = False
settings.parsing.defer_embedding = True
settings.parsing.disable_doc_valid_check = True
settings.parsing.chunk_size = 7000
settings.parsing.overlap = 750

settings.agent.index.name = "index1"
settings.agent.index.paper_directory = "./Papers/ToAdd"
settings.agent.index.manifest_file = 'Manifest.csv'
settings.agent.index.index_directory = "./indexes"
settings.agent.index.use_absolute_paper_directory = False
settings.agent.index.recurse_subdirectories = True
settings.agent.index.concurrency = 1
settings.agent.index.sync_with_paper_directory = True


def make_manifest_file(path, name='Manifest.csv', delimiter=';'):
    with open(name, 'w', newline='') as file:
        writer = csv.writer(file, dialect='excel', delimiter=delimiter)
        field = ["file_location", "doi", "title", "key", "docname", "citation", "year"]
        writer.writerow(field)
        for currentpath, folders, files in os.walk(path):
            for file in files:
                try:
                    writer.writerow([os.path.join(currentpath, file), "", file, file, file, "_", "2025"])
                except Exception as e:
                    print(e)


async def ask_question(settings) -> None:
    answer_response = await agent_query(
        query="Rédigez une synthèse des exemples connus de sépultures mérovingiennes ou carolingiennes bien datées. Répondez en Français.",
        settings=settings,
    )
    print(answer_response.session)


if __name__ == "__main__":
    start_time = datetime.now()
    make_manifest_file('./Papers/ToAdd', name='Manifest.csv', delimiter=',')
    build_index(settings=settings)
    asyncio.run(ask_question(settings))
    time_elapsed = datetime.now() - start_time
    print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))
```
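For reference, the first lines of the Manifest.csv written by make_manifest_file above (called with delimiter=',') would look like this, with a hypothetical file name:

```csv
file_location,doi,title,key,docname,citation,year
./Papers/ToAdd/example.pdf,,example.pdf,example.pdf,example.pdf,_,2025
```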
I guess the problem is related to the manifest formatting...
Help welcome...
Best regards.
Hey @Snikch63200 !
Can you check out test_get_directory_index_w_no_citations in tests/test_agents.py?
Note that it checks that the citation is taken from the manifest, you can also see the associated stub_manifest_nocitation.csv file that it uses.
I hope this helps.
You can also try to debug this by putting breakpoints in the functions maybe_get_manifest and fetch_kwargs_from_manifest in paperqa/agents/search.py, which are responsible for getting the data from the file.
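One way to check whether the manifest parses at all, outside of indexing, is to replicate the line from the earlier traceback (a sketch; it assumes DocDetails is importable from the package root):

```python
import csv

from paperqa import DocDetails

with open("Manifest.csv", newline="") as f:
    content = f.read()

# Same construction the indexer performs in maybe_get_manifest
records = [DocDetails(**r) for r in csv.DictReader(content.splitlines())]
print(records[0].title, records[0].citation)
```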
How can I import test_get_directory_index_w_no_citations from tests/test_agents.py?
`from paperqa.tests.test_agents import test_get_directory_index_w_manifest` doesn't work...
To import the test_get_directory_index_w_no_citations function from tests/test_agents.py in the PaperQA project, use the following import statement:
```python
from tests.test_agents import test_get_directory_index_w_no_citations
```

This function is defined as an asynchronous test function using the `pytest.mark.asyncio` decorator. It tests the behavior of the directory index when there are no citations by setting the `manifest_file` to "stub_manifest_nocitation.csv" and asserting that the top result does not have a citation [1].
This returns the following error:

```
No module named 'test.test_agents'
```
> How can I import test_get_directory_index_w_no_citations from tests/test_agents.py?
You can only get this from a local git clone, as our PyPI package only has source code.
Also, I think Joaquin was more just saying to use test_get_directory_index_w_no_citations as a reference, not actually to import and invoke it.
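From a local clone, one way to run that test as a reference check (a sketch using pytest's Python entry point, from the repository root):

```python
import pytest

# Select only the referenced test in tests/test_agents.py
pytest.main(["tests/test_agents.py", "-k", "test_get_directory_index_w_no_citations"])
```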