Using a manifest file for indexing, but paper-qa still uses the LLM during indexing
Snikch63200 opened this issue
Hello,
I created a manifest file as follows, before indexing:
```python
import csv
import os

def make_manifest_file(path, name='Manifest.csv', delimiter=','):
    with open(name, 'w', newline='') as file:
        writer = csv.writer(file, delimiter=delimiter)
        field = ["title", "DOI", "title"]
        writer.writerow(field)
        for currentpath, folders, files in os.walk(path):
            for file in files:
                try:
                    writer.writerow([file, None, os.path.join(currentpath, file)])
                except Exception as e:
                    print(e)
```
The tutorial here (https://github.com/Future-House/paper-qa?tab=readme-ov-file#manifest-files) explains that creating a manifest file avoids LLM usage for information like DOI retrieval, but indexing still uses the LLM...
My problem is that DOI, year, authors, etc. cannot be retrieved from my docs, and in any case I don't need them.
I'm just looking for a simple solution to add docs with only a title (in fact, the name of the PDF file) to an index and then run agentic queries.
Is it possible?
Best regards.
To index documents in PaperQA using only the title (or the name of the PDF file) without relying on LLMs for metadata retrieval, you can create a manifest file with just the file_location and title columns. The doi column can be left empty if you don't need it.
Here's how you can adjust your manifest file creation:
1. Ensure your manifest file includes only the `file_location` and `title` columns. The `file_location` should contain the path to the PDF, and the `title` can be the name of the PDF file.

2. Update your `make_manifest_file` function to correctly reflect these columns:
```python
def make_manifest_file(path, name='Manifest.csv', delimiter=','):
    with open(name, 'w', newline='') as file:
        writer = csv.writer(file, delimiter=delimiter)
        fields = ["file_location", "title"]
        writer.writerow(fields)
        for currentpath, folders, files in os.walk(path):
            for file in files:
                try:
                    writer.writerow([os.path.join(currentpath, file), file])
                except Exception as e:
                    print(e)
```

3. Set `agent.index.manifest_file` in your PaperQA configuration to point to the path of your manifest CSV file.
By following these steps, you should be able to index documents using only the title without needing additional metadata like DOI, year, or authors [1][2].
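For illustration, a two-column manifest produced this way could look like the following (file names are hypothetical):

```csv
file_location,title
Papers/smith_2020.pdf,smith_2020.pdf
Papers/subdir/jones_2021.pdf,jones_2021.pdf
```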
It doesn't work.
The script returns this error while indexing:

```
Error reading manifest file Manifest.csv.
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\envs\PaperQA_env\Lib\site-packages\paperqa\agents\search.py", line 430, in maybe_get_manifest
    records = [DocDetails(**r) for r in csv.DictReader(content.splitlines())]
               ^^^^^^^^^^^^^^^
TypeError: keywords must be strings
```
And it still uses the LLM when performing the indexing task...
Any solution?
To resolve the "TypeError: keywords must be strings" error, ensure that your manifest file is correctly formatted with the expected column names and data types. The manifest file should have the following columns: file_location, doi, title, key, docname, and citation. Here's how you can adjust your manifest file creation:
1. Update your `make_manifest_file` function to include the necessary columns:
```python
def make_manifest_file(path, name='Manifest.csv', delimiter=','):
    with open(name, 'w', newline='') as file:
        writer = csv.writer(file, delimiter=delimiter)
        fields = ["file_location", "doi", "title", "key", "docname", "citation"]
        writer.writerow(fields)
        for currentpath, folders, files in os.walk(path):
            for file in files:
                try:
                    file_location = os.path.join(currentpath, file)
                    writer.writerow([file_location, "", file, file, file, "_"])
                except Exception as e:
                    print(e)
```

2. Ensure that the `use_doc_details` setting is set to `False` in `ParsingSettings` to prevent the use of LLMs during the indexing process:
```python
settings = Settings(
    agent=AgentSettings(
        index=IndexSettings(
            manifest_file="Manifest.csv",
            # other settings...
        )
    ),
    parsing=ParsingSettings(
        use_doc_details=False,
        # other settings...
    ),
)

build_index(settings=settings)
```

By ensuring the manifest file is correctly formatted and setting `use_doc_details` to `False`, you should be able to index documents without using LLMs [1].
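For background on the TypeError itself: `csv.DictReader` stores surplus row fields under the key `None` (its default `restkey`), and unpacking such a dict with `**` fails because keyword names must be strings. A standalone repro, not specific to paper-qa:

```python
import csv

# Header declares two columns, but the data row has three values; the
# extra value is stored under the key None (DictReader's default restkey).
content = "file_location,title\npaper.pdf,Paper Title,surplus\n"
row = next(csv.DictReader(content.splitlines()))
print(row)  # {'file_location': 'paper.pdf', 'title': 'Paper Title', None: ['surplus']}

try:
    dict(**row)  # stand-in for DocDetails(**r)
except TypeError as err:
    print(err)  # keywords must be strings
```

A delimiter or column-count mismatch between how the manifest is written and how it is read is one common way to end up in this situation.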
use_doc_details is correctly set to False in the ParsingSettings. That's why I don't understand why indexing still uses the LLM... Is it a bug?
Where can I find the build_index function you mentioned?
The build_index function is defined in the paperqa/agents/__init__.py file. Its role is to build a PaperQA search index, and it can also be triggered automatically when using the ask function [1].
If use_doc_details is set to False and the system still uses LLMs during indexing, it might be due to other settings or conditions in the code that trigger LLM usage. Ensure that all relevant settings are correctly configured and that there are no other parts of the code that might override this setting. If everything seems correct and the issue persists, it could potentially be a bug.
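For reference, a minimal invocation of `build_index` might look like this (a sketch; `settings` is assumed to be an already configured `Settings` object):

```python
from paperqa import Settings
from paperqa.agents import build_index

settings = Settings()  # configure paper_directory, agent.index, etc. as needed
# Synchronous entry point; it wraps the get_directory_index coroutine
index = build_index(settings=settings)
```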
Cannot solve this problem...
I'm trying to index without using the LLM because the LLM blocks on some PDF files (I guess it is a problem with the PDFs themselves). Setting parsing.disable_doc_valid_check to True or False has no effect.
Unfortunately, I can't get the error output back to help understand the problem (I set verbose=2).
Setting use_doc_details to False resolved this problem with the Doc() object's add_document method, preventing LLM usage when adding documents.
Any solution?
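For context, the non-agentic path that avoided LLM calls might look like this minimal sketch, assuming the `Docs.add` signature of paper-qa 5.x (supplying `citation` and `docname` up front so nothing needs to be inferred):

```python
from paperqa import Docs, Settings

settings = Settings()
settings.parsing.use_doc_details = False  # skip LLM-based metadata inference

docs = Docs()
# With citation and docname supplied, no LLM call should be needed
# to register the document.
docs.add("paper.pdf", citation="_", docname="paper.pdf", settings=settings)
```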
Half-solved the problem:
Indexing still uses the LLM, I don't know why...
But the LLM doesn't 'block' anymore after setting agent.index.concurrency from 4 to 1 and answer.max_concurrent_requests from 4 to 1.
I guess my local LLM was overloaded...
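For reference, the two settings changed here:

```python
# Throttle paper-qa to one request at a time so a local Ollama
# server isn't hit with concurrent generations
settings.agent.index.concurrency = 1
settings.answer.max_concurrent_requests = 1
```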
Hi,
I am battling with the same question (using a manifest file for indexing).
I followed @dosu's recommendation, but I got the following error when calling build_index(settings=settings):
```
/xxx/yyy/.venv/lib/python3.12/site-packages/pygments/regexopt.py:26: RuntimeWarning: coroutine 'get_directory_index' was never awaited
  def regex_opt_inner(strings, open_paren):
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
/xxx/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/re/__init__.py:260: RuntimeWarning: coroutine 'get_directory_index' was never awaited
  return pattern.translate(_special_chars_map)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[11], line 2
      1 # index building
----> 2 build_index(settings=settings)

File /xxx/yyy/.venv/lib/python3.12/site-packages/paperqa/agents/__init__.py:141, in build_index(index_name, directory, settings)
    139 if directory:
    140     settings.agent.index.paper_directory = directory
--> 141 return get_loop().run_until_complete(get_directory_index(settings=settings))

File /xxx/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py:662, in BaseEventLoop.run_until_complete(self, future)
    651 """Run until the Future is done.
    652
    653 If the argument is a coroutine, it is wrapped in a Task.
    (...)
    659 Return the Future's result, or raise its exception.
    660 """
    661 self._check_closed()
--> 662 self._check_running()
    664 new_task = not futures.isfuture(future)
    665 future = tasks.ensure_future(future, loop=self)

File /xxx/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py:621, in BaseEventLoop._check_running(self)
    619 def _check_running(self):
    620     if self.is_running():
--> 621         raise RuntimeError('This event loop is already running')
    622     if events._get_running_loop() is not None:
    623         raise RuntimeError(
    624             'Cannot run the event loop while another loop is running')

RuntimeError: This event loop is already running
```
Complete code below:
```python
import os

from paperqa import Settings
from paperqa.agents import build_index
from paperqa.settings import AgentSettings, IndexSettings, ParsingSettings

os.environ['OPENAI_API_KEY'] = "ollama"

model = "ollama/llama3.2"
paper_directory = "data/subsample14"
manifest_file = "data/subsample14/manifest.csv"

local_llm_config = {
    "model_list": [
        {
            "model_name": model,
            "litellm_params": dict(
                model=model,
                api_base="http://localhost:11434",
            ),
        }
    ]
}

agent = AgentSettings(
    agent_llm=model,
    agent_llm_config=local_llm_config,
    index=IndexSettings(
        manifest_file=manifest_file,
    ),
)

settings = Settings(
    temperature=0.5,
    llm=model,
    llm_config=local_llm_config,
    summary_llm=model,
    summary_llm_config=local_llm_config,
    embedding="ollama/mxbai-embed-large",
    agent=agent,
    paper_directory=paper_directory,
    parsing=ParsingSettings(
        use_doc_details=False,
    ),
    # verbosity=3
)

settings.answer.answer_max_sources = 3
settings.answer.evidence_k = 5

# index building
build_index(settings=settings)
```

To address the issues you're encountering:
1. **LLM Usage During Indexing**: If `use_doc_details` is set to `False` and LLMs are still being used, ensure that no other settings or conditions in your configuration are triggering LLM usage. Double-check the manifest file to ensure it is correctly formatted with the necessary columns (`file_location`, `doi`, `title`) and that the `doi` column is not required if you don't need it [1].

2. **Concurrency Settings**: Adjusting `agent.index.concurrency` and `answer.max_concurrent_requests` to 1 seems to have alleviated the blocking issue, likely due to reducing the load on your local LLM. This suggests that the problem might be related to resource limitations when handling multiple concurrent requests.

3. **Event Loop Error**: The "event loop is already running" error typically occurs in environments like Jupyter Notebooks. To resolve this, you can use the `nest_asyncio` library to allow nested event loops. Add the following code before calling `build_index`:

```python
import nest_asyncio
nest_asyncio.apply()
```
This should help you avoid the event loop error by allowing the existing loop to run nested tasks.
By ensuring these configurations and adjustments, you should be able to index documents without unnecessary LLM usage and avoid event loop issues.
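Putting the pieces together for a notebook session (a sketch; `settings` as constructed in your snippet above):

```python
import nest_asyncio

nest_asyncio.apply()  # must run before build_index so run_until_complete can nest

from paperqa.agents import build_index

build_index(settings=settings)
```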
Hi @Snikch63200!
We've been running PaperQA2 on a dataset with similar requirements, and made a tutorial on that.
Also, about this line in your error:
```python
records = [DocDetails(**r) for r in csv.DictReader(content.splitlines())]
```

We don't have that line anymore. I think it could be useful to install the latest paper-qa==5.17.0 version, since we made some related changes when working with that dataset.
Please let me know if this helps.
Hi @JoaquinPolonuer,
Thanks for your answer.
I've updated to 5.17.0.
I created the manifest as you recommended, but it still uses the LLM while indexing...
Here's my complete code:
```python
# -*- coding: utf-8 -*-
from paperqa.settings import Settings, AgentSettings, AnswerSettings
from paperqa.agents import ask, build_index
from paperqa import Docs, HybridEmbeddingModel, SparseEmbeddingModel, LiteLLMEmbeddingModel, SentenceTransformerEmbeddingModel, NumpyVectorStore, embedding_model_factory
from prompts_french import summary_prompt, summary_json_prompt, qa_prompt, select_paper_prompt, citation_prompt, structured_citation_prompt, default_system_prompt, summary_json_system_prompt
import os
import csv
import litellm
import glob
import compress_pickle
from paperqa.agents.main import agent_query
from paperqa.agents.search import get_directory_index
import asyncio
from datetime import datetime
from litellm.caching.caching import Cache

litellm.set_verbose = True
litellm.stream = True
# litellm.cache = Cache()
litellm.request_timeout = 600

os.environ["OPENAI_API_KEY"] = "sk-no-key-required"
os.environ['MISTRAL_API_KEY'] = "sk-no-key-required"

local_llm_config = dict(
    model_list=[
        dict(
            model_name="ollama/dolphin-mixtral:8x7b",
            litellm_params=dict(
                model="ollama/dolphin-mixtral:8x7b",
                api_type="ollama",
                api_base="http://localhost:11434",
                stream=True,
                caching=True,
                cache=True,
                # format='json',
                mirostat=0,
                mirostat_eta=0.1,
                mirostat_tau=5.0,
                num_ctx=8192,  # Default: 2048
                repeat_last_n=64,  # How far back the model looks to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
                repeat_penalty=1.1,
                temperature=0.5,
                seed=0,
                tfs_z=1,
                num_predict=8192,  # Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)
                max_tokens=8192,
                # low_vram=True,
                top_k=20,
                top_p=0.9,
                min_p=0.0,
                timeout=600,
            ),
        ),
    ]
)

settings = Settings(
    llm="ollama/dolphin-mixtral:8x7b",
    llm_config=local_llm_config,
    summary_llm="ollama/dolphin-mixtral:8x7b",
    summary_llm_config=local_llm_config,
    verbosity=0,
    batch_size=1024,
    embedding="ollama/mxbai-embed-large",  # "ollama/granite3-dense:8b",
    index_directory="./indexes",
    paper_directory="./Papers/ToAdd",
    agent=AgentSettings(
        agent_llm_config=local_llm_config,
        agent_llm="ollama/dolphin-mixtral:8x7b",
        agent_type="fake",
        search_count=8,
        should_pre_search=False,
        index_concurrency=1,
        timeout=600.,
    ),
)

settings.prompts.qa = qa_prompt
# settings.prompts.post = "Traduire la réponse ({answer}) en Français si elle n'est pas déjà en Français."
settings.prompts.summary = summary_prompt
settings.prompts.select = select_paper_prompt
settings.prompts.system = default_system_prompt
settings.prompts.summary_json = summary_json_prompt
settings.prompts.use_json = True
# settings.prompts.summary_json_system = summary_json_system_prompt
settings.parsing.citation_prompt = citation_prompt
settings.parsing.structured_citation_prompt = structured_citation_prompt

settings.answer.evidence_summary_length = "Environs 100 mots."
settings.answer.answer_length = "Environs 500 mots, mais peut être plus longue si nécessaire."
settings.answer.answer_max_sources = 3  # 12
settings.answer.evidence_k = 5  # 25
settings.answer.evidence_retrieval = True
settings.answer.evidence_detailed_citations = True
settings.answer.answer_filter_extra_background = False
settings.answer.max_concurrent_requests = 1

settings.parsing.use_doc_details = False
settings.parsing.defer_embedding = True
settings.parsing.disable_doc_valid_check = True
settings.parsing.chunk_size = 7000
settings.parsing.overlap = 750

settings.agent.index.name = "index1"
settings.agent.index.paper_directory = "./Papers/ToAdd"
settings.agent.index.manifest_file = 'Manifest.csv'
settings.agent.index.index_directory = "./indexes"
settings.agent.index.use_absolute_paper_directory = False
settings.agent.index.recurse_subdirectories = True
settings.agent.index.concurrency = 1
settings.agent.index.sync_with_paper_directory = True


def make_manifest_file(path, name='Manifest.csv', delimiter=';'):
    with open(name, 'w', newline='') as file:
        writer = csv.writer(file, dialect='excel', delimiter=delimiter)
        field = ["file_location", "doi", "title", "key", "docname", "citation", "year"]
        writer.writerow(field)
        for currentpath, folders, files in os.walk(path):
            for file in files:
                try:
                    writer.writerow([os.path.join(currentpath, file), "", file, file, file, "_", "2025"])
                except Exception as e:
                    print(e)


async def ask_question(settings) -> None:
    answer_response = await agent_query(
        query="Rédigez une synthèse des exemples connus de sépultures mérovingiennes ou carolingiennes bien datées. Répondez en Français.",
        settings=settings,
    )
    print(answer_response.session)


if __name__ == "__main__":
    start_time = datetime.now()
    make_manifest_file('./Papers/ToAdd', name='Manifest.csv', delimiter=',')
    build_index(settings=settings)
    asyncio.run(ask_question(settings))
    time_elapsed = datetime.now() - start_time
    print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))
```
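For reference, the first lines of the Manifest.csv written by make_manifest_file above (called with delimiter=',') would look like this, with a hypothetical file name:

```csv
file_location,doi,title,key,docname,citation,year
./Papers/ToAdd/example.pdf,,example.pdf,example.pdf,example.pdf,_,2025
```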
I guess the problem is related to the manifest formatting...
Help welcome...
Best regards.
Hey @Snikch63200 !
Can you check out test_get_directory_index_w_no_citations in tests/test_agents.py?
Note that it checks that the citation is taken from the manifest, you can also see the associated stub_manifest_nocitation.csv file that it uses.
I hope this helps.
You can also try to debug this by putting breakpoints in the functions maybe_get_manifest and fetch_kwargs_from_manifest in paperqa/agents/search.py, which are responsible for getting the data from the file.
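One way to check whether the manifest parses at all, outside of indexing, is to replicate the line from the earlier traceback (a sketch; it assumes DocDetails is importable from the package root):

```python
import csv

from paperqa import DocDetails

with open("Manifest.csv", newline="") as f:
    content = f.read()

# Same construction the indexer performs in maybe_get_manifest
records = [DocDetails(**r) for r in csv.DictReader(content.splitlines())]
print(records[0].title, records[0].citation)
```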
How can I import test_get_directory_index_w_no_citations from tests/test_agents.py?
`from paperqa.tests.test_agents import test_get_directory_index_w_manifest` doesn't work...
To import the test_get_directory_index_w_no_citations function from tests/test_agents.py in the PaperQA project, use the following import statement:
```python
from tests.test_agents import test_get_directory_index_w_no_citations
```

This function is defined as an asynchronous test function using the `pytest.mark.asyncio` decorator. It tests the behavior of the directory index when there are no citations by setting the `manifest_file` to "stub_manifest_nocitation.csv" and asserting that the top result does not have a citation [1].
This returns the following error:

```
No module named 'test.test_agents'
```
> How can I import test_get_directory_index_w_no_citations from tests/test_agents.py?
You can only get this from a local git clone, as our PyPI package only has source code.
Also, I think Joaquin was more just saying to use test_get_directory_index_w_no_citations as a reference, not actually to import and invoke it.
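From a local clone, one way to run that test as a reference check (a sketch using pytest's Python entry point, from the repository root):

```python
import pytest

# Select only the referenced test in tests/test_agents.py
pytest.main(["tests/test_agents.py", "-k", "test_get_directory_index_w_no_citations"])
```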