implement deep search graph
VinciGit00 opened this issue
I'll help you with this as we discussed yesterday. Let's go 🤝
@VinciGit00 the main question is whether each cascading deep search graph should have its own specific config, or whether the config should be propagated down.
@VinciGit00 can you please add some details on which are the things you are planning to do on top of what I have contributed?
Based on my experience while implementing and testing the deep search, these are the areas I feel need a lot of attention:
- Finding out the most relevant links for the search given a large number of links on a webpage
- Accurately deciding if a webpage has any answer related to the question
@mayurdb coping with both your points might require having two cascading conditional nodes in each block:
the first decides whether the information collected so far is enough to answer the initial query. If the answer is negative and the most relevant links have been fetched, the second conditional node should weigh them and decide whether any of them might contribute undiscovered information toward answering the query, or otherwise return with incomplete information.
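To make the two cascading conditions concrete, here is a minimal sketch in Python. All names (`has_enough_information`, `has_promising_links`, the state keys) are illustrative assumptions, not actual ScrapeGraphAI APIs; in practice the first check would likely be an LLM judgment rather than a flag.

```python
def has_enough_information(state: dict) -> bool:
    """First condition: is the information collected so far enough
    to answer the initial query?"""
    # A real implementation would ask an LLM to judge state["answer"]
    # against state["user_query"]; a placeholder flag stands in here.
    return bool(state.get("answer_complete"))

def has_promising_links(state: dict) -> bool:
    """Second condition: among the freshly fetched links, might any
    add undiscovered information?"""
    # A real implementation would weigh relevance scores; here we just
    # check that some candidate link clears a score threshold.
    return any(score > 0.5 for _, score in state.get("ranked_links", []))

def next_step(state: dict) -> str:
    """Cascade the two conditions to pick the next action."""
    if has_enough_information(state):
        return "return_answer"
    if has_promising_links(state):
        return "crawl_next_layer"
    return "return_incomplete"
```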
I had a talk with @VinciGit00 a few days ago. Some of the nodes are likely to receive an overhaul to their inner workings, but without any breaking changes to their outer interface. He didn't like the idea of having to add signals right into the base node, at the core of the node system. That's why he rolled back one of @mayurdb's commits.
We then tried to figure out if there could be a way to reimplement deep scraping using the original nodes, without signals nor runtime loops. My initial idea was to use recursion, but the graph engine doesn't seem to handle it well - it would cause stack overflows even with stop conditions in place.
I then considered readapting Mayur's original idea of iterating on a fetch/parse/rag/search_link sequence for each link through a GraphIterator. However, instead of using a runtime loop for the n layers, the graph could be pre-built with iteration in the constructor. That way, the resulting graph would still be a tree, without loops, and wouldn't interfere with the current graph engine. Vinci suggested adding a conditional node for early returns.
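A rough sketch of what "pre-building with iteration in the constructor" could look like, assuming nodes can be represented as simple name sequences (the node names and helper functions are hypothetical, not the actual ScrapeGraphAI node classes):

```python
def build_layer(depth: int, max_depth: int) -> list[str]:
    """Return the node sequence for one layer of the pre-built graph."""
    layer = ["fetch_node", "parse_node", "rag_node", "search_link_node"]
    if depth < max_depth:
        # a conditional node allows an early return before descending
        # into the next layer
        layer.append("conditional_node")
    return layer

def build_deep_search_graph(max_depth: int) -> list[list[str]]:
    """Unroll all n layers at construction time instead of looping at
    runtime, so the resulting graph stays a tree without cycles."""
    return [build_layer(d, max_depth) for d in range(1, max_depth + 1)]
```

The point of this shape is that the graph engine never sees a loop or a signal: the depth bound is baked into the structure itself.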
The last thing we need to figure out before implementing this is what conditions to put in place for early return. Ideally, we could implement two modes for the graph:
- a "cheap mode" that returns as soon as any relevant information is found
- an "accurate mode" that keeps on crawling as long as there are relevant links, even if some relevant information has been found at the current level
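The two modes could share a single stop condition parameterized by mode. This is a minimal sketch assuming a state dict with hypothetical `found_information` and `relevant_links` keys:

```python
def should_stop(state: dict, mode: str = "cheap") -> bool:
    """Decide whether the deep search should return early.

    Both key names are illustrative assumptions, not existing APIs.
    """
    if mode == "cheap":
        # stop as soon as any relevant information is found
        return bool(state["found_information"])
    # accurate mode: keep crawling while relevant links remain,
    # even if some information was already found at this level
    return not state["relevant_links"]
```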
This is not just about deep scraping: if this configuration works, however inelegant it may be in its current form, it would further prove that @PeriniM's modular system is expressive enough for complex scraping operations.
As for the actual implementation, Vinci asked me to have another talk before writing any code. That will take another couple of days to be possible. I'll be glad to have Mayur as a co-author and/or reviewer.
Closing note: the right-side diagram is wrong. It's not a chain of deep scrapers, but of graph iterators running deep scrapers. Between each layer of graph iterators, there are merge nodes both for links and answers, and a conditional node to provide an early return. I feel like there's something off with the left diagram too, but I'd need to look at the original drawings we made on paper.
@f-aguzzi, I exchanged some ideas with @VinciGit00 as well; we should collect all thoughts and devise the most sensible solution, one that doesn't require a major refactoring of the current modular graph execution engine.
You could get the best of both the cheap and accurate modes if you decouple responsibilities.
A very rough pseudo-implementation might look like the following, though we'd have to simplify it to match our nodes:
```mermaid
stateDiagram-v2
    note left of deep_search_graph
        - base_endpoint
        - max_rounds
        - max_subgraphs
        - user_query
    end note
    state deep_search_graph {
        [*] --> plan_node
        note right of plan_node
            - available_information
            - available_endpoints
            - visited_endpoints
            - rounds
            - user_query
            - early_exit := rounds EQ max_rounds
            - missing_information_query
        end note
        state rounds_expired <<choice>>
        state enough_information <<choice>>
        plan_node --> conditional_node_rounds
        conditional_node_rounds --> rounds_expired
        rounds_expired --> conditional_node : if early_exit false
        rounds_expired --> merge_answers_node_final : if early_exit true
        conditional_node --> enough_information
        enough_information --> parallel_search_graph : not enough information
        enough_information --> merge_answers_node_final : enough information
        state parallel_search_graph {
            state fork_state <<fork>>
            state join_state <<join>>
            [*] --> rerank_link_node
            note right of rerank_link_node
                - missing_information_query
                - endpoints
            end note
            rerank_link_node --> graph_iterator_node
            note left of graph_iterator_node
                - k EQ max_subgraphs
                - top_k_endpoints
            end note
            graph_iterator_node --> fork_state
            fork_state --> explore_graph_1
            fork_state --> explore_graph_2
            fork_state --> explore_graph_k
            state explore_graph_1 {
                [*] --> fetch_node
                fetch_node --> parse_node
                parse_node --> RAG_node
                RAG_node --> generate_answers_node
                generate_answers_node --> search_link_node
                search_link_node --> [*]
            }
            explore_graph_1 --> join_state
            explore_graph_2 --> join_state
            explore_graph_k --> join_state
            join_state --> merge_answers_node
            note right of merge_answers_node
                - information
                - endpoints
                FIXME: how'd we handle concat of link candidates?
            end note
            merge_answers_node --> [*]
        }
        parallel_search_graph --> plan_node
        merge_answers_node_final --> [*]
    }
```
The search link node should be modified to return each link with a short description derived directly from the link itself (it should not fetch the page). The rerank link node can be a simple vector database over links and their descriptions for the time being.
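To illustrate the reranking idea, here is a rough sketch of the proposed step as a plain vector lookup over links and their short descriptions. A real implementation would use proper embeddings and a vector store; a bag-of-words cosine similarity stands in here so the sketch runs with the standard library alone, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def _vec(text: str) -> Counter:
    """Toy stand-in for an embedding: a bag-of-words term count."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank_links(query: str, links: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the top_k links whose short descriptions best match the query."""
    q = _vec(query)
    ranked = sorted(links, key=lambda url: _cosine(q, _vec(links[url])), reverse=True)
    return ranked[:top_k]
```

Swapping `_vec`/`_cosine` for real embedding calls later would leave the node's interface untouched, which fits the "no breaking changes to the outer interface" constraint discussed above.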