implement deep search graph
VinciGit00 opened this issue
I'll help you with this as we discussed yesterday. Let's go 🤝
@VinciGit00 the main question is whether each cascading deep search graph should have its own specific config, or whether the config should be propagated down.
@VinciGit00 can you please add some details on which are the things you are planning to do on top of what I have contributed?
Based on my experience while implementing and testing the deep search, these are the areas I feel need a lot of attention:
- Finding out the most relevant links for the search given a large number of links on a webpage
- Accurately deciding if a webpage has any answer related to the question
@mayurdb coping with both your points might require having two cascading conditional nodes in each block:
the first decides whether the information collected so far is enough to answer the initial query. If the answer is negative and the most relevant links have been fetched, the second conditional node should weigh them and decide whether any of them might contribute undiscovered information toward answering the query, or otherwise return with incomplete information.
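To make the two cascading conditions concrete, here is a minimal sketch in Python. All names (`has_enough_information`, `has_promising_links`, the state keys) are illustrative assumptions, not actual ScrapeGraphAI APIs; in practice the first check would likely be an LLM judgment rather than a flag.

```python
def has_enough_information(state: dict) -> bool:
    """First condition: is the information collected so far enough
    to answer the initial query?"""
    # A real implementation would ask an LLM to judge state["answer"]
    # against state["user_query"]; a placeholder flag stands in here.
    return bool(state.get("answer_complete"))

def has_promising_links(state: dict) -> bool:
    """Second condition: among the freshly fetched links, might any
    add undiscovered information?"""
    # A real implementation would weigh relevance scores; here we just
    # check that some candidate link clears a score threshold.
    return any(score > 0.5 for _, score in state.get("ranked_links", []))

def next_step(state: dict) -> str:
    """Cascade the two conditions to pick the next action."""
    if has_enough_information(state):
        return "return_answer"
    if has_promising_links(state):
        return "crawl_next_layer"
    return "return_incomplete"
```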
I had a talk with @VinciGit00 a few days ago. Some of the nodes are likely to receive an overhaul to their inner workings, but without any breaking changes to their outer interface. He didn't like the idea of having to add signals right into the base node, at the core of the node system. That's why he rolled back one of @mayurdb's commits.
We then tried to figure out if there could be a way to reimplement deep scraping using the original nodes, without signals nor runtime loops. My initial idea was to use recursion, but the graph engine doesn't seem to handle it well - it would cause stack overflows even with stop conditions in place.
I then considered readapting Mayur's original idea of iterating on a fetch/parse/rag/search_link sequence for each link through a GraphIterator. However, instead of using a runtime loop for the n layers, the graph could be pre-built with iteration in the constructor. That way, the resulting graph would still be a tree, without loops, and wouldn't interfere with the current graph engine. Vinci suggested adding a conditional node for early returns.
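A rough sketch of what "pre-building with iteration in the constructor" could look like, assuming nodes can be represented as simple name sequences (the node names and helper functions are hypothetical, not the actual ScrapeGraphAI node classes):

```python
def build_layer(depth: int, max_depth: int) -> list[str]:
    """Return the node sequence for one layer of the pre-built graph."""
    layer = ["fetch_node", "parse_node", "rag_node", "search_link_node"]
    if depth < max_depth:
        # a conditional node allows an early return before descending
        # into the next layer
        layer.append("conditional_node")
    return layer

def build_deep_search_graph(max_depth: int) -> list[list[str]]:
    """Unroll all n layers at construction time instead of looping at
    runtime, so the resulting graph stays a tree without cycles."""
    return [build_layer(d, max_depth) for d in range(1, max_depth + 1)]
```

The point of this shape is that the graph engine never sees a loop or a signal: the depth bound is baked into the structure itself.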
The last thing we need to figure out before implementing this is what conditions to put in place for early return. Ideally, we could implement two modes for the graph:
- a "cheap mode" that returns as soon as any relevant information is found
- an "accurate mode" that keeps on crawling as long as there are relevant links, even if some relevant information has been found at the current level
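The two modes could share a single stop condition parameterized by mode. This is a minimal sketch assuming a state dict with hypothetical `found_information` and `relevant_links` keys:

```python
def should_stop(state: dict, mode: str = "cheap") -> bool:
    """Decide whether the deep search should return early.

    Both key names are illustrative assumptions, not existing APIs.
    """
    if mode == "cheap":
        # stop as soon as any relevant information is found
        return bool(state["found_information"])
    # accurate mode: keep crawling while relevant links remain,
    # even if some information was already found at this level
    return not state["relevant_links"]
```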
This is not just about deep scraping: if this configuration works, however inelegant it may be in its current form, it would further prove that @PeriniM's modular system is expressive enough for complex scraping operations.
As for the actual implementation, Vinci asked me to have another talk before writing any code. That will take another couple of days to be possible. I'll be glad to have Mayur as a co-author and/or reviewer.
Closing note: the right-side diagram is wrong. It's not a chain of deep scrapers, but of graph iterators running deep scrapers. Between each layer of graph iterators, there are merge nodes both for links and answers, and a conditional node to provide an early return. I feel like there's something off with the left diagram too, but I'd need to look at the original drawings we made on paper.
@f-aguzzi, I exchanged some ideas with @VinciGit00 as well; we should collect all thoughts and devise the most sensible solution, one that doesn't require a major refactoring of the current modular graph execution engine.
You could get the best of both the cheap and accurate modes if you decouple responsibilities.
A very rough pseudo-implementation might look like the following, though we'd have to simplify it to match our nodes:
```mermaid
stateDiagram-v2
    note left of deep_search_graph
        - base_endpoint
        - max_rounds
        - max_subgraphs
        - user_query
    end note
    state deep_search_graph {
        [*] --> plan_node
        note right of plan_node
            - available_information
            - available_endpoints
            - visited_endpoints
            - rounds
            - user_query
            - early_exit := rounds EQ max_rounds
            - missing_information_query
        end note
        state rounds_expired <<choice>>
        state enough_information <<choice>>
        plan_node --> conditional_node_rounds
        conditional_node_rounds --> rounds_expired
        rounds_expired --> conditional_node : if early_exit false
        rounds_expired --> merge_answers_node_final : if early_exit true
        conditional_node --> enough_information
        enough_information --> parallel_search_graph : not enough information
        enough_information --> merge_answers_node_final : enough information
        state parallel_search_graph {
            state fork_state <<fork>>
            state join_state <<join>>
            [*] --> rerank_link_node
            note right of rerank_link_node
                - missing_information_query
                - endpoints
            end note
            rerank_link_node --> graph_iterator_node
            note left of graph_iterator_node
                - k EQ max_subgraphs
                - top_k_endpoints
            end note
            graph_iterator_node --> fork_state
            fork_state --> explore_graph_1
            fork_state --> explore_graph_2
            fork_state --> explore_graph_k
            state explore_graph_1 {
                [*] --> fetch_node
                fetch_node --> parse_node
                parse_node --> RAG_node
                RAG_node --> generate_answers_node
                generate_answers_node --> search_link_node
                search_link_node --> [*]
            }
            explore_graph_1 --> join_state
            explore_graph_2 --> join_state
            explore_graph_k --> join_state
            join_state --> merge_answers_node
            note right of merge_answers_node
                - information
                - endpoints
                FIXME: how'd we handle concat of link candidates?
            end note
            merge_answers_node --> [*]
        }
        parallel_search_graph --> plan_node
        merge_answers_node_final --> [*]
    }
```
The search link node should be modified to return each link with a short description derived directly from the link itself (it should not fetch the page). The rerank link node can be a simple vector database over links and their descriptions for the time being.
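To illustrate the reranking idea, here is a rough sketch of the proposed step as a plain vector lookup over links and their short descriptions. A real implementation would use proper embeddings and a vector store; a bag-of-words cosine similarity stands in here so the sketch runs with the standard library alone, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def _vec(text: str) -> Counter:
    """Toy stand-in for an embedding: a bag-of-words term count."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank_links(query: str, links: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the top_k links whose short descriptions best match the query."""
    q = _vec(query)
    ranked = sorted(links, key=lambda url: _cosine(q, _vec(links[url])), reverse=True)
    return ranked[:top_k]
```

Swapping `_vec`/`_cosine` for real embedding calls later would leave the node's interface untouched, which fits the "no breaking changes to the outer interface" constraint discussed above.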