(Feature Request) Get the LLM to use the specific SQL from the question-SQL pair in the Chroma database

Question

jankariwo opened this issue 3 months ago

I've been setting up the Chroma database with DDL, SQL, and business documentation.
vn.generate_sql does not do very well at generating SQL when the question contains company-specific terms (which have been added to the documentation).
When I ran similar1 = vn.get_similar_question_sql(question), I got 10 results, with the first one matching what I have in Chroma. So I appended it to the prompt, vn.generate_sql(question + similar1[0]), and the results were better.

Feature request:
If I ask the LLM the SAME question as in a question-SQL pair that I loaded into ChromaDB, is there a way to get the LLM to use the corresponding SQL?
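
The workaround mentioned in this post (appending the top similar pair to the question before generating SQL) could be sketched roughly as follows. This is only an illustration, assuming each entry returned by vn.get_similar_question_sql is a dict with "question" and "sql" keys. The helper name augment_question and the sample data are hypothetical and not part of Vanna.

```python
def augment_question(question, similar_pairs):
    """Append the closest stored question-SQL pair to the question as a hint.

    `similar_pairs` is assumed to be a list of dicts with "question" and
    "sql" keys, ordered from most to least similar.
    """
    if not similar_pairs:
        return question
    top = similar_pairs[0]
    return (
        f"{question}\n\n"
        f"A similar question was: {top['question']}\n"
        f"Its SQL was: {top['sql']}"
    )


# Hypothetical stand-in for the result of vn.get_similar_question_sql(question)
similar1 = [
    {"question": "What is our ARR?", "sql": "SELECT SUM(amount) FROM subscriptions"}
]
prompt = augment_question("What is our ARR?", similar1)
# The augmented text would then be passed on, e.g. vn.generate_sql(prompt)
```

Formatting the pair as text, rather than concatenating the dict directly, keeps the hint readable for the LLM.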

André Pedersen · Answer 1 · Tue Apr 02 2024 23:27:51 GMT+0800 (China Standard Time)

Hello, @jankariwo! :]

I'm not sure I follow. If you train on a question-SQL pair, it will be stored in the Chroma vector DB. If you then ask the same question again, the question-SQL pair should be among the top 10 most similar question-SQL pairs, and it will then be part of the prompt sent for the new SQL completion. It is very likely that the LLM will realize that this question is already part of the context and thus know which SQL query to produce. So I would expect it, more often than not, to yield the same SQL query as output.

Are you not seeing this? Or are you maybe seeing that this does not always seem to be the case, but it works most of the time?

Zain Hoda · Answer 2 · Fri Apr 05 2024 22:43:15 GMT+0800 (China Standard Time)

@andreped I think the feature request is something like:

If the question is exactly the same as one of the trained question-SQL pairs, then instead of using the LLM to generate the SQL, it should bypass the LLM and just return the SQL that was associated with that question.

jankariwo · Answer 3 · Sat Apr 06 2024 02:46:33 GMT+0800 (China Standard Time)

That is the correct interpretation.

jankariwo · Answer 4 · Sat Apr 06 2024 02:50:28 GMT+0800 (China Standard Time)

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result, hence the feature request.

André Pedersen · Answer 5 · Sat Apr 06 2024 03:28:09 GMT+0800 (China Standard Time)

I believe I have seen the same, sadly. So great that you are requesting it!

I like the idea of simply checking whether the question is already part of the vector store and using its stored SQL if there is a perfect match. That should resolve the issue. The check can be performed extremely fast and would also improve performance in these cases.

sen yang · Answer 6 · Tue Apr 09 2024 11:08:00 GMT+0800 (China Standard Time)

Thank you! So, can I understand it this way: 'correct query' stores questions, and each time a search is conducted, it retrieves the question vectors in the 'correct query' dataset. When the similarity exceeds a certain value, it directly returns the answer corresponding to the 'correct query'?

jankariwo · Answer 7 · Wed Apr 10 2024 01:39:18 GMT+0800 (China Standard Time)

Yup. It returns the matched query instead of sending a prompt to the LLM to generate the SQL
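
The similarity-threshold variant discussed here could look something like the sketch below. It is an assumption-laden illustration, not Vanna's implementation: it assumes the retrieval step gives you (question, sql, distance) tuples where a smaller distance means a closer match, and both the function name sql_from_close_match and the default max_distance value are hypothetical.

```python
def sql_from_close_match(matches, max_distance=0.05):
    """Return the stored SQL of the nearest match if it is close enough.

    `matches` is assumed to be a list of (question, sql, distance) tuples,
    where a smaller distance means a more similar question. Returns None
    when nothing is close enough, so the caller can fall back to the LLM.
    """
    if not matches:
        return None
    # Pick the entry with the smallest distance.
    _question, sql, distance = min(matches, key=lambda m: m[2])
    return sql if distance <= max_distance else None


# Hypothetical retrieval results for illustration only
close = [("How many users?", "SELECT COUNT(*) FROM users", 0.01),
         ("How many orders?", "SELECT COUNT(*) FROM orders", 0.40)]
far = [("How many orders?", "SELECT COUNT(*) FROM orders", 0.40)]
```

A threshold on distance is more permissive than an exact string comparison, so the cutoff would need tuning to avoid returning SQL for a question that is merely similar.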

André Pedersen · Answer 8 · Wed Apr 10 2024 02:32:15 GMT+0800 (China Standard Time)

@zainhoda I assume this feature was not part of the last release? Would be of great interest for our applications as well :]

Zain Hoda · Answer 9 · Wed Apr 10 2024 02:43:55 GMT+0800 (China Standard Time)

@andreped if you (or anyone else) would like to take a quick stab at this, I think this is a relatively simple change.

Add something like this:

    # Iterate through each item in the list
    for item in question_sql_list:
        # Check if the current item's question matches the input question
        if item['question'] == question:
            # If a match is found, return the corresponding sql value
            return item['sql']

right after this line:

https://github.com/vanna-ai/vanna/blob/main/src/vanna/base/base.py#L110

and then maybe add a quick test for this
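
For anyone picking this up, the loop above could be tried out in isolation before touching base.py. The following is a minimal sketch, with the hypothetical helper name find_exact_sql and made-up sample pairs. It assumes question_sql_list is a list of dicts with "question" and "sql" keys, as in the snippet.

```python
def find_exact_sql(question, question_sql_list):
    """Return the stored SQL for an exactly matching question, else None."""
    for item in question_sql_list:
        # Only a character-for-character match counts here.
        if item["question"] == question:
            return item["sql"]
    return None


def test_find_exact_sql():
    # Made-up training pairs for illustration only
    pairs = [
        {"question": "How many customers?", "sql": "SELECT COUNT(*) FROM customers"},
        {"question": "Total revenue?", "sql": "SELECT SUM(amount) FROM orders"},
    ]
    assert find_exact_sql("Total revenue?", pairs) == "SELECT SUM(amount) FROM orders"
    # No exact match: the caller would fall back to the LLM.
    assert find_exact_sql("Total revenue", pairs) is None


test_find_exact_sql()
```

Returning None on a miss lets the calling code continue with the normal LLM prompt path.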

jankariwo · Answer 10 · Wed Apr 10 2024 22:23:35 GMT+0800 (China Standard Time)

Thanks @zainhoda. I've implemented it: I added it to my base.py and tested it. I also created my own function in the Vanna class with it to test the outputs. It works perfectly. You just have to look out for trailing whitespace in the question, which can prevent a match. But this is easily fixed with some extra string-cleaning code before checking for a match. Thanks again.
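
The string cleaning mentioned here might look like the sketch below. It is not the code that was actually added to base.py, which is not shown in this thread. The helper names normalize and find_sql_ignoring_whitespace are hypothetical, and the list is assumed to hold dicts with "question" and "sql" keys.

```python
def normalize(question):
    """Strip leading/trailing whitespace and collapse internal runs of it."""
    return " ".join(question.split())


def find_sql_ignoring_whitespace(question, question_sql_list):
    """Return the stored SQL whose question matches after whitespace cleanup."""
    wanted = normalize(question)
    for item in question_sql_list:
        if normalize(item["question"]) == wanted:
            return item["sql"]
    return None


# Made-up stored pair with a stray trailing space
stored = [{"question": "How many customers? ", "sql": "SELECT COUNT(*) FROM customers"}]
result = find_sql_ignoring_whitespace("  How many  customers?", stored)
```

Normalizing both sides of the comparison means it does not matter whether the stray whitespace is in the new question or in the trained one.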

André Pedersen · Answer 11 · Wed Apr 10 2024 22:24:37 GMT+0800 (China Standard Time)

@jankariwo Can you make a PR?