Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

Home Page: https://pandas-ai.com


enforce_privacy does not work?

gDanzel opened this issue · comments

System Info

OS version: win11
Python version: 3.11
The current version of pandasai being used: 2.0.36

πŸ› Describe the bug

The sample data appears in the prompt even when enforce_privacy is set to True.

The code below:

```python
import pandasai.pandas as pd
from pandasai import Agent
from pandasai.helpers import get_openai_callback
from pandasai.llm import OpenAI, GoogleGemini

from data.sample_dataframe import dataframe

llm = OpenAI()

agent = Agent([pd.DataFrame(dataframe)], config={"llm": llm, "enforce_privacy": True, "verbose": True})
with get_openai_callback() as cb:
    response = agent.chat("Get the top 3 GDP countries.")
    print(response)
    print(cb)
```

And in the printed prompt you can see that the dataframe still contains sample data:

2024-05-04 15:08:41 [INFO] Question: Get the top 3 GDP countries.
2024-05-04 15:08:42 [INFO] Running PandasAI with openai LLM...
2024-05-04 15:08:42 [INFO] Prompt ID: 50302077-57f3-482a-a823-64e2be596f5d
2024-05-04 15:08:42 [INFO] Executing Pipeline: GenerateChatPipeline
2024-05-04 15:08:42 [INFO] Executing Step 0: ValidatePipelineInput
2024-05-04 15:08:42 [INFO] Executing Step 1: CacheLookup
2024-05-04 15:08:42 [INFO] Executing Step 2: PromptGeneration
2024-05-04 15:08:46 [INFO] Using prompt: <dataframe>
dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22
</dataframe>




Update this initial code:
```python
# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }
```

### QUERY
Get the top 3 GDP countries.

Variable dfs: list[pd.DataFrame] is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.

Generate python code and return full updated code:
2024-05-04 15:08:46 [INFO] Executing Step 3: CodeGenerator
2024-05-04 15:08:49 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-04 15:08:49 [INFO] Prompt used:

dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22

Update this initial code:

# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }

### QUERY

Get the top 3 GDP countries.

Variable dfs: list[pd.DataFrame] is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.

Generate python code and return full updated code:

2024-05-04 15:08:49 [INFO] Code generated:
```python
# TODO: import the required dependencies
import pandas as pd

# Write code here

top_3_gdp_countries = dfs[0].nlargest(3, 'gdp')

# Declare result var
result = {
    "type": "dataframe",
    "value": top_3_gdp_countries
}
```

2024-05-04 15:08:49 [INFO] Executing Step 4: CachePopulation
2024-05-04 15:08:49 [INFO] Executing Step 5: CodeCleaning
2024-05-04 15:08:49 [INFO]
Code running:
```
top_3_gdp_countries = dfs[0].nlargest(3, 'gdp')
result = {'type': 'dataframe', 'value': top_3_gdp_countries}
```
2024-05-04 15:08:49 [INFO] Executing Step 6: CodeExecution
2024-05-04 15:08:49 [INFO] Executing Step 7: ResultValidation
2024-05-04 15:08:49 [INFO] Answer: {'type': 'dataframe', 'value':          country             gdp  happiness_index
0  United States  19294482071552             6.94
9          China  14631844184064             5.12
8          Japan   4380756541440             5.87}
2024-05-04 15:08:49 [INFO] Executing Step 8: ResultParsing
         country             gdp  happiness_index
0  United States  19294482071552             6.94
9          China  14631844184064             5.12
8          Japan   4380756541440             5.87
Tokens Used: 340
	Prompt Tokens: 270
	Completion Tokens: 70
Total Cost (USD): $ 0.000240

Process finished with exit code 0

Facing the same issue. enforce_privacy was working up to v2.0.28.

I think it's because pandasai/helpers/dataframe_serializer.py -> convert_df_to_csv() doesn't check the enforce_privacy config setting at all, and it doesn't check custom_head either.

It just appends the full dataframe details regardless:

```python
# Add dataframe details
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df.to_csv()}"
```

Until this gets properly fixed, I replaced the above code with:

```python
# TEMP FIX: Do not add dataframe details
df_without_sample_data = pd.DataFrame(columns=df.pandas_df.columns)
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df_without_sample_data.to_csv()}"
```

In contrast, pandasai/helpers/dataframe_serializer.py -> convert_df_to_json() properly checks for enforce_privacy and custom_head.
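For reference, here is a minimal sketch of what a privacy-aware version of the CSV path could look like, mirroring the checks described above for convert_df_to_json(). The attribute names used for the config and custom head (df.config.enforce_privacy, df.custom_head) are assumptions for illustration, not pandasai's confirmed internals:

```python
import pandas as pd

# Hypothetical sketch only; df.config.enforce_privacy and df.custom_head are
# assumed attribute names, not the library's actual internals.
def dataframe_details(df, extras: dict) -> str:
    if getattr(getattr(df, "config", None), "enforce_privacy", False):
        # Privacy enforced: include column headers only, no sample rows
        body = pd.DataFrame(columns=df.pandas_df.columns).to_csv(index=False)
    elif getattr(df, "custom_head", None) is not None:
        # A user-supplied head replaces the real sample rows
        body = df.custom_head.to_csv(index=False)
    else:
        # Default behaviour: include the sample rows as today
        body = df.to_csv()
    return f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{body}"
```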

Related: #1147

After some more digging, it seems you can get enforce_privacy and custom_head to work by forcing the YML/JSON serialization; you just need to specify field descriptions.

If you add field descriptions, convert_df_to_yml() will be used:

```python
# If field descriptions are added always use YML. Other formats don't support field descriptions yet
if self.field_descriptions or self.connector_relations:
    serializer = DataframeSerializerType.YML
```

...and then serialize() dispatches on the serializer type:

```python
def serialize(
    self,
    df: pd.DataFrame,
    extras: dict = None,
    type_: DataframeSerializerType = DataframeSerializerType.YML,
) -> str:
    if type_ == DataframeSerializerType.YML:
        return self.convert_df_to_yml(df, extras)
    elif type_ == DataframeSerializerType.JSON:
        return self.convert_df_to_json_str(df, extras)
    elif type_ == DataframeSerializerType.SQL:
        return self.convert_df_sql_connector_to_str(df, extras)
    else:
        return self.convert_df_to_csv(df, extras)
```

convert_df_to_yml() will serialize the field descriptions in YML and internally use convert_df_to_json() for the rest (respecting enforce_privacy and custom_head).
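Putting the workaround together, here is a minimal sketch of how it could be triggered from the calling side, reusing the reproduction script above. It assumes PandasConnector and its field_descriptions argument behave as described in the pandas-ai v2 docs; the exact constructor signature may differ across 2.0.x releases:

```python
import pandas as pd
from pandasai import Agent
from pandasai.connectors import PandasConnector
from pandasai.llm import OpenAI

from data.sample_dataframe import dataframe

llm = OpenAI()

# Assumption: providing field_descriptions makes pandasai pick the YML
# serializer, which respects enforce_privacy (see the snippets above).
connector = PandasConnector(
    {"original_df": pd.DataFrame(dataframe)},
    field_descriptions={"country": "Country name", "gdp": "GDP in USD"},
)

agent = Agent([connector], config={"llm": llm, "enforce_privacy": True, "verbose": True})
print(agent.chat("Get the top 3 GDP countries."))
```

With field descriptions attached, the logged prompt should then carry the schema and descriptions instead of raw sample rows.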