enforce_privacy does not work?
gDanzel opened this issue · comments
System Info
OS version: win11
Python version: 3.11
The current version of pandasai being used: 2.0.36
🐛 Describe the bug
The sample data appears in the prompt even with enforce_privacy set to True.
The code below:
```python
import pandasai.pandas as pd
from pandasai import Agent
from pandasai.helpers import get_openai_callback
from pandasai.llm import OpenAI, GoogleGemini
from data.sample_dataframe import dataframe

llm = OpenAI()
agent = Agent(
    [pd.DataFrame(dataframe)],
    config={"llm": llm, "enforce_privacy": True, "verbose": True},
)

with get_openai_callback() as cb:
    response = agent.chat("Get the top 3 GDP countries.")
    print(response)
    print(cb)
```
In the printed prompt you can see that the dataframe still includes sample data:
2024-05-04 15:08:41 [INFO] Question: Get the top 3 GDP countries.
2024-05-04 15:08:42 [INFO] Running PandasAI with openai LLM...
2024-05-04 15:08:42 [INFO] Prompt ID: 50302077-57f3-482a-a823-64e2be596f5d
2024-05-04 15:08:42 [INFO] Executing Pipeline: GenerateChatPipeline
2024-05-04 15:08:42 [INFO] Executing Step 0: ValidatePipelineInput
2024-05-04 15:08:42 [INFO] Executing Step 1: CacheLookup
2024-05-04 15:08:42 [INFO] Executing Step 2: PromptGeneration
2024-05-04 15:08:46 [INFO] Using prompt: <dataframe>
dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22
</dataframe>
Update this initial code:
```python
# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }
```
### QUERY
Get the top 3 GDP countries.
Variable dfs: list[pd.DataFrame]
is already declared.
At the end, declare "result" variable as a dictionary of type and value.
If you are asked to plot a chart, use "matplotlib" for charts, save as png.
Generate python code and return full updated code:
2024-05-04 15:08:46 [INFO] Executing Step 3: CodeGenerator
2024-05-04 15:08:49 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-04 15:08:49 [INFO] Prompt used:
dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22
Update this initial code:
```python
# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }
```
QUERY
Get the top 3 GDP countries.
Variable dfs: list[pd.DataFrame]
is already declared.
At the end, declare "result" variable as a dictionary of type and value.
If you are asked to plot a chart, use "matplotlib" for charts, save as png.
Generate python code and return full updated code:
2024-05-04 15:08:49 [INFO] Code generated:
```
# TODO: import the required dependencies
import pandas as pd
# Write code here
top_3_gdp_countries = dfs[0].nlargest(3, 'gdp')
# Declare result var
result = {
"type": "dataframe",
"value": top_3_gdp_countries
}
```
2024-05-04 15:08:49 [INFO] Executing Step 4: CachePopulation
2024-05-04 15:08:49 [INFO] Executing Step 5: CodeCleaning
2024-05-04 15:08:49 [INFO] Code running:
```
top_3_gdp_countries = dfs[0].nlargest(3, 'gdp')
result = {'type': 'dataframe', 'value': top_3_gdp_countries}
```
2024-05-04 15:08:49 [INFO] Executing Step 6: CodeExecution
2024-05-04 15:08:49 [INFO] Executing Step 7: ResultValidation
2024-05-04 15:08:49 [INFO] Answer: {'type': 'dataframe', 'value': country gdp happiness_index
0 United States 19294482071552 6.94
9 China 14631844184064 5.12
8 Japan 4380756541440 5.87}
2024-05-04 15:08:49 [INFO] Executing Step 8: ResultParsing
country gdp happiness_index
0 United States 19294482071552 6.94
9 China 14631844184064 5.12
8 Japan 4380756541440 5.87
Tokens Used: 340
Prompt Tokens: 270
Completion Tokens: 70
Total Cost (USD): $ 0.000240
Process finished with exit code 0
Facing the same issue. enforce_privacy works up until v2.0.28.
I think it's due to how `pandasai/helpers/dataframe_serializer.py` -> `convert_df_to_csv()` doesn't care at all about the `enforce_privacy` config setting: it doesn't check for it, nor does it check `custom_head`. It happily just adds the details:

```python
# Add dataframe details
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df.to_csv()}"
```

Until this gets properly fixed, I replaced the code above with:

```python
# TEMP FIX: Do not add dataframe details
df_without_sample_data = pd.DataFrame(columns=df.pandas_df.columns)
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df_without_sample_data.to_csv()}"
```

In contrast, `pandasai/helpers/dataframe_serializer.py` -> `convert_df_to_json()` properly checks for both `enforce_privacy` and `custom_head`.
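For reference, here is a standalone sketch of what a privacy-aware CSV serializer could look like, mirroring the checks that `convert_df_to_json()` already performs. Note `serialize_df_info` and its signature are hypothetical, not the library's actual API:

```python
import pandas as pd


def serialize_df_info(df: pd.DataFrame, index: int,
                      enforce_privacy: bool = False,
                      custom_head: pd.DataFrame = None) -> str:
    """Hypothetical helper: build the dfs[i] prompt section, honoring privacy."""
    if custom_head is not None:
        # A user-supplied custom head replaces the real sample rows
        head = custom_head
    elif enforce_privacy:
        # Privacy enforced: emit column names only, no data rows
        head = pd.DataFrame(columns=df.columns)
    else:
        head = df.head()
    return (
        f"dfs[{index}]:{len(df)}x{len(df.columns)}\n"
        f"{head.to_csv(index=False)}"
    )
```

With `enforce_privacy=True`, the prompt section keeps the shape and column names (which the LLM needs to write code) but drops the cell values.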
Related: #1147
After some more digging, it seems you can get `enforce_privacy` and `custom_head` to work by forcing the YML/JSON serialization: you just need to specify field descriptions. If field descriptions are added, `convert_df_to_yml()` will be used...
```python
# If field descriptions are added always use YML. Other formats don't support field descriptions yet
if self.field_descriptions or self.connector_relations:
    serializer = DataframeSerializerType.YML
```
...and then...

```python
def serialize(
    self,
    df: pd.DataFrame,
    extras: dict = None,
    type_: DataframeSerializerType = DataframeSerializerType.YML,
) -> str:
    if type_ == DataframeSerializerType.YML:
        return self.convert_df_to_yml(df, extras)
    elif type_ == DataframeSerializerType.JSON:
        return self.convert_df_to_json_str(df, extras)
    elif type_ == DataframeSerializerType.SQL:
        return self.convert_df_sql_connector_to_str(df, extras)
    else:
        return self.convert_df_to_csv(df, extras)
```
`convert_df_to_yml()` will serialize the field descriptions in YML and internally uses `convert_df_to_json()` to do the rest (respecting `enforce_privacy` and `custom_head`).
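To illustrate the dispatch described above, here is a minimal self-contained mimic of the serializer selection. The enum values follow the snippets quoted earlier, but `choose_serializer` itself is an illustrative helper, not the library's code:

```python
from enum import Enum, auto


class DataframeSerializerType(Enum):
    CSV = auto()
    YML = auto()
    JSON = auto()
    SQL = auto()


def choose_serializer(field_descriptions=None, connector_relations=None,
                      configured=DataframeSerializerType.CSV):
    # Field descriptions (or connector relations) force YML serialization,
    # which is the code path that respects enforce_privacy and custom_head.
    if field_descriptions or connector_relations:
        return DataframeSerializerType.YML
    return configured
```

In other words, the workaround works because adding any field description reroutes serialization away from the CSV path that ignores the privacy settings.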