Programmatic verification of a data contract on data stored in a PySpark DataFrame not working
shrivatsashetty opened this issue · comments
I'm trying to implement programmatic checks on data using a Soda data contract. I have a table in the form of a PySpark DataFrame. I have defined a data contract with some data quality checks, and when I run the contract over the data, specifying the path to the contract file (or the contract string), the Spark session object, and the name of the temp view, I get the following error.
Can anyone please help us resolve this? We are working on an important project and depend heavily on Soda's data contracts; if this error is resolved, we can complete the project successfully.
```
[17:54:18] Query execution error in spark_ds.emp_duplicate.schema[emp_duplicate]:
DESCRIBE emp_duplicate
[17:54:18] Error occurred while executing scan.
  | 'SparkDfCursor' object has no attribute 'fetchmany'
```
I think Soda is first performing a schema check, which it does by default for a data contract even when it isn't mentioned explicitly. Under the hood it's trying to execute a SQL query to check the schema (`DESC spark_df_temp_view_name`), but it's not able to fetch the results of the query, as indicated by the error message `'SparkDfCursor' object has no attribute 'fetchmany'`.
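For context, `fetchmany` is part of the Python DB-API 2.0 cursor interface, so the error suggests the cursor class backing the Spark DataFrame data source implements only part of that interface. Purely as an illustration of what's missing (this is not Soda code, and the class name here is made up), a `fetchmany` can in principle be expressed through repeated `fetchone` calls, as the DB-API spec allows:

```python
class FetchmanyAdapter:
    """Illustrative wrapper for a DB-API-style cursor that lacks fetchmany().

    Delegates everything to the wrapped cursor and fills in fetchmany()
    using repeated fetchone() calls, per the DB-API 2.0 specification.
    """

    def __init__(self, cursor, arraysize=1):
        self._cursor = cursor
        self.arraysize = arraysize  # DB-API default batch size

    def fetchmany(self, size=None):
        size = self.arraysize if size is None else size
        rows = []
        for _ in range(size):
            row = self._cursor.fetchone()
            if row is None:  # result set exhausted
                break
            rows.append(row)
        return rows

    def __getattr__(self, name):
        # Delegate execute(), fetchone(), description, ... to the real cursor.
        return getattr(self._cursor, name)
```

This only sketches the DB-API contract; the real fix would belong inside `soda-core-spark-df` itself rather than in user code.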
So, I tried to skip the schema checks as follows:
```python
contract = contract_verification.contracts[0]
for check in contract.checks:
    print(check.type)
    if check.type == "schema":
        check.skip = True
```
And here's a preview of my contract file `emp_duplicates.yaml`:
```yaml
dataset: emp_duplicate
columns:
  - name: EmployeeID
    checks:
      - type: no_duplicate_values
      - type: no_missing_values
  - name: "First Name"
    checks:
      - type: no_missing_values
        name: "every employee must have a first name"
      - type: no_invalid_values
        valid_min_length: 1
        name: "Employees First name should not be blank"
```
Here's what my Python code looks like (I'm testing it in a Jupyter notebook):
```python
import os

# creating a Spark temporary view (in-memory SQL table) from the Spark DataFrame
emp_duplicate_spark_df.createOrReplaceTempView("emp_duplicate")

# import the Soda data contract programmatic API
from soda.contracts.contract_verification import ContractVerification, ContractVerificationResult

# checking for any syntax errors in the contract file; this prints "All Good, No error"
contract_verification: ContractVerification = (
    ContractVerification.builder()
    .with_contract_yaml_file(f"{os.getenv('PLAYGROUND')}\\soda\\local_files\\data_contracts\\employees_duplicate.yaml")
    .build()
)
if contract_verification.logs.has_errors():
    print(f"The contract has syntax or semantic errors: \n{contract_verification.logs}")
else:
    print("All Good, No error")

# providing the path to the contract file, the PySpark temp view name, and the Spark session
contract_verification: ContractVerification = (
    ContractVerification.builder()
    .with_contract_yaml_file(contract_yaml_file_path=f"{os.getenv('PLAYGROUND')}\\soda\\local_files\\data_contracts\\employees_duplicate.yaml")
    .with_warehouse_spark_session(spark_session=spark_session, warehouse_name="emp_duplicate")
    .build()
)

# skipping the schema checks
contract = contract_verification.contracts[0]
for check in contract.checks:
    print(check.type)
    if check.type == "schema":
        check.skip = True

# executing the contract
contract_verification_result: ContractVerificationResult = contract_verification.execute()
```
After doing all this, this is the error I get:
```
[18:25:30] Query execution error in spark_ds.EmployeeID.failed_rows[duplicate_count]:
WITH frequencies AS (
    SELECT EmployeeID
    FROM emp_duplicate
    WHERE EmployeeID IS NOT NULL
    GROUP BY EmployeeID
    HAVING COUNT(*) > 1)
SELECT main.*
FROM emp_duplicate main
JOIN frequencies ON main.EmployeeID = frequencies.EmployeeID
LIMIT 100
[18:25:30] Error occurred while executing scan.
  | 'SparkDfCursor' object has no attribute 'fetchmany'
```
As you can see, it's the same error: `'SparkDfCursor' object has no attribute 'fetchmany'`.
Am I doing something wrong here?
Can you try re-creating your virtual environment, and then compare your `pip list` output against this list?
(.venv) [soda-core] pip list
Package Version Editable project location
---------------------------------------- ----------- -----------------------------------------
altair 5.3.0
annotated-types 0.7.0
antlr4-python3-runtime 4.11.1
anyio 4.4.0
asn1crypto 1.5.1
attrs 23.2.0
backoff 2.2.1
black 22.6.0
bleach 6.1.0
blinker 1.8.2
boto3 1.34.123
botocore 1.34.123
build 1.2.1
cachetools 5.3.3
certifi 2024.2.2
cffi 1.16.0
cfgv 3.4.0
charset-normalizer 3.3.2
cli-ui 0.17.2
click 8.1.7
cloudpickle 3.0.0
cmdstanpy 1.2.3
colorama 0.4.6
contourpy 1.2.1
coverage 7.5.1
cryptography 42.0.8
cycler 0.12.1
Cython 3.0.10
dask 2023.7.1
dask-sql 2023.8.0
databricks-sql-connector 3.1.2
Deprecated 1.2.14
distlib 0.3.8
distributed 2023.7.1
dnspython 2.6.1
docker 6.1.3
docopt 0.6.2
docutils 0.20.1
duckdb 1.0.0
email_validator 2.1.1
et-xmlfile 1.1.0
exceptiongroup 1.2.1
Faker 13.16.0
fastapi 0.111.0
fastapi-cli 0.0.4
filelock 3.14.0
fonttools 4.53.0
fsspec 2024.6.0
future 1.0.0
gitdb 4.0.11
GitPython 3.1.43
google-api-core 2.19.0
google-auth 2.30.0
google-cloud-bigquery 3.24.0
google-cloud-core 2.4.1
google-crc32c 1.5.0
google-resumable-media 2.7.1
googleapis-common-protos 1.63.1
grpcio 1.64.1
grpcio-status 1.62.2
h11 0.14.0
holidays 0.50
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
ibm_db 3.2.3
identify 2.5.36
idna 3.7
importlib-metadata 6.11.0
importlib_resources 6.4.0
inflect 7.2.1
inflection 0.5.1
iniconfig 2.0.0
Jinja2 3.1.4
jmespath 1.0.1
jsonschema 4.22.0
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
locket 1.0.0
lz4 4.3.3
markdown-it-py 3.0.0
MarkupSafe 2.1.2
matplotlib 3.9.0
mdurl 0.1.2
more-itertools 10.3.0
msgpack 1.0.8
mypy-extensions 1.0.0
mysql-connector-python 8.0.30
networkx 3.2.1
nodeenv 1.9.1
numpy 1.26.4
oauthlib 3.2.2
openpyxl 3.1.3
opentelemetry-api 1.22.0
opentelemetry-exporter-otlp-proto-common 1.22.0
opentelemetry-exporter-otlp-proto-http 1.22.0
opentelemetry-proto 1.22.0
opentelemetry-sdk 1.22.0
opentelemetry-semantic-conventions 0.43b0
oracledb 1.1.1
orjson 3.10.4
packaging 24.0
pandas 1.5.3
partd 1.4.2
pathspec 0.12.1
pillow 10.3.0
pip 24.0
pip-tools 7.4.1
platformdirs 4.2.2
plotly 5.22.0
pluggy 1.5.0
pre-commit 3.7.1
prompt_toolkit 3.0.47
prophet 1.1.5
proto-plus 1.23.0
protobuf 3.19.6
psutil 5.9.8
psycopg2-binary 2.9.9
pure-sasl 0.6.2
py 1.11.0
py4j 0.10.9.7
pyarrow 14.0.2
pyasn1 0.6.0
pyasn1_modules 0.4.0
pyathena 2.25.2
pyatlan 2.2.4
pycparser 2.22
pycryptodome 3.20.0
pydantic 2.7.3
pydantic_core 2.18.4
pydeck 0.9.1
Pygments 2.18.0
PyHive 0.7.0
PyJWT 2.8.0
pyodbc 5.1.0
pyOpenSSL 24.1.0
pyparsing 3.1.2
pyproject_hooks 1.1.0
pyspark 3.5.1
pytest 7.4.4
pytest-cov 3.0.0
pytest-html 3.2.0
pytest-metadata 3.1.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
readme-renderer 32.0
referencing 0.35.1
requests 2.31.0
rich 13.7.1
rpds-py 0.18.1
rsa 4.9
ruamel.yaml 0.17.40
ruamel.yaml.clib 0.2.8
s3transfer 0.10.1
sasl 0.3.1
schema 0.7.7
scipy 1.13.1
setuptools 69.5.1
shellingham 1.5.4
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
snowflake-connector-python 3.10.1
soda-core 3.3.5 /Users/tom/Code/soda-core/soda/core
soda-core-athena 3.3.5 /Users/tom/Code/soda-core/soda/athena
soda-core-atlan 3.3.5 /Users/tom/Code/soda-core/soda/atlan
soda-core-bigquery 3.3.5 /Users/tom/Code/soda-core/soda/bigquery
soda-core-contracts 3.3.5 /Users/tom/Code/soda-core/soda/contracts
soda-core-db2 3.3.5 /Users/tom/Code/soda-core/soda/db2
soda-core-denodo 3.3.5 /Users/tom/Code/soda-core/soda/denodo
soda-core-dremio 3.3.5 /Users/tom/Code/soda-core/soda/dremio
soda-core-duckdb 3.3.5 /Users/tom/Code/soda-core/soda/duckdb
soda-core-mysql 3.3.5 /Users/tom/Code/soda-core/soda/mysql
soda-core-oracle 3.3.5 /Users/tom/Code/soda-core/soda/oracle
soda-core-pandas-dask 3.3.5 /Users/tom/Code/soda-core/soda/dask
soda-core-postgres 3.3.5 /Users/tom/Code/soda-core/soda/postgres
soda-core-redshift 3.3.5 /Users/tom/Code/soda-core/soda/redshift
soda-core-scientific 3.3.5 /Users/tom/Code/soda-core/soda/scientific
soda-core-snowflake 3.3.5 /Users/tom/Code/soda-core/soda/snowflake
soda-core-spark 3.3.5 /Users/tom/Code/soda-core/soda/spark
soda-core-spark-df 3.3.5 /Users/tom/Code/soda-core/soda/spark_df
soda-core-sqlserver 3.3.5 /Users/tom/Code/soda-core/soda/sqlserver
soda-core-teradata 3.3.5 /Users/tom/Code/soda-core/soda/teradata
soda-core-trino 3.3.5 /Users/tom/Code/soda-core/soda/trino
soda-core-vertica 3.3.5 /Users/tom/Code/soda-core/soda/vertica
sortedcontainers 2.4.0
sqlparse 0.5.0
stanio 0.5.0
starlette 0.37.2
streamlit 1.35.0
tabulate 0.8.10
tblib 3.0.0
tbump 6.11.0
tenacity 8.2.3
teradatasql 20.0.0.12
thrift 0.16.0
thrift-sasl 0.4.3
toml 0.10.2
tomli 2.0.1
tomlkit 0.11.8
toolz 0.12.1
tornado 6.4.1
tox 3.28.0
tox-docker 4.1.0
tqdm 4.66.4
trino 0.328.0
typeguard 4.3.0
typer 0.12.3
typing_extensions 4.11.0
tzdata 2024.1
tzlocal 5.2
ujson 5.10.0
Unidecode 1.3.8
urllib3 1.26.18
uvicorn 0.30.1
uvloop 0.19.0
vertica-python 1.3.8
virtualenv 20.26.2
watchfiles 0.22.0
wcwidth 0.2.13
webencodings 0.5.1
websocket-client 1.8.0
websockets 12.0
wheel 0.43.0
wrapt 1.16.0
zict 3.0.0
zipp 3.19.2
Hello,
Installing version 3.3.5 of soda-core-contracts and soda-core-spark-df with pyspark 3.3.6 worked, and now the checks are getting executed. But I still found the following issues:
I'm trying to read data from a CSV file with the columns `EmployeeID` and `First Name`. This is how my contract file looks:
```yaml
dataset: emp_duplicate
columns:
  - name: EmployeeID
    checks:
      - type: no_duplicate_values
        name: "Column = EmployeeID, Check = No Duplicates"
      - type: no_missing_values
        name: "Column = EmployeeID, Check = No Missing Values"
  - name: First Name
    checks:
      - type: no_missing_values
        name: "Column = First Name, Check = No Missing Values"
      - type: no_invalid_values
        valid_min_length: 4
        name: "Column = First Name, Check = Min Length Check"
      - type: no_invalid_values
        valid_min_length: 1
        name: "Column = First Name, Check = Empty String Check"
```
When I run the checks I get the following errors:
```
error | SodaCL: Invalid check "missing_count(First Name) = 0": mismatched input 'Name' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error | SodaCL: Invalid check "invalid_count(First Name) = 0": mismatched input 'Name' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error | SodaCL: Invalid check "invalid_count(First Name) = 0": mismatched input 'Name' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error | SodaCL: Invalid check "invalid_count(Last Name) = 0": mismatched input 'Name' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error | SodaCL: Invalid check "duplicate_count(Phone No) = 0": mismatched input 'No' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error | SodaCL: Invalid check "missing_count(Phone No) = 0": mismatched input 'No' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error | SodaCL: Invalid check "invalid_count(Phone No) = 0": mismatched input 'No' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
```
I think the issue is the column names having two words separated by whitespace. Syntactically that makes no difference in the YAML file, but when Soda internally executes the SQL query on the column, it hits an error because of the whitespace in the column name. So I tried enclosing all the column names in the YAML contract file in double quotes, as follows:
```yaml
dataset: emp_duplicate
# DQ Checks
# if the column name has two words separated by whitespace,
# make sure to enclose the column name in double quotes
columns:
  - name: "EmployeeID"
    checks:
      - type: no_duplicate_values
        name: "Column = EmployeeID, Check = No Duplicates"
      - type: no_missing_values
        name: "Column = EmployeeID, Check = No Missing Values"
  - name: "First Name"
    checks:
      - type: no_missing_values
        name: "Column = First Name, Check = No Missing Values"
      - type: no_invalid_values
        valid_min_length: 4
        name: "Column = First Name, Check = Min Length Check"
```
And then I encounter the following error:
```
error | SodaCL: Query error: spark_ds."EmployeeID".failed_rows[duplicate_count]:
[PARSE_SYNTAX_ERROR] Syntax error at or near '"EmployeeID"'.(line 10, pos 25)
== SQL ==
WITH frequencies AS (
    SELECT "EmployeeID"
    FROM emp_duplicate
    WHERE "EmployeeID" IS NOT NULL
    GROUP BY "EmployeeID"
    HAVING COUNT(*) > 1)
SELECT main.*
FROM emp_duplicate main
JOIN frequencies ON main."EmployeeID" = frequencies."EmployeeID"
-------------------------^^^
LIMIT 100
```
I think this time it's because the column `EmployeeID` is enclosed in double quotes, which Spark SQL does not accept as identifier quoting (Spark uses backticks for identifiers), as indicated by this line of the error: `Syntax error at or near '"EmployeeID"'.(line 10, pos 25)`.
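To make the quoting difference concrete: Spark SQL delimits identifiers with backticks rather than ANSI double quotes, so a column with a space would have to appear as `` `First Name` `` in the generated SQL. A minimal helper (purely illustrative, not part of Soda) showing Spark-style identifier quoting:

```python
def quote_spark_identifier(name: str) -> str:
    """Quote a column name for Spark SQL.

    Spark SQL delimits identifiers with backticks (not ANSI double
    quotes); a literal backtick inside the name is escaped by doubling.
    """
    return "`" + name.replace("`", "``") + "`"

# A name containing a space must be backtick-quoted for Spark SQL:
print(quote_spark_identifier("First Name"))  # `First Name`
```

A dialect-aware quoting step like this in the SQL generation is presumably what the fix on the Soda side would amount to.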
So what I did was convert all the headings of the CSV to snake case, so that there would be no whitespace in the column headings and no need to quote them with double quotes. Then my contract YAML file looked like this:
```yaml
dataset: emp_duplicate
# DQ Checks
# if the column name has two words separated by whitespace,
# make sure to enclose the column name in double quotes
columns:
  - name: employee_id
    checks:
      - type: no_duplicate_values
        name: "Column = EmployeeID, Check = No Duplicates"
      - type: no_missing_values
        name: "Column = EmployeeID, Check = No Missing Values"
  - name: first_name
    checks:
      - type: no_missing_values
        name: "Column = First Name, Check = No Missing Values"
```
By doing this, the scan ran successfully without throwing any errors:
```
# Contract results for emp_duplicate
9 check failures and 0 execution errors
Check FAILED [Column = EmployeeID, Check = No Duplicates]
Expected duplicate_count(employee_id) = 0
Actual duplicate_count(employee_id) was 3
Check FAILED [Column = First Name, Check = No Missing Values]
Expected missing_count(first_name) = 0
Actual missing_count(first_name) was 1
Check FAILED [Column = First Name, Check = Min Length Check]
Expected invalid_count(first_name) = 0
Actual invalid_count(first_name) was 2
```
So far we have a temporary fix for the problem by renaming the columns to snake case, but it would be great if you could have a look at the issue and fix it.
Thanks a lot 👍
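For reference, the renaming workaround above can be sketched like this. The `to_snake_case` helper is our own illustration; the DataFrame and view names are the ones from this thread:

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a header like 'First Name' or 'EmployeeID' to snake_case."""
    # Insert an underscore at lower/digit-to-upper boundaries (EmployeeID -> Employee_ID),
    # then normalize spaces and hyphens to underscores.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    name = re.sub(r"[\s\-]+", "_", name.strip())
    return name.lower()

print(to_snake_case("First Name"))  # first_name

# Applied to the DataFrame before registering the temp view, e.g.:
# renamed = emp_duplicate_spark_df.toDF(
#     *[to_snake_case(c) for c in emp_duplicate_spark_df.columns]
# )
# renamed.createOrReplaceTempView("emp_duplicate")
```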
Impressive 💪 Thanks for the feedback! I'll put it on the backlog to test for names with spaces and quoting.
Hello Tom,
In our team we are trying to automate the implementation of Soda data contracts and the generation of a report.
Using the Soda data contract Python API, we were able to produce the verification results by following the documentation. But the results we get, as you know, are in the form of text logs like the one below:
```
# Contract results for emp_duplicate
42 check failures and 0 execution errors
Check FAILED [Column = EmployeeID, Check = No Duplicates]
Expected duplicate_count(employee_id) = 0
Actual duplicate_count(employee_id) was 2
Check FAILED [Column = EmployeeID, Check = No Missing Values]
Expected missing_count(employee_id) = 0
Actual missing_count(employee_id) was 1
...
```
What we want is to extract the data contract check statistics as an object that provides the following information:
- the check that was performed on a column
- the column name on which the check was performed
- the total number of rows scanned for the column during execution of the check (or, for a table-level check, the total number of records scanned in the table)
- the total number of records that failed the check
So that the following check result log:
```
Check FAILED [Valid format for Email ID]
Expected invalid_count(email) = 0
Actual invalid_count(email) was 1
```
would look something like this as an object or a Python dictionary:
```python
{
    "column_name": "email_id",
    "check_performed": "no_invalid_values",
    "configuration_key": "valid_format",
    "configuration_key_value": "email",
    "no_of_records_scanned": 16,
    "no_of_failures": 1,
}
```
Now we can convert the above result into a single record in a pandas DataFrame. By capturing all the check results like this, we can build a table containing the results of the execution, which can also act as our report. Something like this:
| column_name | check_performed | configuration_key | configuration_key_value | no_of_records_scanned | no_of_failures |
|---|---|---|---|---|---|
| email_id | no_invalid_values | valid_format | email | 16 | 1 |
| dob | no_invalid_values | valid_format | date eu | 16 | 2 |
But the methods to get the check results in the form mentioned above are not covered in the documentation.
Therefore I'm using my VS Code IDE and searching through the source code for a way to achieve it, so far without success. Can you please help me out? Is there any other source of reference? We are on a tight deadline and have to complete the project. We can even connect over mail or something if you want. My mail ID: shrivatsa307@gmail.com
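In the meantime, since the text log format shown above is regular, one stopgap (our own assumption, not an official Soda API) is to parse the log into records:

```python
import re

# Matches the three-line blocks in the contract result log, e.g.:
#   Check FAILED [Column = EmployeeID, Check = No Duplicates]
#   Expected duplicate_count(employee_id) = 0
#   Actual duplicate_count(employee_id) was 3
CHECK_RE = re.compile(
    r"Check (?P<outcome>\w+) \[(?P<name>[^\]]+)\]\s*"
    r"Expected (?P<metric>\w+)\((?P<column>\w+)\) = (?P<expected>\S+)\s*"
    r"Actual \w+\(\w+\) was (?P<actual>\S+)"
)

def parse_contract_log(log_text: str) -> list:
    """Turn Soda's text log into records suitable for a pandas DataFrame."""
    return [
        {
            "check_name": m["name"],
            "column_name": m["column"],
            "metric": m["metric"],
            "outcome": m["outcome"],
            "expected": m["expected"],
            "no_of_failures": int(m["actual"]),
        }
        for m in CHECK_RE.finditer(log_text)
    ]

log = """Check FAILED [Column = EmployeeID, Check = No Duplicates]
Expected duplicate_count(employee_id) = 0
Actual duplicate_count(employee_id) was 3"""
print(parse_contract_log(log)[0]["no_of_failures"])  # 3
```

Each record can then become a row via `pd.DataFrame(records)`. Note that the per-check row counts (records scanned) do not appear in this log, so a log-parsing approach cannot recover them.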
Hi @shrivatsashetty -- thanks for the detail and the offer to connect. I'm sure Tom will see the above comment as well, but you can also find him on Slack in our Soda Community: https://community.soda.io/slack Cheers!