sodadata / docs

Soda Documentation, served at docs.soda.io

Home Page: https://docs.soda.io


Programmatic Verification of a Data Contract on Data Stored in a PySpark DataFrame Not Working!!

shrivatsashetty opened this issue · comments

#584 #728 #435 #86 #129

I'm trying to implement programmatic checks on data using a Soda data contract. I have a table in the form of a PySpark DataFrame. I have defined a data contract with some data quality checks, and when I run the contract over the data, specifying the path to the contract file (or the contract string), the Spark session, and the name of the temp view, I get the following error.
Can anyone please help us resolve this? We are working on an important project and depend heavily on Soda's data contracts; if this error is resolved, we can complete the project successfully.

[17:54:18] Query execution error in spark_ds.emp_duplicate.schema[emp_duplicate]:
DESCRIBE emp_duplicate

[17:54:18] Error occurred while executing scan.
| 'SparkDfCursor' object has no attribute 'fetchmany'

Here I think Soda is first performing a schema check, which it does by default for a data contract even if we don't mention it explicitly. Under the hood, it's trying to execute a SQL query to check the schema (DESC spark_df_temp_view_name), but it's not able to fetch the results of the query, as indicated by the error message 'SparkDfCursor' object has no attribute 'fetchmany'.
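For context, `fetchmany` is part of the Python DB-API 2.0 cursor interface that database clients like Soda's scan engine typically rely on. The sketch below is purely illustrative (it is not Soda's actual `SparkDfCursor` code) and just shows what a cursor that satisfies the interface looks like, and why a cursor implementing only `fetchone` would trigger exactly this `AttributeError`:

```python
# Illustrative only: the DB-API 2.0 cursor contract that a scan engine
# may call into. A cursor that only implements fetchone/fetchall raises
# AttributeError when the caller invokes cursor.fetchmany().
class CursorWithFetchmany:
    def __init__(self, rows):
        self._rows = list(rows)
        self.arraysize = 1  # DB-API default batch size for fetchmany()

    def fetchone(self):
        # Return the next row, or None when the result set is exhausted.
        return self._rows.pop(0) if self._rows else None

    def fetchmany(self, size=None):
        # Return up to `size` rows (defaulting to self.arraysize).
        size = size or self.arraysize
        batch = []
        for _ in range(size):
            row = self.fetchone()
            if row is None:
                break
            batch.append(row)
        return batch
```
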

So, I tried to skip the schema checks as follows:

contract = contract_verification.contracts[0]
for check in contract.checks:
    print(check.type)
    if check.type == "schema":
        check.skip = True

And here's a preview of my contracts file emp_duplicates.yaml

dataset: emp_duplicate

columns:
  - name: EmployeeID
    checks:
      - type: no_duplicate_values
      - type: no_missing_values

  - name: "First Name"
    checks:
      - type: no_missing_values
        name: "every employee must have a first name"

      - type: no_invalid_values
        valid_min_length: 1
        name: "Employees First name should not be blank"

Here's what my Python code looks like; I'm testing it in a Jupyter notebook:

# creating a spark temporary view object (In memory SQL table) from spark data frame
emp_duplicate_spark_df.createOrReplaceTempView("emp_duplicate")


# import soda data contract programming API
from soda.contracts.contract_verification import ContractVerification, ContractVerificationResult


# checking for any syntax errors in the contract file; this prints "All Good, No error"
contract_verification: ContractVerification = (
  ContractVerification.builder()
  .with_contract_yaml_file( f"{os.getenv('PLAYGROUND')}\\soda\\local_files\\data_contracts\\employees_duplicate.yaml")
  .build()
)

if contract_verification.logs.has_errors():
  print(f"The contract has syntax or semantic errors: \n{contract_verification.logs}")
else:
  print("All Good, No error")


# providing path to contract file, pyspark table view, spark context
contract_verification: ContractVerification = (
    ContractVerification.builder()
    .with_contract_yaml_file(contract_yaml_file_path=f"{os.getenv('PLAYGROUND')}\\soda\\local_files\\data_contracts\\employees_duplicate.yaml")
    .with_warehouse_spark_session(spark_session=spark_session, warehouse_name="emp_duplicate")
    .build()
)


# skipping the schema checks
contract = contract_verification.contracts[0]
for check in contract.checks:
    print(check.type)
    if check.type == "schema":
        check.skip = True


# executing the contract
contract_verification_result: ContractVerificationResult = contract_verification.execute()

After doing all this, this is the error I get:

[18:25:30] Query execution error in spark_ds.EmployeeID.failed_rows[duplicate_count]: 

WITH frequencies AS (
    SELECT EmployeeID
    FROM emp_duplicate
    WHERE EmployeeID IS NOT NULL
    GROUP BY EmployeeID
    HAVING COUNT(*) > 1)
SELECT main.*
FROM emp_duplicate main
JOIN frequencies ON main.EmployeeID = frequencies.EmployeeID

LIMIT 100

[18:25:30] Error occurred while executing scan.
  | 'SparkDfCursor' object has no attribute 'fetchmany'

As you can see, it's the same error: 'SparkDfCursor' object has no attribute 'fetchmany'.
Am I doing something wrong here?

Can you try re-creating your virtual environment?

And then compare your pip list against this list?

(.venv) [soda-core] pip list
Package                                  Version     Editable project location
---------------------------------------- ----------- -----------------------------------------
altair                                   5.3.0
annotated-types                          0.7.0
antlr4-python3-runtime                   4.11.1
anyio                                    4.4.0
asn1crypto                               1.5.1
attrs                                    23.2.0
backoff                                  2.2.1
black                                    22.6.0
bleach                                   6.1.0
blinker                                  1.8.2
boto3                                    1.34.123
botocore                                 1.34.123
build                                    1.2.1
cachetools                               5.3.3
certifi                                  2024.2.2
cffi                                     1.16.0
cfgv                                     3.4.0
charset-normalizer                       3.3.2
cli-ui                                   0.17.2
click                                    8.1.7
cloudpickle                              3.0.0
cmdstanpy                                1.2.3
colorama                                 0.4.6
contourpy                                1.2.1
coverage                                 7.5.1
cryptography                             42.0.8
cycler                                   0.12.1
Cython                                   3.0.10
dask                                     2023.7.1
dask-sql                                 2023.8.0
databricks-sql-connector                 3.1.2
Deprecated                               1.2.14
distlib                                  0.3.8
distributed                              2023.7.1
dnspython                                2.6.1
docker                                   6.1.3
docopt                                   0.6.2
docutils                                 0.20.1
duckdb                                   1.0.0
email_validator                          2.1.1
et-xmlfile                               1.1.0
exceptiongroup                           1.2.1
Faker                                    13.16.0
fastapi                                  0.111.0
fastapi-cli                              0.0.4
filelock                                 3.14.0
fonttools                                4.53.0
fsspec                                   2024.6.0
future                                   1.0.0
gitdb                                    4.0.11
GitPython                                3.1.43
google-api-core                          2.19.0
google-auth                              2.30.0
google-cloud-bigquery                    3.24.0
google-cloud-core                        2.4.1
google-crc32c                            1.5.0
google-resumable-media                   2.7.1
googleapis-common-protos                 1.63.1
grpcio                                   1.64.1
grpcio-status                            1.62.2
h11                                      0.14.0
holidays                                 0.50
httpcore                                 1.0.5
httptools                                0.6.1
httpx                                    0.27.0
ibm_db                                   3.2.3
identify                                 2.5.36
idna                                     3.7
importlib-metadata                       6.11.0
importlib_resources                      6.4.0
inflect                                  7.2.1
inflection                               0.5.1
iniconfig                                2.0.0
Jinja2                                   3.1.4
jmespath                                 1.0.1
jsonschema                               4.22.0
jsonschema-specifications                2023.12.1
kiwisolver                               1.4.5
locket                                   1.0.0
lz4                                      4.3.3
markdown-it-py                           3.0.0
MarkupSafe                               2.1.2
matplotlib                               3.9.0
mdurl                                    0.1.2
more-itertools                           10.3.0
msgpack                                  1.0.8
mypy-extensions                          1.0.0
mysql-connector-python                   8.0.30
networkx                                 3.2.1
nodeenv                                  1.9.1
numpy                                    1.26.4
oauthlib                                 3.2.2
openpyxl                                 3.1.3
opentelemetry-api                        1.22.0
opentelemetry-exporter-otlp-proto-common 1.22.0
opentelemetry-exporter-otlp-proto-http   1.22.0
opentelemetry-proto                      1.22.0
opentelemetry-sdk                        1.22.0
opentelemetry-semantic-conventions       0.43b0
oracledb                                 1.1.1
orjson                                   3.10.4
packaging                                24.0
pandas                                   1.5.3
partd                                    1.4.2
pathspec                                 0.12.1
pillow                                   10.3.0
pip                                      24.0
pip-tools                                7.4.1
platformdirs                             4.2.2
plotly                                   5.22.0
pluggy                                   1.5.0
pre-commit                               3.7.1
prompt_toolkit                           3.0.47
prophet                                  1.1.5
proto-plus                               1.23.0
protobuf                                 3.19.6
psutil                                   5.9.8
psycopg2-binary                          2.9.9
pure-sasl                                0.6.2
py                                       1.11.0
py4j                                     0.10.9.7
pyarrow                                  14.0.2
pyasn1                                   0.6.0
pyasn1_modules                           0.4.0
pyathena                                 2.25.2
pyatlan                                  2.2.4
pycparser                                2.22
pycryptodome                             3.20.0
pydantic                                 2.7.3
pydantic_core                            2.18.4
pydeck                                   0.9.1
Pygments                                 2.18.0
PyHive                                   0.7.0
PyJWT                                    2.8.0
pyodbc                                   5.1.0
pyOpenSSL                                24.1.0
pyparsing                                3.1.2
pyproject_hooks                          1.1.0
pyspark                                  3.5.1
pytest                                   7.4.4
pytest-cov                               3.0.0
pytest-html                              3.2.0
pytest-metadata                          3.1.1
python-dateutil                          2.9.0.post0
python-dotenv                            1.0.1
python-multipart                         0.0.9
pytz                                     2024.1
PyYAML                                   6.0.1
readme-renderer                          32.0
referencing                              0.35.1
requests                                 2.31.0
rich                                     13.7.1
rpds-py                                  0.18.1
rsa                                      4.9
ruamel.yaml                              0.17.40
ruamel.yaml.clib                         0.2.8
s3transfer                               0.10.1
sasl                                     0.3.1
schema                                   0.7.7
scipy                                    1.13.1
setuptools                               69.5.1
shellingham                              1.5.4
six                                      1.16.0
smmap                                    5.0.1
sniffio                                  1.3.1
snowflake-connector-python               3.10.1
soda-core                                3.3.5       /Users/tom/Code/soda-core/soda/core
soda-core-athena                         3.3.5       /Users/tom/Code/soda-core/soda/athena
soda-core-atlan                          3.3.5       /Users/tom/Code/soda-core/soda/atlan
soda-core-bigquery                       3.3.5       /Users/tom/Code/soda-core/soda/bigquery
soda-core-contracts                      3.3.5       /Users/tom/Code/soda-core/soda/contracts
soda-core-db2                            3.3.5       /Users/tom/Code/soda-core/soda/db2
soda-core-denodo                         3.3.5       /Users/tom/Code/soda-core/soda/denodo
soda-core-dremio                         3.3.5       /Users/tom/Code/soda-core/soda/dremio
soda-core-duckdb                         3.3.5       /Users/tom/Code/soda-core/soda/duckdb
soda-core-mysql                          3.3.5       /Users/tom/Code/soda-core/soda/mysql
soda-core-oracle                         3.3.5       /Users/tom/Code/soda-core/soda/oracle
soda-core-pandas-dask                    3.3.5       /Users/tom/Code/soda-core/soda/dask
soda-core-postgres                       3.3.5       /Users/tom/Code/soda-core/soda/postgres
soda-core-redshift                       3.3.5       /Users/tom/Code/soda-core/soda/redshift
soda-core-scientific                     3.3.5       /Users/tom/Code/soda-core/soda/scientific
soda-core-snowflake                      3.3.5       /Users/tom/Code/soda-core/soda/snowflake
soda-core-spark                          3.3.5       /Users/tom/Code/soda-core/soda/spark
soda-core-spark-df                       3.3.5       /Users/tom/Code/soda-core/soda/spark_df
soda-core-sqlserver                      3.3.5       /Users/tom/Code/soda-core/soda/sqlserver
soda-core-teradata                       3.3.5       /Users/tom/Code/soda-core/soda/teradata
soda-core-trino                          3.3.5       /Users/tom/Code/soda-core/soda/trino
soda-core-vertica                        3.3.5       /Users/tom/Code/soda-core/soda/vertica
sortedcontainers                         2.4.0
sqlparse                                 0.5.0
stanio                                   0.5.0
starlette                                0.37.2
streamlit                                1.35.0
tabulate                                 0.8.10
tblib                                    3.0.0
tbump                                    6.11.0
tenacity                                 8.2.3
teradatasql                              20.0.0.12
thrift                                   0.16.0
thrift-sasl                              0.4.3
toml                                     0.10.2
tomli                                    2.0.1
tomlkit                                  0.11.8
toolz                                    0.12.1
tornado                                  6.4.1
tox                                      3.28.0
tox-docker                               4.1.0
tqdm                                     4.66.4
trino                                    0.328.0
typeguard                                4.3.0
typer                                    0.12.3
typing_extensions                        4.11.0
tzdata                                   2024.1
tzlocal                                  5.2
ujson                                    5.10.0
Unidecode                                1.3.8
urllib3                                  1.26.18
uvicorn                                  0.30.1
uvloop                                   0.19.0
vertica-python                           1.3.8
virtualenv                               20.26.2
watchfiles                               0.22.0
wcwidth                                  0.2.13
webencodings                             0.5.1
websocket-client                         1.8.0
websockets                               12.0
wheel                                    0.43.0
wrapt                                    1.16.0
zict                                     3.0.0
zipp                                     3.19.2

Hello,
Installing version 3.3.5 of soda-core-contracts and soda-core-spark-df together with pyspark 3.3.6 worked, and the checks are now getting executed. But I still found the following issues:

I'm trying to read data from a CSV file with the columns EmployeeID and First Name. This is how my contract file looks:

dataset: emp_duplicate

columns:
  - name: EmployeeID
    checks:
      - type: no_duplicate_values
        name: "Column = EmployeeID, Check = No Duplicates"
      - type: no_missing_values
        name: "Column = EmployeeID, Check = No Missing Values"

  - name: First Name
    checks:
      - type: no_missing_values
        name: "Column = First Name, Check = No Missing Values"

      - type: no_invalid_values
        valid_min_length: 4
        name: "Column = First Name, Check = Min Length Check"

      - type: no_invalid_values
        valid_min_length: 1
        name: "Column = First Name, Check = Empty String Check"

When I run the checks I get the following errors:

error  | SodaCL: Invalid check "missing_count(First Name) = 0": mismatched input 'Name' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error  | SodaCL: Invalid check "invalid_count(First Name) = 0": mismatched input 'Name' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error  | SodaCL: Invalid check "invalid_count(First Name) = 0": mismatched input 'Name' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error  | SodaCL: Invalid check "invalid_count(Last Name) = 0": mismatched input 'Name' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error  | SodaCL: Invalid check "duplicate_count(Phone No) = 0": mismatched input 'No' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error  | SodaCL: Invalid check "missing_count(Phone No) = 0": mismatched input 'No' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
error  | SodaCL: Invalid check "invalid_count(Phone No) = 0": mismatched input 'No' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}

I think the issue is the column name consisting of two words separated by whitespace. Syntactically that makes no difference in the YAML file, but when Soda internally executes the SQL query on the column, it hits an error because of the whitespace in the column name. So I tried to enclose all the column names in the YAML contract file in double quotes, as follows:

dataset: emp_duplicate

# DQ Checks
# if the column name has two words seperated by a whitespace,
# make sure to enclose the column name in double quotes
columns:
  - name: "EmployeeID"
    checks:
      - type: no_duplicate_values
        name: "Column = EmployeeID, Check = No Duplicates"
      - type: no_missing_values
        name: "Column = EmployeeID, Check = No Missing Values"

  - name: "First Name"
    checks:
      - type: no_missing_values
        name: "Column = First Name, Check = No Missing Values"

      - type: no_invalid_values
        valid_min_length: 4
        name: "Column = First Name, Check = Min Length Check"

And then I encounter the following error:

error  | SodaCL: Query error: spark_ds."EmployeeID".failed_rows[duplicate_count]: 
[PARSE_SYNTAX_ERROR] Syntax error at or near '"EmployeeID"'.(line 10, pos 25)

== SQL ==

WITH frequencies AS (
    SELECT "EmployeeID"
    FROM emp_duplicate
    WHERE "EmployeeID" IS NOT NULL
    GROUP BY "EmployeeID"
    HAVING COUNT(*) > 1)
SELECT main.*
FROM emp_duplicate main
JOIN frequencies ON main."EmployeeID" = frequencies."EmployeeID"
-------------------------^^^

LIMIT 100

I think this time it's because the column EmployeeID is enclosed in double quotes, and Spark rejects the double-quoted identifier in the generated SQL, as indicated by this line of the error: Syntax error at or near '"EmployeeID"'. (line 10, pos 25)
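For reference, Spark SQL treats double quotes as string-literal delimiters by default; identifiers containing spaces have to be delimited with backticks instead. A tiny illustrative helper (not part of Soda's API) showing the convention:

```python
def quote_spark_identifier(name):
    """Wrap a column name in backticks for Spark SQL, escaping embedded ones."""
    return "`" + name.replace("`", "``") + "`"

# e.g. building a query by hand against the temp view:
query = f"SELECT {quote_spark_identifier('First Name')} FROM emp_duplicate"
```

So a double-quoted `"EmployeeID"` in the generated query is a syntax error for Spark, while a backtick-quoted `` `First Name` `` would parse.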

So what I did was convert all the headings of the CSV to snake case, so that there would be no whitespace in the column headings and no need to quote them with double quotes. My contract YAML file then looked like this:

dataset: emp_duplicate

# DQ Checks
# if the column name has two words seperated by a whitespace,
# make sure to enclose the column name in double quotes
columns:
  - name: employee_id
    checks:
      - type: no_duplicate_values
        name: "Column = EmployeeID, Check = No Duplicates"
      - type: no_missing_values
        name: "Column = EmployeeID, Check = No Missing Values"

  - name: first_name
    checks:
      - type: no_missing_values
        name: "Column = First Name, Check = No Missing Values"

By doing this, the scan ran successfully without throwing any errors:

# Contract results for emp_duplicate
9 check failures and 0 execution errors
Check FAILED [Column = EmployeeID, Check = No Duplicates]
  Expected duplicate_count(employee_id) = 0
  Actual duplicate_count(employee_id) was 3
Check FAILED [Column = First Name, Check = No Missing Values]
  Expected missing_count(first_name) = 0
  Actual missing_count(first_name) was 1
Check FAILED [Column = First Name, Check = Min Length Check]
  Expected invalid_count(first_name) = 0
  Actual invalid_count(first_name) was 2

So far we have a temporary fix for the problem by renaming the columns to snake case, but it would be great if you could have a look at the issue and fix it.
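The renaming step can also be automated instead of editing the CSV headers by hand. A small sketch (the `to_snake_case` helper is mine, not Soda's; the commented lines show how it would apply to the Spark DataFrame from the snippets above):

```python
import re

def to_snake_case(name):
    """'First Name' -> 'first_name', 'EmployeeID' -> 'employee_id'."""
    name = re.sub(r"[\s\-]+", "_", name.strip())         # spaces/hyphens -> underscores
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # camel-case boundaries
    return name.lower()

# Applying it to the Spark DataFrame before creating the temp view:
# for old_name in emp_duplicate_spark_df.columns:
#     emp_duplicate_spark_df = emp_duplicate_spark_df.withColumnRenamed(
#         old_name, to_snake_case(old_name))
```
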

Thanks a lot 👍

Impressive 💪 Thanks for the feedback! I'll put it on the backlog to test for names with spaces and quoting.

Hello Tom,

In our team we are trying to automate the implementation of Soda data contracts and the generation of reports.
Using the Soda data contract Python API, we were able to produce verification results by following the documentation. But as you know, the results we get are in the form of text logs like the one below:

# Contract results for emp_duplicate
42 check failures and 0 execution errors
Check FAILED [Column = EmployeeID, Check = No Duplicates]
  Expected duplicate_count(employee_id) = 0
  Actual duplicate_count(employee_id) was 2
Check FAILED [Column = EmployeeID, Check = No Missing Values]
  Expected missing_count(employee_id) = 0
  Actual missing_count(employee_id) was 1
.
.
.

What we want is to extract the data contract check statistics as an object that provides the following information:

  1. The Soda contract check that was performed on a column
  2. The column name on which the check was performed
  3. The total number of rows scanned in the column during execution of the check, or, for a table-level check, the total number of records scanned in the table
  4. The total number of records that failed the check

So that the following check result log:

Check FAILED [Valid format for Email ID]
  Expected invalid_count(email) = 0
  Actual invalid_count(email) was 1

would look something like this, as an object or a Python dictionary:

{
    "column_name": "email_id",
    "check_performed": "no_invalid_value",
    "configuration_key": "valid_format",
    "configuration_key_value": "email",
    "no_of_records_scanned": "16",
    "no_of_failures": 1
}

We can then convert the above result into a single record in a pandas DataFrame.
By capturing all the check results like this, we can build a table containing the results of the execution, which can also act as our report.
Something like this:

column_name  check_performed   configuration_key  configuration_key_value  no_of_records_scanned  no_of_failures
email_id     no_invalid_value  valid_format       email                    16                     1
dob          no_invalid_value  valid_format       date eu                  16                     2
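Until a structured results API is documented, one stopgap is parsing the textual log output into records. This sketch assumes the exact "Check FAILED" log format shown above; the function and field names are illustrative, not Soda API names:

```python
import re

# Matches the three-line "Check FAILED" blocks in the textual output.
CHECK_RE = re.compile(
    r"Check FAILED \[(?P<name>[^\]]+)\]\s*"
    r"Expected (?P<metric>\w+)\((?P<column>\w+)\) = (?P<expected>\S+)\s*"
    r"Actual \w+\(\w+\) was (?P<actual>\S+)"
)

def parse_failed_checks(log_text):
    """Turn 'Check FAILED' log blocks into dicts, one per failed check."""
    return [
        {
            "check_name": m.group("name"),
            "metric": m.group("metric"),
            "column_name": m.group("column"),
            "expected": m.group("expected"),
            "no_of_failures": int(m.group("actual")),
        }
        for m in CHECK_RE.finditer(log_text)
    ]

# The resulting list of dicts drops straight into pandas:
# import pandas as pd
# report_df = pd.DataFrame(parse_failed_checks(log_text))
```

This is brittle by nature (it breaks if the log wording changes), so it's a workaround until the check-result objects expose these fields directly.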

But the appropriate methods to get the check results in the form I mentioned above are not documented.
Therefore, I'm using my VS Code IDE and searching through the source code for ways to achieve it; so far, without success. Can you please help me out? Is there any other source of reference? We are on a tight deadline and have to complete the project. We can even connect over mail if you want. My mail ID: shrivatsa307@gmail.com

Hi @shrivatsashetty -- thanks for the detail and the offer to connect. I'm sure Tom will see the above comment as well, but you can also find him on Slack in our Soda Community: https://community.soda.io/slack Cheers!