dqops / dqo

Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, and let DQOps run them daily to detect data quality issues.

Home Page: https://dqops.com/docs/

Able to host in our own GCP Org?

cachatj opened this issue

Looking at what you are offering here & I think it's wonderful. Kudos for the level of detail & knowledge sharing put in across the board - it has not gone unnoticed!

I am Dir of Data Science & MLOps with an org committed to automating & proactively responding to data incidents. We are a GCP, cloud-first group.

My question is, rather than joining DQOps' GCP resources, are we able to set up shop within our own organization? Meaning not just the front end, but everything??

I haven't been able to find anything speaking to that in the resources you have shared - but if that's a possibility I'd love to learn more.

I'm guessing the initial reaction is something like: it's not just one service or tool, it's integrated widely across the board. Compute, messaging, hosting, etc., which would be understandable, certainly.

That said, this is what we do on a daily basis (manage massive data & provide it to customers end-to-end). I'm not worried about having to stand up, maintain & sustain it on our own.

So, could you share a bit about what that would look like? Or perhaps we could jump on a call to do the same.

Appreciate it! 🔥

DQOps has two parts.
The big part that you can host anywhere is the DQOps engine itself, which has the UI, the data quality rules, and an offline data lake with Parquet files. You can run that in your Org any time.

The remaining part, which would be hosted as a commercial SaaS platform, is the data warehouse and the data quality dashboards. We are setting up a small data quality data lake on GCP for each user. That is:

  • a storage bucket for Parquet files - the same files that a local DQOps instance stores in the .data/ folder
  • several external tables attached to the storage bucket, accessing the Parquet files (a sketch of defining such a table follows this list)
  • several BigQuery native tables that are loaded with the content of those files.
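
To make the external-table bullet concrete, here is a minimal Python sketch using the google-cloud-bigquery client. The project, dataset, bucket, and folder names (my-project, dqo_data, my-dqo-bucket, check_results) are assumptions for illustration; the real DQOps SaaS setup may use different names and tables.

```python
from google.cloud import bigquery

# Illustrative names only - project, dataset, bucket and folder are assumptions.
client = bigquery.Client(project="my-project")

# External table definition over Parquet files stored in a GCS bucket.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-dqo-bucket/data/check_results/*.parquet"]

# The .data/ folder uses Hive-compatible partitioning, so let BigQuery
# infer the partition keys from the folder layout.
hive_options = bigquery.HivePartitioningOptions()
hive_options.mode = "AUTO"
hive_options.source_uri_prefix = "gs://my-dqo-bucket/data/check_results"
external_config.hive_partitioning = hive_options

table = bigquery.Table("my-project.dqo_data.check_results")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```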

Besides that, we have a set of dashboards built with Looker Studio. They connect to the customer's data warehouse using our Looker Studio Community Connector.
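
In essence, the connector turns a dashboard viewer's OAuth access token into a BigQuery query against that customer's data quality data warehouse. Below is a rough Python sketch of that idea only - the actual connector is an Apps Script Looker Studio Community Connector, and the table and column names here are made up, not the real DQOps schema.

```python
from google.cloud import bigquery
from google.oauth2.credentials import Credentials


def query_quality_warehouse(access_token: str, project: str, dataset: str) -> list:
    """Query the data quality warehouse on behalf of a dashboard viewer.

    The access token would normally come from Looker Studio; the table and
    column names below are illustrative assumptions.
    """
    credentials = Credentials(token=access_token)
    client = bigquery.Client(project=project, credentials=credentials)

    sql = f"""
        SELECT connection_name, check_name, severity, executed_at
        FROM `{project}.{dataset}.check_results`
        ORDER BY executed_at DESC
        LIMIT 100
    """
    return list(client.query(sql).result())
```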

Given that, there are four ways to host DQOps in your environment:

  1. Hybrid mode
    You host DQOps inside your organization, and it queries your data directly.
    However, the data quality data warehouse is hosted by DQOps as a SaaS solution. Your DQOps instance just incrementally uploads Parquet files, the data warehouse is refreshed, and you use our Looker Studio Community Connector to see the dashboards.
    That is the easiest option, and it is possible right now.

  2. Fully hosted in your Org
    That would be possible, but it will take some work. We would need to set up a copy of our SaaS backend in your environment. It is doable, but something else is more complex: each dashboard connects to the data quality data warehouse using our Looker Studio Community Connector. You would need to set up your own connector that can access your BigQuery dataset instead of the data warehouse that we are hosting. The connector has 185 lines of code; it just passes the Access Token and forms a query to BigQuery, so it is not a big deal.
    Next, you would have to update all dashboards that we release and switch their data source to your connector. Keep in mind that from then on, you will have to migrate your copy of the dashboards every time we release a new version.

  3. Compute and data hosted in your Org, but DQOps SaaS issues access tokens for Looker Studio
    That would be somewhere between option 1 and option 2. You just grant the DQOps SaaS service account rights to the files in your copy of the data warehouse. DQOps SaaS would then issue access tokens for the Looker Studio Community Connector, allowing you to use the most recent dashboards.

  4. Fully hosted in your Org, using your own dashboards
    In this scenario, you just synchronize the whole .data/ folder to your storage bucket using rsync (a sketch follows this list). You create your own external tables over the files in the .data/ folder. The files in the .data/ folder are structured as tables using a Hive-compatible partitioning format.
    Having a fully local instance and your own data warehouse, you would have to build your own dashboards and replace the URLs in the dashboardslist.dqodashboards.yaml file.
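
For option 4, the synchronization step itself can be as simple as a gsutil rsync of the local .data/ folder to your bucket. A minimal sketch, assuming a local DQOps user home at /home/me/dqops and a made-up bucket name:

```python
import subprocess

# Illustrative path and bucket name - adjust to your DQOps user home and GCS bucket.
# gsutil -m parallelizes the transfer; rsync -r recurses into the Hive-partitioned
# table folders under .data/.
subprocess.run(
    ["gsutil", "-m", "rsync", "-r", "/home/me/dqops/.data", "gs://my-dqo-bucket/data"],
    check=True,
)
```

The external tables you then create over that bucket would follow the same pattern as the external-table sketch shown earlier, one table per folder under .data/.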