guiaech / flattener


README

Google Analytics 4 Flattener. A Google Cloud Platform (GCP) solution that unnests (flattens) Google Analytics 4 data stored in BigQuery.
The GCP resources for the solution are installed via Deployment Manager.


Background

  • Google Analytics 4 (GA4) is the newest generation of Google Analytics.
  • When you export data from GA4 into BigQuery, the schema will have nested and repeated fields.
    • Nested data represents a struct. In BigQuery, it can be described as a column inside of a column.
    • Repeated data represents an array. In BigQuery, it can be described as a row inside of a row.
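To picture these in Python terms (an illustrative sketch with made-up field values, not the real GA4 export schema): a nested field behaves like a dict inside a row, and a repeated field like a list of dicts.

```python
# A made-up GA4-style event row: "device" is a nested field (a struct),
# "event_params" is a repeated field (an array of structs).
row = {
    "event_name": "page_view",
    "device": {"category": "mobile", "operating_system": "Android"},
    "event_params": [
        {"key": "page_title", "value": {"string_value": "Home"}},
        {"key": "engaged_session_event", "value": {"int_value": 1}},
    ],
}

print(row["device"]["category"])  # struct: a column inside a column -> mobile
print(len(row["event_params"]))   # array: rows inside a row -> 2
```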

Problem

  • Schema with nested and repeated data:
    • Can make it harder to query the data (it takes a learning curve to get used to it).
    • May be incompatible with other data systems. It may be impossible to import the data into another database system or into a spreadsheet.

Purpose

  • The purpose of GA4 Flattener is to solve the above problems by flattening/unnesting GA4 data.
    • Flattening/unnesting means converting all nested and repeated fields to regular fields (i.e., not a struct or an array).
    • As a result, our data becomes simpler and more easily compatible with other systems.
    • To imagine flat data, picture a Google Spreadsheet / Microsoft Excel table.
  • GA4 Flattener keeps your original GA4 data with nested and repeated fields, but it also writes 4 flat tables into the same BigQuery dataset.
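As an illustrative sketch of what unnesting means (the function and field names here are hypothetical, not the flattener's actual schema or code): each element of a repeated field becomes its own flat row, repeating the parent columns.

```python
def flatten_event_params(row):
    """Turn one nested event row into flat rows, one per event_params element."""
    flat_rows = []
    for param in row.get("event_params", []):
        flat_rows.append({
            "event_name": row["event_name"],          # parent column, repeated per row
            "param_key": param["key"],
            "param_value": next(iter(param["value"].values())),
        })
    return flat_rows

nested = {
    "event_name": "page_view",
    "event_params": [
        {"key": "page_title", "value": {"string_value": "Home"}},
        {"key": "ga_session_id", "value": {"int_value": 42}},
    ],
}
flat = flatten_event_params(nested)
# Two flat rows, one per array element -- each could be a spreadsheet line.
```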

Local dependencies

Prerequisites

  1. Browse to https://console.cloud.google.com to create a GCP project, or use an existing project that has Google Analytics data flowing into it. Referred to as [PROJECT_ID].

  2. Grant the installing user (you most likely) the pre-defined IAM role of "Owner".

  3. As the installing user for [PROJECT_ID], enable the following APIs:

    • Cloud Build API
    • Cloud Deployment Manager V2 API
    • Cloud Functions API
    • Identity and Access Management (IAM) API
  4. As the installing user for [PROJECT_ID], grant the following pre-defined IAM roles to [PROJECT_NUMBER]@cloudservices.gserviceaccount.com (a built-in service account); otherwise, deployment will fail with permission errors. See https://cloud.google.com/deployment-manager/docs/access-control for a detailed explanation.

    • Logs Configuration Writer
    • Cloud Functions Developer
    • Pub/Sub Admin
  5. As the installing user for [PROJECT_ID], create a bucket or use an existing bucket for staging code, for example: [PROJECT_NUMBER]-function-code-staging. Referred to as [BUCKET_NAME].

If your GCP project is brand new, you might not have [PROJECT_NUMBER]@cloudservices.gserviceaccount.com yet. To fix this, enable the Compute Engine API and then disable it. The service account [PROJECT_NUMBER]@cloudservices.gserviceaccount.com will then appear in your GCP project under IAM.

  6. Clone this GitHub repo, or download the source code from the releases section to your local machine or cloud shell.

  7. Edit the ga_flattener.yaml and ga_flattener_colon.yaml files, specifically every occurrence of the codeBucket value under properties. Set the value to [BUCKET_NAME] (see step 5 above).

Backfilling prerequisites

The following steps are only required if you plan to backfill historical tables.

  1. Install Python 3.7 or higher

  2. From Mac Terminal or Windows Command Prompt, upgrade pip:

    Mac:

    python3 -m pip install --upgrade pip

    Windows:

    py -m pip install --upgrade pip

  3. Navigate to the root directory of the source code that was downloaded or cloned in step 6 above.

  4. From a command prompt, install python virtual environments:

    Mac:

    python3 -m pip install --user virtualenv

    Windows:

    py -m pip install --user virtualenv

  5. Create a virtual environment for the source code in step 6:

    Mac:

    python3 -m venv venv_ga_flattener

    Windows:

    py -m venv venv_ga_flattener

  6. Activate the virtual environment in the step above:

    Mac:

    source venv_ga_flattener/bin/activate

    Windows:

    .\venv_ga_flattener\Scripts\activate

  7. Install the python dependent packages into the virtual environment:

    Mac:

    pip install -r cf/requirements.txt

    Windows:

    pip install -r cf/requirements.txt
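As a quick sanity check before going further, you can verify that your interpreter meets the Python 3.7 requirement from step 1 (a minimal sketch):

```python
import sys

# The backfill tooling requires Python 3.7 or higher.
if sys.version_info < (3, 7):
    raise SystemExit("Python 3.7+ is required for the backfill tooling.")
print(f"OK: running Python {sys.version_info.major}.{sys.version_info.minor}")
```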

Installation steps

  1. Execute command in Google Cloud SDK Shell: gcloud config set project [PROJECT_ID]

  2. Execute command: gcloud config set account username@domain.com. Note - This must be the installing user from above prerequisites.

  3. Navigate (locally) to root directory of this repository

  4. If [PROJECT_ID] does NOT contain a colon (:) execute command:

    gcloud deployment-manager deployments create [deployment_name] --config ga_flattener.yaml

    otherwise follow these steps:

    1. execute command:

      gcloud deployment-manager deployments create [deployment_name] --config ga_flattener_colon.yaml

    2. Trigger the function (with a blank message) named [deployment_name]-cfconfigbuilderps. It will create the necessary configuration file in the application's Google Cloud Storage bucket. An easy way to do this is to browse to https://console.cloud.google.com/functions, click the cloud function named [deployment_name]-cfconfigbuilderps, go to the testing section, and click "TEST THIS FUNCTION".

    Deployment naming conventions

    • Deployment name must be a match of regex '[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?'
    • Deployment naming conventions are as follows:
      • Starts with a lowercase letter
      • Can contain lowercase letters, numbers and hyphens
      • Ends with a lowercase letter or number
      • Up to 63 characters long (a beginning lowercase letter + up to 61 characters (hyphens, lowercase letters and numbers) + an ending lowercase letter or number)
    • Note that a deployment name cannot contain underscores, but it can contain hyphens.
    • An example of a valid name is in this command: gcloud deployment-manager deployments create ga-flattener-deployment --config ga_flattener.yaml
    • Please refer to the documentation for more examples of valid deployment names.
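The naming rules above can be checked against the documented regex before deploying; a minimal Python sketch (the helper function name is ours, not part of the solution):

```python
import re

# The Deployment Manager name pattern quoted in the conventions above.
DEPLOYMENT_NAME_RE = re.compile(r"[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?")

def is_valid_deployment_name(name):
    """Full-match check: lowercase start, hyphens/digits allowed inside, <= 63 chars."""
    return DEPLOYMENT_NAME_RE.fullmatch(name) is not None

print(is_valid_deployment_name("ga-flattener-deployment"))  # True
print(is_valid_deployment_name("ga_flattener_deployment"))  # False: underscores
print(is_valid_deployment_name("1-bad-start"))              # False: starts with a digit
```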

Verification steps

Config file

  1. After installation, a configuration file named config_datasets.json exists in gs://[deployment_name]-[PROJECT_NUMBER]-adswerve-ga-flat-config/ (a Cloud Storage bucket within [PROJECT_ID]).
  • This file contains all the datasets that have events_yyyymmdd tables and which tables to unnest.

  • This configuration is required for this GA4 flattener solution to run daily or to backfill historical data.

  • Edit this file accordingly to include or exclude certain datasets or tables to unnest.

  • You can also enable intraday flattening via this config file and specify its frequency in hours or minutes.

Enabling intraday flattening in the config file

  • In addition to daily tables events_yyyymmdd, you may also have the table events_intraday_yyyymmdd, which refreshes every few minutes.

  • By default, the flattener does not flatten the intraday table.

  • You can enable intraday flattening by editing the config file and supplying "intraday_schedule": {"frequency": "your_frequency", "units": "your_units"}.

    • "your_units" can be "hours" or "minutes"
    • "your_frequency" is an integer.
    • if your units are minutes, then the frequency should be between 1 and 59.
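A hypothetical validator for these rules (an illustration, not code from this repo) might look like:

```python
def validate_intraday_schedule(schedule):
    """Check an "intraday_schedule" config entry against the rules above.

    A null frequency means intraday flattening stays disabled (the default).
    """
    frequency = schedule.get("frequency")
    units = schedule.get("units")
    if frequency is None:
        return True  # disabled, the default
    if units == "minutes":
        return isinstance(frequency, int) and 1 <= frequency <= 59
    if units == "hours":
        return isinstance(frequency, int) and frequency >= 1
    return False

print(validate_intraday_schedule({"frequency": 15, "units": "minutes"}))  # True
print(validate_intraday_schedule({"frequency": 60, "units": "minutes"}))  # False
```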

Enabling partitioned output in the config file

  • By default, the flattener produces flat sharded tables, because the original GA tables are also sharded.

  • You can enable the following in the config file:

    A) sharded output

    B) partitioned output

    C) both sharded and partitioned output

  • You configure output type by changing the following part of the config file: "output": {"sharded": true, "partitioned": false}.
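To illustrate the difference (a hypothetical sketch; the flattener's real naming logic may differ): sharded output writes one date-suffixed table per day, while partitioned output writes to a single date-partitioned table.

```python
def flat_table_names(base, shard_date, output):
    """Return the flat table name(s) an "output" config block would produce."""
    names = []
    if output.get("sharded"):
        names.append(f"{base}_{shard_date}")  # one table per day, e.g. flat_events_20240101
    if output.get("partitioned"):
        names.append(base)                    # a single date-partitioned table
    return names

print(flat_table_names("flat_events", "20240101",
                       {"sharded": True, "partitioned": False}))
# ['flat_events_20240101']
```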

Examples of a config file

Example 1 - default config

{
  "analytics_123456789": {
    "tables_to_flatten": [
      "events",
      "event_params",
      "user_properties",
      "items"
    ],
    "intraday_schedule": {
      "frequency": null,
      "units": "hours"
    },
    "output": {
      "sharded": true,
      "partitioned": false
    }
  }
}
  • This is the default config file.
  • 4 flat tables will be created.
  • There will be no flattening of the intraday table.
  • You may notice that "intraday_schedule" is not strictly necessary here (its frequency is null), but it provides a template in case you do want intraday flattening.
  • Only sharded output is produced.

Example 2 - excluding tables from flattening.

{
 "analytics_123456789": {
   "tables_to_flatten": [
     "events",
     "event_params"
   ],
   "intraday_schedule": {
     "frequency": null,
     "units": "hours"
   },
   "output": {
     "sharded": true,
     "partitioned": false
   }
 }
}
  • This config file will only create 2 flat tables for one GA4 property.
  • There will be no intraday flattening.
  • Only sharded output is produced.

Example 3 - adding more datasets, intraday flattening and partitioned output.

{
 "analytics_123456789": {
   "tables_to_flatten": [
     "events",
     "event_params",
     "user_properties",
     "items"
   ],
   "intraday_schedule": {
     "frequency": 15,
     "units": "minutes"
   },
   "output": {
     "sharded": true,
     "partitioned": true
   }
 },
 "analytics_987654321": {
   "tables_to_flatten": [
     "events",
     "event_params",
     "user_properties",
     "items"
   ],
   "intraday_schedule": {
     "frequency": 1,
     "units": "hours"
   },
   "output": {
     "sharded": true,
     "partitioned": true
   }
 }
}
  • This config file will create all 4 flat tables for each of the 2 GA4 properties.

  • In both properties, we will also do intraday flattening.

  • The flattened intraday tables will refresh every 15 minutes for the 1st dataset and every hour for the 2nd dataset.

  • In both properties, we create both sharded and partitioned output.

  • See another example in ./sample_config/config_datasets_sample.json in this repo.

Backfilling steps

The following steps are only required if you plan to backfill historical tables.

  1. Modify the values in the configuration section of tools/pubsub_message_publish.py accordingly. Suggestion: use a small date range to start, such as yesterday only.
  2. From a gcloud command prompt, authenticate the installing user using command: gcloud auth application-default login
  3. Run tools/pubsub_message_publish.py locally, which will publish a simulated logging event of GA4 data being ingested into BigQuery. Check dataset(s) that are configured for new date sharded tables such as (depending on what is configured):
    • flat_event_params_yyyymmdd
    • flat_events_yyyymmdd
    • flat_items_yyyymmdd
    • flat_user_properties_yyyymmdd
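When picking a backfill date range, the set of yyyymmdd shard suffixes between two dates can be sketched as follows (an illustration, not part of tools/pubsub_message_publish.py):

```python
from datetime import date, timedelta

def shard_suffixes(start, end):
    """yyyymmdd suffixes for every date in [start, end] -- the daily shards
    a backfill run over that range would touch."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]

# A small range to start with, per the suggestion above:
print(shard_suffixes(date(2024, 1, 1), date(2024, 1, 3)))
# ['20240101', '20240102', '20240103']
```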

Un-install steps

  1. Delete the config_datasets.json file from gs://[deployment_name]-[PROJECT_NUMBER]-adswerve-ga-flat-config/ (Cloud Storage Bucket within [PROJECT_ID])

    • You can do this in the GCS (Google Cloud Storage) UI or via the command line: gsutil rm gs://[deployment_name]-[PROJECT_NUMBER]-adswerve-ga-flat-config/config_datasets.json
  2. Remove solution:

    • gcloud deployment-manager deployments delete [deployment_name] -q
  3. If you were doing intraday flattening, manually delete the Cloud Scheduler job(s) in the UI

Common errors

Install

  • Message: AccessDeniedException: 403 [PROJECT_NUMBER]@cloudbuild.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.

  • Resolution: Ensure the value (Cloud Storage bucket name) configured in the "codeBucket" setting of ga_flattener*.yaml is correct. [PROJECT_NUMBER]@cloudbuild.gserviceaccount.com only requires the GCP pre-defined role of Cloud Build Service Account.

Verification

  • Message: google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started

  • Resolution: Run the command gcloud auth application-default login, which sets up the required Application Default Credentials.

Repository directories

  • cf : Pub/Sub-triggered cloud function that executes a destination query to unnest (flatten) the analytics_{ga4_property_id}.events_yyyymmdd table immediately upon its arrival in BigQuery into these tables, depending on the configuration:
    • flat_event_params_yyyymmdd
    • flat_events_yyyymmdd
    • flat_items_yyyymmdd
    • flat_user_properties_yyyymmdd
  • tests : unit tests for both cloud functions and the deployment manager templates
  • cfconfigbuilder(ps) : cloud function that finds all BigQuery datasets that have an events table and adds them to the default configuration on Google Cloud Storage in the following location: [deployment_name]-[PROJECT_NUMBER]-adswerve-ga-flat-config/config_datasets.json

Repository files

  • dm_helper.py: provides consistent names for GCP resources across the solution. Configuration and constants are also found in the class in this file
  • dmt_*: any files prefixed with dmt_ are Python-based Deployment Manager templates
  • ga_flattener.yaml: Deployment Manager configuration file. The entire solution is packaged in this file. Used in the deployment manager create command
  • tools/pubsub_message_publish.py : Python-based utility to publish a message that simulates the logging event emitted when GA4 data is ingested into BigQuery. Useful for smoke testing and backfilling data historically.
  • LICENSE: BSD 3-Clause open source license

Running unit tests

  • To run unit tests on your local machine, save an sa.json file with GCP service account credentials in the sandbox directory.
  • This is not strictly necessary; you can also authenticate locally with gcloud auth application-default login.
  • The GitHub CI/CD pipeline automatically saves sa.json into the sandbox directory and sets GOOGLE_APPLICATION_CREDENTIALS to its file path.
  • Set the environment variables in test_base.py: "deployment", "project" and "username".
  • Unit tests, whether run locally or via the GitHub CI/CD workflow, assume that the GA4 flattener is installed in your GCP project.

About

License: BSD 3-Clause "New" or "Revised" License


Languages

Language: Python 100.0%