yuessir / dp203-azure-data-engineering

notes for DP203-Data Engineering on Azure

DP-203 Notes:

2. Data Storage:

  • azure storage account
  • Azure Blob Storage vs. Azure Data Lake Storage Gen2
  • storage account -> access keys
  • storage account -> shared access signature (SAS)
  • storage account -> Redundancy
    • LRS
    • ZRS
    • GRS
    • Read-access GRS
    • GZRS
    • Read-access GZRS
  • storage account -> access tiers
    • Hot: frequently accessed data
    • Cool: infrequently accessed data
    • Archive: not available at the storage account level, only at the individual blob level
  • storage account -> Lifecycle management

3. T-SQL

  • when using a WHERE clause in a data warehouse -> use PARTITIONS on the filtered column -> partitions can be eliminated, which increases the efficiency of SQL queries (see the sketch below)
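
  A minimal sketch of the idea (the FactSales table and its columns are made up for illustration): partition the fact table on the date column and filter on that same column, so whole partitions can be skipped.

  CREATE TABLE FactSales
  (
      SaleId INT NOT NULL,
      OrderDate DATE NOT NULL,
      Amount MONEY NOT NULL
  )
  WITH
  (
      DISTRIBUTION = HASH (SaleId),
      PARTITION ( OrderDate RANGE RIGHT FOR VALUES ('2021-01-01','2021-02-01') )
  );

  -- The WHERE clause on the partition column limits the scan to a single partition
  SELECT SUM(Amount) AS TotalAmount
  FROM FactSales
  WHERE OrderDate >= '2021-01-01' AND OrderDate < '2021-02-01';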

4. Azure Synapse Analytics

4.1 Azure Synapse
  • features of azure synapse analytics
  • compute options:
    • serverless SQL pool
    • dedicated SQL pool
      • DWU - Data Warehousing Unit
    • apache spark pool
4.2 External tables & Serverless SQL pool / Dedicated SQL pool
  • steps to create and use external table

    • create a database in the Synapse workspace
    • create a database master key with encryption by password; this will be used to protect the Shared Access Signature
    • use SAS to authorize the use of the ADLS account, then create a database scoped credential (SasToken)
    • create the external data source (can be Hadoop, Blob Storage, ADLS)
    • create an external file format object that defines the external data (file format = DELIMITEDTEXT or PARQUET)
    • define the external table
    • use the table for analysis (there will be a lag, as data is stored in the external source)
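    -- Before creating a database scoped credential, the database needs a master key
    -- (a minimal sketch; the password below is a placeholder, not from the original notes)
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Str0ngPlaceholderP@ssw0rd!';
    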
    CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
    WITH
    IDENTITY = 'ADLS-name',
    SECRET = 'ACCESS_KEY';
    
    -- In the SQL pool, we can use the Hadoop driver to define the source
    
    CREATE EXTERNAL DATA SOURCE log_data
    WITH (    LOCATION   = 'abfss://data@ADLSNAME.dfs.core.windows.net',
            CREDENTIAL = AzureStorageCredential,
            TYPE = HADOOP
    )
    
    -- Drop the table if it already exists
    DROP EXTERNAL TABLE [logdata]
    
    -- Here we specify the file format as Parquet
    
    CREATE EXTERNAL FILE FORMAT parquetfile  
    WITH (  
        FORMAT_TYPE = PARQUET,  
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  
    );
    
    -- Notice that the column names don't contain spaces
    -- When Azure Data Factory was used to generate these files, the column names could not have spaces
    
    CREATE EXTERNAL TABLE [logdata]
    (
        [Id] [int] NULL,
        [Correlationid] [varchar](200) NULL,
        [Operationname] [varchar](200) NULL,
        [Status] [varchar](100) NULL,
        [Eventcategory] [varchar](100) NULL,
        [Level] [varchar](100) NULL,
        [Time] [datetime] NULL,
        [Subscription] [varchar](200) NULL,
        [Eventinitiatedby] [varchar](1000) NULL,
        [Resourcetype] [varchar](1000) NULL,
        [Resourcegroup] [varchar](1000) NULL
    )
    WITH (
    LOCATION = '/parquet/',
        DATA_SOURCE = log_data,  
        FILE_FORMAT = parquetfile
    )
    
    /*
    A common error when selecting the data is a message such as MalformedInput.
    
    Make sure the column names map correctly and the data types match the Parquet file definition.
    */
    
    
    SELECT * FROM [logdata]
4.3 Loading data into data warehouse (SQL pool)
  • using T-SQL COPY statement

  • using an Azure Synapse pipeline, which can perform transformations on the data before copying it to the warehouse

  • using PolyBase to define external tables, then using those external tables to create the internal tables (CTAS)

  • 1 - load data using COPY statement

    • never use the admin account for load operations
    • create a separate user for load operations
    • best practice - create a workload group - to segregate CPU percentage across groups of users (a sketch of this setup follows the COPY examples below)
    • grant permissions
    • csv:
      COPY INTO logdata FROM 'https://appdatalake7000.blob.core.windows.net/data/Log.csv'
      WITH
      (
      FIRSTROW=2
      )
    • parquet:
      COPY INTO [logdata] FROM 'https://jibsyadls.blob.core.windows.net/data/raw/parquet/*.parquet'
      WITH
      (
      FILE_TYPE='PARQUET',
      CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='sv=2021-06-08&ss=b&srt=sco&sp=rl&se=2022-12-22T14:08:01Z&st=2022-12-22T06:08:01Z&spr=https&sig=WU%2FFh62PcCSx7wSEuccKC%2FdlgAwIto2aHJVXMiPovfM%3D')
      )
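    • a minimal sketch of the load-user and workload-group setup (the names, password and percentages are placeholders, not from the original notes):
      -- run in the master database
      CREATE LOGIN loader_login WITH PASSWORD = 'Str0ngPlaceholderP@ssw0rd!';
      
      -- run in the dedicated SQL pool database
      CREATE USER loader FOR LOGIN loader_login;
      GRANT ADMINISTER DATABASE BULK OPERATIONS TO loader;   -- needed for COPY INTO, along with INSERT on the target
      GRANT INSERT ON [logdata] TO loader;
      
      -- reserve resources for load requests and classify the loader user into the group
      CREATE WORKLOAD GROUP DataLoads
      WITH
      (
      MIN_PERCENTAGE_RESOURCE = 25,
      CAP_PERCENTAGE_RESOURCE = 50,
      REQUEST_MIN_RESOURCE_GRANT_PERCENT = 25
      );
      
      CREATE WORKLOAD CLASSIFIER [wgcLoader]
      WITH
      (
      WORKLOAD_GROUP = 'DataLoads',
      MEMBERNAME = 'loader'
      );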
  • 2 - load data via an external table (PolyBase)

    CREATE TABLE [logdata]
    WITH
    (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX (id)   
    )
    AS
    SELECT  *
    FROM  [logdata_external];
  • 3 - BULK INSERT from Azure Synapse

    • in azure data studio -> connect to external data
    • select ADLS gen2 (new linked service --- part of azure data factory)
    • creating connection to data store
    • fill the details and create
    • before copying data -> specific permissions have to be given to the storage account
    • go to Access Control in the data storage account
    • add a role assignment
    • role = storage blob data contributor (allows user to read and write data)
    • choose azure admin account & save
    • go to azure data studio -> linked -> select connected ADLS -> select the file
    • select the file, right click and select bulk load
    • this will automatically create the SQL script
4.4 Designing a data warehouse
  • fact table
    • contains measurable facts
    • usually large in size
  • dimension table
  • note: there is no foreign key support in the SQL data warehouse (dedicated SQL pool) in Azure Synapse
  • star schema
  • ideal practices while building dimension tables:
    • don't have NULL values for properties in the dimension table; NULLs won't give the desired results when using reporting tools
    • try to replace NULL with some default value
  • surrogate key: a new key added to the dimension table when mixing two different data sources that have the same primary keys
  • can use the "Identity column" feature in Azure Synapse to generate the unique IDs
  • right approach -> take the different source tables & create the fact table in Synapse itself, using ADF to migrate the tables from the Azure SQL database to Azure Synapse
4.5 Transfer data from azure sql database to azure synapse
  • create table structure in synapse
  • open synapse studio -> Integrate -> Copy data tool
  • make connection with the source (azure sql database) - select the server, database and table details
  • make connection to the target (azure synapse) - select the required details
  • select the staging area in ADLS2 or blob - used by the copy statement
4.6 Reading JSON files from ADLS/Blob
-- Here we are using the OPENROWSET Function

SELECT TOP 100
    jsonContent
FROM
    OPENROWSET(
        BULK 'https://appdatalake7000.dfs.core.windows.net/data/log.json',
        FORMAT = 'CSV',
        FIELDQUOTE = '0x0b',
        FIELDTERMINATOR ='0x0b',
        ROWTERMINATOR = '0x0a'
    )
    WITH (
        jsonContent varchar(MAX)
    ) AS [rows]

-- The above statement returns each line of the file as a single string
-- Next we can extract the values into separate columns

SELECT 
   CAST(JSON_VALUE(jsonContent,'$.Id') AS INT) AS Id,
   JSON_VALUE(jsonContent,'$.Correlationid') As Correlationid,
   JSON_VALUE(jsonContent,'$.Operationname') AS Operationname,
   JSON_VALUE(jsonContent,'$.Status') AS Status,
   JSON_VALUE(jsonContent,'$.Eventcategory') AS Eventcategory,
   JSON_VALUE(jsonContent,'$.Level') AS Level,
   CAST(JSON_VALUE(jsonContent,'$.Time') AS datetimeoffset) AS Time,
   JSON_VALUE(jsonContent,'$.Subscription') AS Subscription,
   JSON_VALUE(jsonContent,'$.Eventinitiatedby') AS Eventinitiatedby,
   JSON_VALUE(jsonContent,'$.Resourcetype') AS Resourcetype,
   JSON_VALUE(jsonContent,'$.Resourcegroup') AS Resourcegroup
FROM
    OPENROWSET(
        BULK 'https://appdatalake7000.dfs.core.windows.net/data/log.json',
        FORMAT = 'CSV',
        FIELDQUOTE = '0x0b',
        FIELDTERMINATOR ='0x0b',
        ROWTERMINATOR = '0x0a'
    )
    WITH (
        jsonContent varchar(MAX)
    ) AS [rows]
4.7 Azure Synapse Architecture
  • there are 60 distributions
  • data is sharded across the distributions to optimize the performance of the work
  • data and compute are separate, so they can scale independently
  • control node - optimizes the query for parallel processing
  • work is then passed to the compute nodes, and these nodes do the work in parallel
4.8 Types of tables
  • Round-robin distributed tables:

    • data is distributed randomly
    • default distribution while creating tables
    • best for temporary or staging tables
    • If there are no joins performed on tables, then you can consider using this table type
    • Also, if there is no clear candidate column for hash distributing the table.
  • Hash-distributed tables:

    • data is distributed based on HASH()
    • good for large tables - fact tables
    • while choosing the distribution column, pick one that:
      • has many unique values - data gets spread across more distributions - otherwise it may result in DATA SKEW (don't use a date column)
      • does not have NULLs, or has very few NULLs
      • is used in JOIN, GROUP BY and HAVING clauses
      • is not used in the WHERE clause
    •      CREATE TABLE [dbo].[SalesFact](
           [ProductID] [int] NOT NULL,
           [SalesOrderID] [int] NOT NULL,
           [CustomerID] [int] NOT NULL,
           [OrderQty] [smallint] NOT NULL,
           [UnitPrice] [money] NOT NULL,
           [OrderDate] [datetime] NULL,
           [TaxAmt] [money] NULL
           )
           WITH  
           (   
               DISTRIBUTION = HASH (CustomerID)
           )
  • Replicated tables:

    • full copy of table is cached on every distribution (compute node)
    • good for dimension tables
    • ideal for tables less than 2 GB
    • not ideal for tables with frequent insert, update and delete
    • Use replicated tables for queries with simple query predicates, such as equality or inequality
    • Use distributed tables for queries with complex query predicates, such as LIKE or NOT LIKE
    •   CREATE TABLE [dbo].[SalesFact](
        [ProductID] [int] NOT NULL,
        [SalesOrderID] [int] NOT NULL,
        [CustomerID] [int] NOT NULL,
        [OrderQty] [smallint] NOT NULL,
        [UnitPrice] [money] NOT NULL,
        [OrderDate] [datetime] NULL,
        [TaxAmt] [money] NULL
        )
        WITH  
        (   
            DISTRIBUTION = REPLICATE
        )
  • If we don't use hash-distributed tables for fact tables and replicated tables for dimension tables, then when performing JOINs or other operations, data has to be moved from one distribution to another. This is called a "DATA SHUFFLE MOVE OPERATION" and can cause a significant time lag for very big tables (see the EXPLAIN sketch below).
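
  A minimal sketch of spotting a shuffle (reusing the SalesFact table above; not from the original notes): prefix a query with EXPLAIN and look for SHUFFLE_MOVE steps in the plan XML that is returned.

  EXPLAIN
  SELECT [ProductID], SUM([OrderQty]) AS TotalQty
  FROM [dbo].[SalesFact]
  GROUP BY [ProductID];

  -- With SalesFact hash-distributed on CustomerID, aggregating by ProductID forces a
  -- SHUFFLE_MOVE step; distributing on the column used in joins/aggregations avoids it.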

4.9 Surrogate keys for dimension tables
  • surrogate key == non-business key
  • simple incrementing integer values
  • in SQL pool tables, use IDENTITY column feature
CREATE TABLE [dbo].[DimProduct](
	[ProductSK] [int] IDENTITY(1,1) NOT NULL,
	[ProductID] [int] NOT NULL,
	[ProductModelID] [int] NOT NULL,
	[ProductSubcategoryID] [int] NOT NULL,
	[ProductName] varchar(50) NOT NULL,
	[SafetyStockLevel] [smallint] NOT NULL,
	[ProductModelName] varchar(50) NULL,
	[ProductSubCategoryName] varchar(50) NULL
)
  • when data is copied with the Synapse Studio "Integrate" copy tool -> the Identity column is not incremented one by one - it jumps by the number of distributions
  • ADF can properly create incremental numbers in the IDENTITY column
4.10 Slowly changing dimensions
  • type-1 SCD: updates the OLD value with the NEW value in the data warehouse
  • type-2 SCD: keeps both the OLD and NEW values (using start_date, end_date and is_active columns); see the sketch below
  • type-3 SCD: instead of having multiple rows, additional columns are added to signify the change
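
A minimal sketch of a type-2 change (the DimCustomer table and its columns are made up for illustration): the currently active row is closed off and a new active row is inserted.

-- close the currently active row for the changed customer
UPDATE [dbo].[DimCustomer]
SET [EndDate] = '2021-06-01', [IsActive] = 0
WHERE [CustomerID] = 1001 AND [IsActive] = 1;

-- insert the new version of the row with the changed attribute
INSERT INTO [dbo].[DimCustomer] ([CustomerID], [City], [StartDate], [EndDate], [IsActive])
VALUES (1001, 'Seattle', '2021-06-01', NULL, 1);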
4.11 Heap tables
CREATE TABLE [dbo].[SalesFact_staging](
	[ProductID] [int] NOT NULL,
	[SalesOrderID] [int] NOT NULL,
	[CustomerID] [int] NOT NULL,
	[OrderQty] [smallint] NOT NULL,
	[UnitPrice] [money] NOT NULL,
	[OrderDate] [datetime] NULL,
	[TaxAmt] [money] NULL
)
WITH(HEAP,
DISTRIBUTION = ROUND_ROBIN
)

CREATE INDEX ProductIDIndex ON [dbo].[SalesFact_staging] (ProductID)
  • the WITH(HEAP) option above means this does not create a clustered columnstore table
  • clustered columnstore tables: used for the final tables
  • for temporary/staging tables - HEAP tables are preferred
  • in heap tables there is no clustered columnstore index
  • so we can create a non-clustered INDEX using CREATE INDEX
4.12 Partitions
-- Let's create a new table with partitions
CREATE TABLE [logdata]
(
    [Id] [int] NULL,
	[Correlationid] [varchar](200) NULL,
	[Operationname] [varchar](200) NULL,
	[Status] [varchar](100) NULL,
	[Eventcategory] [varchar](100) NULL,
	[Level] [varchar](100) NULL,
	[Time] [datetime] NULL,
	[Subscription] [varchar](200) NULL,
	[Eventinitiatedby] [varchar](1000) NULL,
	[Resourcetype] [varchar](1000) NULL,
	[Resourcegroup] [varchar](1000) NULL
)
WITH
(
PARTITION ( [Time] RANGE RIGHT FOR VALUES
            ('2021-04-01','2021-05-01','2021-06-01')

   )  
)

Switching partitions
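
Before the switch below can run, the target table [logdata_new] must already exist with an identical structure; a minimal sketch (not from the original notes; the boundary values are chosen so that partition 1 of [logdata_new] covers the range held in partition 2 of [logdata]):

CREATE TABLE [logdata_new]
(
    [Id] [int] NULL,
    [Correlationid] [varchar](200) NULL,
    [Operationname] [varchar](200) NULL,
    [Status] [varchar](100) NULL,
    [Eventcategory] [varchar](100) NULL,
    [Level] [varchar](100) NULL,
    [Time] [datetime] NULL,
    [Subscription] [varchar](200) NULL,
    [Eventinitiatedby] [varchar](1000) NULL,
    [Resourcetype] [varchar](1000) NULL,
    [Resourcegroup] [varchar](1000) NULL
)
WITH
(
PARTITION ( [Time] RANGE RIGHT FOR VALUES
            ('2021-05-01','2021-06-01')
   )
)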

ALTER TABLE [logdata] SWITCH PARTITION 2 TO [logdata_new] PARTITION 1;
4.13 Indexes
  • Clustered Columnstore Indexes (the default, best suited for large fact tables)
  • Heap tables
  • Clustered Indexes
  • Non-Clustered Indexes (see the sketch below)
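
A minimal sketch of specifying each index option explicitly (the table names are placeholders, not from the original notes):

-- clustered columnstore index (also the default when no index option is given)
CREATE TABLE [dbo].[SalesFact_cci]
(
    [ProductID] [int] NOT NULL,
    [OrderQty] [smallint] NOT NULL
)
WITH ( CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH (ProductID) )

-- clustered rowstore index on a specific column
CREATE TABLE [dbo].[SalesFact_ci]
(
    [ProductID] [int] NOT NULL,
    [OrderQty] [smallint] NOT NULL
)
WITH ( CLUSTERED INDEX (ProductID), DISTRIBUTION = ROUND_ROBIN )

-- heap table plus an additional non-clustered index (as in the heap example above)
CREATE TABLE [dbo].[SalesFact_heap]
(
    [ProductID] [int] NOT NULL,
    [OrderQty] [smallint] NOT NULL
)
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN )

CREATE INDEX ProductIDIdx ON [dbo].[SalesFact_heap] (ProductID)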

5. Design and Develop Data Processing - Azure Data Factory

5.1 Azure Data Factory
  • cloud-based ETL tool
  • data-driven orchestrated workflows

ADF components:

  • Linked Service: defines the connection information (to a data store or compute resource) needed to ingest data from the source

  • Datasets: represents the data structure within the data store that is being referenced by the Linked Service object

  • Activity: contains the actual transformation logic

  • the pipeline runs on compute infrastructure known as the Integration Runtime - responsible for taking data from the source and copying it to the destination

5.2 Mapping Data Flows
  • This helps to visualize the data transformations in Azure Data Factory.
  • Here you can write the required transformation logic without actually writing any code.
  • The data flows are run on Apache Spark clusters.
  • Here Azure Data Factory will handle the transformations in the data flow.
  • Debug mode – lets you see the results of each transformation while building the flow.
  • In a debug session, the data flow is run interactively on a Spark cluster.
  • the minimum cluster size to run a Data Flow is 8 vCores.
5.3 Self-Hosted Integration runtime
  • used when the data source sits in your own custom system, e.g. a database inside a VM
  • install the integration runtime on the VM
  • register the server with the data factory

6. Azure Event Hubs and Streaming Analytics

6.1 Azure Event Hubs
  • big data streaming platform
  • can receive and process millions of events per second
  • can stream log data, telemetry data, or any sort of events to Azure Event Hubs
  • event hubs namespace -> event hubs
  • event hubs - multiple partitions - ingest more data at a time - event receivers can read data from one partition or multiple partitions - this helps event receivers consume data at a faster rate

Components of Azure event hubs:

  • event producers: entity that sends data to the event hub - events can be published using the HTTPS, AMQP or Apache Kafka protocols
  • partitions: data is split across partitions - allows for better throughput of data into event hubs
  • consumer groups: a view (state, position or offset) of an entire event hub
  • throughput units: control the throughput capacity of event hubs
  • receivers: entity that reads event data - for example a Stream Analytics job (see the sketch below)
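
A minimal sketch of a Stream Analytics query that reads from an Event Hub input and aggregates over a tumbling window (the input/output aliases and the DeviceId field are placeholders defined in the job, not from the original notes):

SELECT
    DeviceId,
    COUNT(*) AS EventCount
INTO
    [synapse-output]
FROM
    [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
    DeviceId,
    TumblingWindow(minute, 5)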

7. Spark Pool

7.1 Azure Synapse - Apache Spark pool
  • serverless spark pool

  • not charged on creation of pool

  • charged when underlying jobs are running

  • handles large datasets and distributes the computation across multiple nodes

  • driver node and executors

  • spark scala

  • creates RDD - Resilient Distributed Dataset

val data = Array(1, 2, 3, 4, 5)   // sample data, so the snippet runs on its own
val dist = sc.parallelize(data)
7.2 Spark Dataset
  • This is a strongly typed collection of domain-specific objects
  • This data can then be transformed in parallel
  • Normally you will perform either transformations or actions on a dataset
  • The transformation will produce a new dataset
  • The action will trigger a computation and produce the required result
  • The benefit of having a Dataset is that you can use powerful transformations on the underlying data
7.3 Spark Dataframe
  • The DataFrame is nothing but a Dataset that is organized into named columns.

  • It's like a table in a relational database.

  • You can construct DataFrames from external files.

  • When it comes to Datasets, the API for working with Datasets is only available for Scala and Java.

  • For DataFrames, the API is available in Scala, Java, Python and R.

  • In a spark pool, a spark instance is created when you connect to the pool, create a session and run a job

  • when you submit another job, if there is capacity in the pool and the running spark instance has spare capacity, it will run the 2nd job

  • else, it will create a new spark instance to run the job

7.4 Spark table
  • stored in the metastore of the spark pool (Hive metastore)
  • not for storing data, just for temporary tables
  • the benefit of spark tables: the metastore is shared with the serverless SQL pool as well
%%spark
val df = spark.read.sqlanalytics("jibsypool.dbo.logdata") 
df.write.mode("overwrite").saveAsTable("logdatainternal")

%%sql
SELECT * FROM logdatainternal
7.5 Spark tables - Creation
  • spark tables are parquet based tables
%%sql
CREATE DATABASE internaldb;
CREATE TABLE internaldb.customer(Id int, name varchar(200)) USING Parquet

%%sql
INSERT INTO internaldb.customer VALUES(1,'UserA')

%%sql
SELECT * FROM internaldb.customer


// If you want to load data from the log.csv file and then save to a table
%%pyspark
df = spark.read.load('abfss://data@datalake2000.dfs.core.windows.net/raw/Log.csv', format='csv', header=True)
df.write.mode("overwrite").saveAsTable("internaldb.logdatanew")

%%sql
SELECT * FROM internaldb.logdatanew
  • to delete the database, tables have to be dropped first
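
A minimal sketch of the cleanup order (dropping the tables created above, then the database):

%%sql
DROP TABLE internaldb.customer;
DROP TABLE internaldb.logdatanew;
DROP DATABASE internaldb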
7.6 Spark Pool - JSON files
%%spark

val df = spark.read.format("json").load("abfss://data@datalake2000.dfs.core.windows.net/raw/customer/customer_arr.json")
display(df)

// Now we need to expand the courses information

%%spark
import org.apache.spark.sql.functions._
val df = spark.read.format("json").load("abfss://data@datalake2000.dfs.core.windows.net/raw/customer/customer_arr.json")
val newdf=df.select(col("customerid"),col("customername"),col("registered"),explode(col("courses")))
display(newdf)

// Reading the customer object file
%%spark
import org.apache.spark.sql.functions._
val df = spark.read.format("json").load("abfss://data@datalake2000.dfs.core.windows.net/raw/customer/customer_obj.json")
val newdf=df.select(col("customerid"),col("customername"),col("registered"),explode(col("courses")),col("details.city"),col("details.mobile"))
display(newdf)

8. Databricks

8.1 Databricks
  • makes use of apache spark to provide a unified analytics platform
  • creates the underlying compute infra
  • has its own underlying file system - abstraction of an underlying storage layer
  • will install spark by itself - also has compatibility with other libraries, e.g. ML libraries
  • provides workspace - notebooks with collaboration and visualization features
8.2 Azure Databricks
  • completely azure-managed environment
  • makes use of underlying compute infrastructure and virtual networks
  • makes use of azure security - azure active directory and role-based access control

Clusters in Azure Databricks

  • inside cluster - 2 types of nodes

    • worker nodes - perform the underlying tasks
    • driver node - distributes the task to worker nodes
  • 2 types of clusters

    • Interactive cluster: used with interactive notebooks; multiple users can use the cluster for collaboration
    • Job cluster: cluster is started when the job has to run, and will be terminated once the job is completed
  • 2 types of Interactive cluster

    • Standard cluster:
      • recommended if you are a single user
      • no fault isolation - if multiple users share the cluster and one user's workload faults, it might impact the workloads of the other users
      • resources of a cluster might get allocated to a single workload
      • has support for python, R, SQL and Scala
    • High concurrency cluster:
      • for multiple users
      • fault isolation
      • resources are shared across different user workloads
      • support for python, R and SQL (no scala)
      • table access control: can grant and revoke access to data from Python and SQL
8.3 Autoscaling a cluster
  • When creating an Azure Databricks cluster, you can specify a minimum and maximum number of workers for the cluster.
  • Databricks will then choose the ideal number of workers to run the job.
  • If a certain phase of your job requires more compute power, the workers will be assigned accordingly.
  • There are two types of autoscaling
    • Standard autoscaling
      • Here the cluster starts with 8 nodes
      • Scales down only when the cluster is completely idle and it has been underutilized for the last 10 minutes
      • Scales down exponentially, starting with 1 node
    • Optimized autoscaling
      • This is only available for Azure Databricks Premium Plan
      • Can scale down even if the cluster is not idle by looking at shuffle file state
      • Scales down based on a percentage of current nodes
      • On job clusters, scales down if the cluster is underutilized over the last 40 seconds
      • On all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds
8.4 Azure Databricks Table
  • In Azure Databricks, you can also create a database and tables
  • The table is a collection of structured data
  • You can then perform operations on the data that are supported by Apache Spark on DataFrames on Azure Databricks tables
  • There are two types of tables – global and local tables.
  • A global table is available across all clusters
  • A global table is registered in the Azure Databricks Hive metastore or an external metastore
  • The local table is not accessible from other clusters and is not registered in the Hive metastore
8.5 Delta Lake
  • ACID transactions on Spark - Serializable isolation levels ensure that readers never see inconsistent data
  • Scalable metadata handling - Leverages Spark distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
  • Streaming and batch unification - A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
  • Schema enforcement - Automatically handles schema variations to prevent insertion of bad records during ingestion.
  • Time travel - Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
  • Upserts and deletes - Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
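
A minimal sketch of the upsert and time-travel features in Delta Lake SQL (the customers and customer_updates table names are placeholders, not from the original notes):

-- upsert incoming changes into a Delta table
MERGE INTO customers AS target
USING customer_updates AS source
ON target.customerid = source.customerid
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- time travel: query an earlier version of the table
SELECT * FROM customers VERSION AS OF 1;

-- inspect the table's change history
DESCRIBE HISTORY customers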

9 Security

  • Azure Key Vault - certificates, encryption keys and secrets (passwords and login details)

  • Azure Data Factory – Encryption

    • Azure Data Factory already encrypts data at rest which also includes entity definitions and any data that is cached.
    • The encryption is carried out with Microsoft-managed keys.
    • But you can also define your own keys using the Azure Key vault service.
    • For the key vault, you have to ensure that Soft delete is enabled and the setting of Do Not Purge is also enabled.
    • Also grant Azure Data Factory the key permissions of 'Get', 'Unwrap Key' and 'Wrap Key'
  • Azure Synapse - Data Masking

    • Here the data in the table can be limited in its exposure to non-privileged users.
    • You can create a rule that can mask the data.
    • Based on the rule you can decide on the amount of data to expose to the user.
    • There are different masking rules.
    • Credit card masking rule – used to mask columns that contain credit card details; only the last four digits of the field are exposed.
    • Email – the first letter of the email address is exposed, and the domain name is replaced with XXX.com.
    • Custom text – you decide which characters to expose for a field.
    • Random number – a random number is generated for the field.
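    • a minimal sketch of applying these masking rules with T-SQL (the table and columns are placeholders, not from the original notes; masks can also be configured from the portal):

ALTER TABLE [dbo].[Customers]
ALTER COLUMN [Email] ADD MASKED WITH (FUNCTION = 'email()');

ALTER TABLE [dbo].[Customers]
ALTER COLUMN [CreditCardNumber] ADD MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-XXXX-",4)');

ALTER TABLE [dbo].[Customers]
ALTER COLUMN [Age] ADD MASKED WITH (FUNCTION = 'random(1, 100)');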
  • Azure Synapse - Auditing

    • You can enable auditing for an Azure SQL Pool in Azure Synapse Analytics.
    • This feature can be used to track database events and write them to an audit log.
    • The logs can be stored in an Azure storage account, a Log Analytics workspace and Azure Event Hubs.
    • This helps in regulatory compliance. It helps to gain insights on any anomalies when it comes to database activities.
    • Auditing can be enabled at the data warehouse level or server level.
    • If it is applied at the server level, then it will be applied to all of the data warehouses that reside on the server
  • Azure Synapse - Data Discovery and Classification

    • This feature provides capabilities for discovering, classifying, labelling, and reporting the sensitive data in your databases.
    • The data discovery feature can scan the database and identify columns that contain sensitive data. You can then view and apply the recommendations accordingly.
    • You can then apply sensitivity labels to the column. This helps to define the sensitivity level of the data stored in the column.
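    • a minimal sketch of applying a sensitivity label with T-SQL (the table, column and label names are placeholders, not from the original notes):

ADD SENSITIVITY CLASSIFICATION TO [dbo].[Orders].[CardNumber]
WITH ( LABEL = 'Highly Confidential', INFORMATION_TYPE = 'Financial', RANK = CRITICAL );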
  • Row level security:

-- Create a new schema for the security function

CREATE SCHEMA Security;  

-- Create an inline table-valued function
-- The function returns 1 when a row in the Agent column is the same as the user executing the query
-- (@Agent = USER_NAME()) or if the user executing the query is the Supervisor user (USER_NAME() = 'Supervisor').

CREATE FUNCTION Security.securitypredicate(@Agent AS nvarchar(50))  
    RETURNS TABLE  
WITH SCHEMABINDING  
AS  
    RETURN SELECT 1 AS securitypredicate_result
WHERE @Agent = USER_NAME() OR USER_NAME() = 'Supervisor';  

-- Create a security policy adding the function as a filter predicate. The state must be set to ON to enable the policy.

CREATE SECURITY POLICY Filter  
ADD FILTER PREDICATE Security.securitypredicate(Agent)
ON [dbo].[Orders] 
WITH (STATE = ON);  
GO

-- Lab - Azure Synapse - Row-Level Security

-- Allow SELECT permissions to the function

GRANT SELECT ON Security.securitypredicate TO Supervisor;
GRANT SELECT ON Security.securitypredicate TO AgentA;  
GRANT SELECT ON Security.securitypredicate TO AgentB;
  • Azure Synapse - Column level security
CREATE USER Supervisor WITHOUT LOGIN;  
CREATE USER UserA WITHOUT LOGIN;  

-- Grant access to the tables for the users

GRANT SELECT ON [dbo].[Orders] TO Supervisor; 
GRANT SELECT ON [dbo].[Orders](OrderID,Course,Quantity) TO UserA; 
