abhirockzz / cosmosdb-synapse-workshop

Near Real Time Analytics with Azure Synapse Link for Azure Cosmos DB


In this workshop, you will learn about Azure Synapse Link for Azure Cosmos DB. We will go through some of the notebooks from the official samples repository - https://aka.ms/cosmosdb-synapselink-samples

It will cover:

  • Azure Cosmos DB and how it integrates with Azure Synapse Analytics using Azure Synapse Link.
  • Basic setup: provisioning Azure Cosmos DB and Azure Synapse Analytics, creating Azure Cosmos DB containers, and configuring a Linked Service.
  • Scenarios with examples
    • Batch Data Ingestion leveraging Synapse Link for Azure Cosmos DB and performing operations across Azure Cosmos DB containers
    • Streaming ingestion into Azure Cosmos DB collection using Structured Streaming
    • Getting started with Azure Cosmos DB's API for MongoDB and Synapse Link

Learning materials and resources

If you want to go back and learn some of the foundational concepts, the following resources might be helpful:

Pre-requisites

The steps outlined in this section have been completed in advance to save time. The following resources are already available:

  • Azure Cosmos DB accounts (Core SQL API, MongoDB API)
  • Azure Synapse Analytics workspace along with an Apache Spark pool

You can use an existing Azure account or create a free account using this link

First things first, create a Resource Group to host the resources used for this workshop. This will make it easier to manage them and clean up once you're done.
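If you prefer to script this step, here is a minimal sketch using the Azure SDK for Python; the subscription ID, resource group name, and region below are placeholders you should replace with your own values.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder - substitute your own subscription ID
subscription_id = "<your-subscription-id>"
credential = DefaultAzureCredential()

client = ResourceManagementClient(credential, subscription_id)

# Create (or update) the resource group that will hold all workshop resources;
# name and region are hypothetical examples
client.resource_groups.create_or_update(
    "cosmosdb-synapse-workshop-rg",
    {"location": "eastus"},
)
```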

Azure Cosmos DB

Azure Synapse Analytics

Workshop setup

For Azure Cosmos DB Core SQL API account:

Create an Azure Cosmos DB database (named RetailSalesDemoDB) and three containers (StoreDemoGraphics, RetailSales, and Products). Please make sure to:

  • Set the database throughput to Autoscale with a maximum of 40000 RU/s instead of 400. This will speed up the data loading process, and the database scales down automatically when it is not in use (check the documentation on how to set throughput).
  • Use /id as the Partition key for all 3 containers.
  • Set Analytical store to On for all 3 containers.

Detailed steps are outlined in the documentation. If you prefer to script this setup, see the sketch below.
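A minimal sketch using the azure-cosmos Python SDK (4.x), assuming you have the account URI and key at hand; it mirrors the portal steps above:

```python
from azure.cosmos import CosmosClient, PartitionKey, ThroughputProperties

# Placeholder credentials - use your Cosmos DB account's URI and primary key
client = CosmosClient("<account-uri>", credential="<account-key>")

# Autoscale database throughput with a 40000 RU/s ceiling, as described above
database = client.create_database_if_not_exists(
    id="RetailSalesDemoDB",
    offer_throughput=ThroughputProperties(auto_scale_max_throughput=40000),
)

# /id partition key and analytical store enabled (TTL of -1 = retain indefinitely)
for name in ("StoreDemoGraphics", "RetailSales", "Products"):
    database.create_container_if_not_exists(
        id=name,
        partition_key=PartitionKey(path="/id"),
        analytical_storage_ttl=-1,
    )
```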

For Azure Cosmos DB MongoDB API account:

Create a database named DemoSynapseLinkMongoDB along with a collection named HTAP with a shard key named item. Make sure you set the Analytical store option to On when you create the collection.
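This can also be scripted via the MongoDB extension commands that Azure Cosmos DB's API for MongoDB supports; a sketch using pymongo, with the connection string as a placeholder:

```python
from pymongo import MongoClient

# Placeholder - use the connection string from your Cosmos DB MongoDB API account
client = MongoClient("<mongodb-api-connection-string>")
db = client["DemoSynapseLinkMongoDB"]

# Cosmos DB custom action: create the collection with a shard key and
# analytical store enabled (analyticalStorageTtl of -1 = retain indefinitely)
db.command({
    "customAction": "CreateCollection",
    "collection": "HTAP",
    "shardKey": "item",
    "analyticalStorageTtl": -1,
})
```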

Add a Linked Service and static data in Azure Synapse workspace
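Once the Linked Service is in place, a Synapse Spark notebook can read the Cosmos DB analytical store through it using the cosmos.olap format (the spark session is provided by the notebook). A minimal sketch, assuming the Linked Service is named RetailSalesDemoDB:

```python
# Read the analytical store of the RetailSales container via the Linked Service
df = (spark.read
      .format("cosmos.olap")
      .option("spark.synapse.linkedService", "RetailSalesDemoDB")
      .option("spark.cosmos.container", "RetailSales")
      .load())

df.printSchema()
```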

You're all set to try out the Notebooks!

Scenarios

Batch Data Ingestion leveraging Synapse Link for Azure Cosmos DB

We will go through how to ingest batch data into Azure Cosmos DB with Synapse, using this notebook.

Clone or download the content from the samples repo, navigate to the Synapse/Notebooks/PySpark/Synapse Link for Cosmos DB samples/Retail/spark-notebooks/pyspark directory and import the 1CosmoDBSynapseSparkBatchIngestion.ipynb file into your Azure Synapse workspace
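The core of the notebook boils down to writing a Spark DataFrame into the Cosmos DB transactional store with the cosmos.oltp format. A sketch under stated assumptions: the source path is hypothetical, and the Linked Service is assumed to be named RetailSalesDemoDB:

```python
# Hypothetical source: a CSV file staged in ADLS Gen2 - replace with your own path
retail_sales = (spark.read
                .option("header", "true")
                .csv("abfss://<filesystem>@<account>.dfs.core.windows.net/RetailSales.csv"))

# Each document needs an 'id' column, since /id is the containers' partition key
(retail_sales.write
 .format("cosmos.oltp")
 .option("spark.synapse.linkedService", "RetailSalesDemoDB")
 .option("spark.cosmos.container", "RetailSales")
 .mode("append")
 .save())
```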

To learn more, read up on how to Create, develop, and maintain Synapse Studio notebooks in Azure Synapse Analytics

Join data across Cosmos DB containers

Clone or download the content from the samples repo, navigate to the Synapse/Notebooks/PySpark/Synapse Link for Cosmos DB samples/Retail/spark-notebooks/pyspark directory and import the 2SalesForecastingWithAML.ipynb file into your Azure Synapse workspace
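The pattern in that notebook is to read each container's analytical store into its own DataFrame and join them in Spark. A sketch, where the join column (productCode) is a hypothetical key name:

```python
# Read two containers' analytical stores into separate DataFrames
sales = (spark.read.format("cosmos.olap")
         .option("spark.synapse.linkedService", "RetailSalesDemoDB")
         .option("spark.cosmos.container", "RetailSales")
         .load())

products = (spark.read.format("cosmos.olap")
            .option("spark.synapse.linkedService", "RetailSalesDemoDB")
            .option("spark.cosmos.container", "Products")
            .load())

# Join across containers on a shared column (hypothetical key name)
sales_with_products = sales.join(products, on="productCode", how="inner")
sales_with_products.show(10)
```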

Streaming ingestion into Azure Cosmos DB collection using Structured Streaming

We will explore this notebook to get an overview of how to work with streaming data using Spark.

Clone or download the content from the samples repo, navigate to the Synapse/Notebooks/PySpark/Synapse Link for Cosmos DB samples/IoT/spark-notebooks/pyspark directory and import the 01-CosmosDBSynapseStreamIngestion.ipynb file into your Azure Synapse workspace
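In essence, the notebook pairs a streaming source with a cosmos.oltp sink. A sketch using Spark's built-in rate source as stand-in input; the Linked Service and container names are assumptions:

```python
from pyspark.sql.functions import col

# The built-in 'rate' source generates (timestamp, value) rows as a synthetic stream
stream_df = (spark.readStream
             .format("rate")
             .option("rowsPerSecond", 10)
             .load())

# Cosmos DB documents need a string 'id'; derive one from the generated value
signals = stream_df.withColumn("id", col("value").cast("string"))

# Continuously write the stream into the Cosmos DB transactional store
query = (signals.writeStream
         .format("cosmos.oltp")
         .option("spark.synapse.linkedService", "CosmosDBIoTDemo")
         .option("spark.cosmos.container", "IoTSignals")
         .option("checkpointLocation", "/tmp/streaming-checkpoint")
         .outputMode("append")
         .start())
```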

Getting started with Azure Cosmos DB's API for MongoDB and Synapse Link

We will explore this notebook. Since it uses specific Python libraries, you will need to upload the requirements.txt file located in the Synapse/Notebooks/PySpark/Synapse Link for Cosmos DB samples/E-Commerce/spark-notebooks/pyspark directory to install them as packages on your Spark pool.

Here is a detailed write-up on how to Manage libraries for Apache Spark in Azure Synapse Analytics

Clone or download the content from the samples repo, navigate to the Synapse/Notebooks/PySpark/Synapse Link for Cosmos DB samples/E-Commerce/spark-notebooks/pyspark directory and import the CosmosDBSynapseMongoDB.ipynb file into your Azure Synapse workspace.
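Reading a MongoDB API collection's analytical store works the same way as with the SQL API, through a Linked Service and cosmos.olap. Note that MongoDB API accounts use the full fidelity schema, so leaf fields surface as nested, type-suffixed columns (e.g. item.string). The Linked Service name below is an assumption:

```python
# Read the HTAP collection's analytical store (MongoDB API account)
df = (spark.read
      .format("cosmos.olap")
      .option("spark.synapse.linkedService", "CosmosDBMongoDBDemo")
      .option("spark.cosmos.container", "HTAP")
      .load())

# With full fidelity schema, typed leaves appear as nested fields, e.g.:
df.select("item.string").show(5)
```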

Wrap up and next steps

That's all for this workshop. I encourage you to go through the rest of this lab, which includes using AutoML in Azure Machine Learning to build a forecasting model.

Important: Delete resources

Delete the Azure Resource Group. This will delete all the resources inside the resource group.

Continue to learn at your own pace!
