veneciab / GiT_2023

This is the repo that is used for Girls in Tech conference Databricks streaming workshop


GiT 2023

Hola! Welcome to today's workshop. Please find the instructions for the workshop below.

INSTRUCTIONS

Agenda of the Workshop:

A quick demo that captures user messages in real time using Postman, then runs an ML model for sentiment analysis over the collected data in Databricks. Note: Azure cloud services were used for the demo.

Resources:

To run the shared notebooks, the user must have the following resources set up.

  • A cloud account with an active subscription (AWS, GCP, or Azure; for this workshop the demo runs on Azure).
  • Any API platform (Postman is used for the demo).
  • A cloud messaging service (we are using Azure Service Bus for the demo).
  • A Databricks workspace.
  • A key vault.

Using POSTMAN:

For the demo, the user will use Postman to post messages to the Service Bus. This requires a personal Postman account.

  • Open a free Postman account.
  • Create a workspace, or use an existing one.
  • Create a new collection.
  • Once the account is created, open a new page/tab by clicking the '+' sign.
  • On the new tab, select the "POST" option.
  • Copy the curl command below and paste it into the request.

curl --location 'https://databricksstreaming.servicebus.windows.net/topic1/messages' \
  --header 'Authorization: SharedAccessSignature sr=https%3A%2F%2Fdatabricksstreaming.servicebus.windows.net%2Ftopic1&sig=dCiyGYjowPsQBlw537yHzAzazWK/IGnWOIl2Xj2GGRM%3D&se=101682055697&skn=RootManageSharedAccessKey' \
  --header 'Content-Type: application/atom+xml;type=entry;charset=utf-8' \
  --header 'x-ms-retrypolicy: NoRetry'

  • The Headers tab will show three additional keys: Authorization, Content-Type, and x-ms-retrypolicy.

  • Please note that the Authorization header value used to set up the Postman request is a temporary token and will be invalid after the session.

  • In the Body section, type your answer to the question "How is your day going?" as the message.

  • Click the Send button.

  • The message will be sent to the Service Bus.
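The temporary Authorization value above is a Service Bus SharedAccessSignature (SAS) token. If you run your own namespace, a minimal sketch of how such a token can be generated in Python follows; the namespace URI, key name, and key in the usage line are placeholders, not the workshop's real credentials:

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def make_sas_token(resource_uri: str, key_name: str, key: str,
                   ttl_seconds: int = 3600) -> str:
    """Build a Service Bus SharedAccessSignature token."""
    expiry = str(int(time.time()) + ttl_seconds)
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    # Sign "<url-encoded resource URI>\n<expiry>" with HMAC-SHA256
    string_to_sign = (encoded_uri + "\n" + expiry).encode("utf-8")
    signed = hmac.new(key.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote_plus(base64.b64encode(signed))
    return (f"SharedAccessSignature sr={encoded_uri}"
            f"&sig={signature}&se={expiry}&skn={key_name}")


# Placeholder values -- substitute your own namespace, key name, and key.
token = make_sas_token(
    "https://databricksstreaming.servicebus.windows.net/topic1",
    "RootManageSharedAccessKey",
    "<your-shared-access-key>",
)
```

The resulting token goes in the Authorization header of the POST request, exactly as in the curl command above.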

Signing up for a free Azure account (Optional)

https://azure.microsoft.com/en-us/free/

Create an Azure Service Bus topic and subscription (Optional)

https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-quickstart-topics-subscriptions-portal

Signing up for a Databricks account (Optional):

  • Click "Try Databricks".
  • The user will be directed to the Databricks account page.
  • Create a Community Edition account by entering the user's details and following the Community Edition link for a free trial of Databricks.
  • Proceed to the Azure Databricks account.
  • We also need to create a cluster to run the notebooks and load the required libraries onto it.
  • List of libraries to be loaded as PyPI packages:
    • flax
    • transformers
    • tensorflow
    • emoji
    • azure-servicebus

Creating a notebook in Databricks (Optional) :

While in an active Databricks workspace,

  • Navigate to the Data Science and Engineering space.
  • Click the New button.

[New Button]

  • Navigate to the options and select Notebook from the dropdown.
  • A blank notebook will open.

Creating a cluster in Databricks:

  • Go to the main page of the Databricks account.
  • Navigate to the Compute option.


  • Click "Create cluster" on the right side of the screen.


  • Fill in the options for the compute; for the demo below, the standard cluster properties are sufficient.

[Cluster Definition]

For the demo we have chosen:

  1. Databricks Runtime: 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)
  2. Worker Type: Standard_DS3_v2
  3. Driver Type: Standard_DS3_v2
  4. Terminate after 10 minutes of inactivity (unless it is required for a job, as an idle cluster incurs cost)
  5. Access Mode: Single User

  • This configuration keeps costs low.
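The same UI choices can also be expressed as a payload for the Databricks Clusters API, which is handy for recreating the cluster programmatically. This is a sketch, not the workshop's actual setup: the cluster name and worker count are assumptions, and node type availability depends on your Azure region.

```python
# Demo cluster settings expressed as a Databricks Clusters API payload.
# Values mirror the UI choices above; adjust for your own workspace/region.
cluster_spec = {
    "cluster_name": "git-2023-workshop",       # hypothetical name
    "spark_version": "11.3.x-scala2.12",       # Databricks Runtime 11.3 LTS
    "node_type_id": "Standard_DS3_v2",         # worker type
    "driver_node_type_id": "Standard_DS3_v2",  # driver type
    "num_workers": 1,                          # small demo workload
    "autotermination_minutes": 10,             # terminate when idle to limit cost
    "data_security_mode": "SINGLE_USER",       # single-user access mode
}
```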

Installing libraries in the cluster

  • Open the Compute tab as before.
  • Open the cluster you created by clicking on it.
  • Select the "Libraries" tab in the header.
  • Click the Install library option.
  • Select the PyPI option.
  • Enter the name of each library listed above (one at a time) in the Package text box.
  • Click Install.

[Library]

[Library Installation]

Notebooks:

The notebook message_processing.py contains code snippets for the following actions.

Notebook: message_processing

  • Calling Azure Service Bus to get the messages (a topic and subscription were created under the Service Bus to capture the data).
  • Please note that the Azure Service Bus was created before the workshop, and a temporary access token will be shared with the user to send messages to it.
  • Running a pre-trained transformer model for sentiment analysis. (Note: the pre-trained model comes from Hugging Face and has been trained on prior data; more information can be found in the Hugging Face models list.)
  • Tagging the predictions with the original messages.
  • Storing the predicted results in Delta Lake.
  • Visualizations over the model's results will be demonstrated in the dashboard.
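The steps above can be sketched as a single receive-classify-tag-store loop. This is a minimal illustration, not the actual notebook: the Service Bus receiver, Hugging Face pipeline, and Delta write are shown only as comments, a trivial keyword stub stands in for the real sentiment model, and names such as `classify` and `process_messages` are assumptions.

```python
# In the real notebook these come from the cluster libraries installed earlier:
#   from azure.servicebus import ServiceBusClient
#   from transformers import pipeline
#   classifier = pipeline("sentiment-analysis")


def classify(text: str) -> dict:
    """Stub standing in for the Hugging Face sentiment pipeline."""
    positive = any(w in text.lower() for w in ("good", "great", "amazing"))
    return {"label": "POSITIVE" if positive else "NEGATIVE", "score": 0.99}


def process_messages(messages):
    """Run sentiment analysis and tag each prediction with its original message."""
    results = []
    for msg in messages:
        pred = classify(msg)  # with transformers this would be classifier(msg)[0]
        results.append({"message": msg,
                        "label": pred["label"],
                        "score": pred["score"]})
    return results


# On Databricks the tagged results would then be stored in Delta Lake, e.g.:
#   spark.createDataFrame(results).write.format("delta") \
#        .mode("append").saveAsTable("sentiment_results")
rows = process_messages(["My day is going great!", "Rough morning."])
```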
