sethiaarun / mapping-dataflow-to-fabric-with-openai

Convert Azure Mapping dataflow to Microsoft Fabric PySpark Notebook using OpenAI

Introduction

This tool uses the OpenAI API to convert Azure Mapping Dataflow script code into [Microsoft Fabric PySpark](https://learn.microsoft.com/en-us/fabric/data-engineering/how-to-use-notebook) notebook code.

The tool either calls the ADF REST API to retrieve the dataflow script code or reads it from a local file, then uses OpenAI to convert the script into a PySpark notebook. A few input parameters must be passed depending on the source of the Mapping Dataflow. Target Fabric resources such as the workspace ID, lakehouse name, and lakehouse ID can also be passed; these are written into the notebook metadata.
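The API-based flow above can be sketched as follows. The endpoint shape matches the documented ADF "Data Flows - Get" REST call; the helper name and the exact response field holding the script lines are assumptions for illustration, not this tool's actual code.

```python
# Sketch of building the ADF "Data Flows - Get" request URL.
# 2018-06-01 is the documented api-version for Data Factory dataflows.
API_VERSION = "2018-06-01"

def dataflow_url(subscription_id: str, rg: str, factory: str, name: str) -> str:
    """Return the management-plane URL for fetching a dataflow definition.

    The JSON response's properties.typeProperties contains the dataflow
    script lines, which are then sent to OpenAI for conversion (assumed
    field location; verify against the ADF REST reference).
    """
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}/resourceGroups/{rg}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/dataFlows/{name}?api-version={API_VERSION}"
    )
```

The actual call must carry an Azure AD bearer token for the `https://management.azure.com` audience.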

The tool has not been tested with all transformations supported by Azure Mapping Dataflow.

Before sending dataflow scripts to the API, review OpenAI Privacy.

💁 You can also try converting an Azure Mapping Dataflow to a Fabric notebook using Scala combinator custom parsers: see mapping-data-flow-to-spark

Design

(Design: PlantUML sequence diagram — PlantUmlSequeneDiagram.png)

Installation

  • Python > 3.10.11
  • pip install -r requirements.txt

Usages

Set the following environment variables:

Mandatory

  • OPENAI_API_KEY - your OpenAI API key

Optional

  • LOG_LEVEL - debug or info; the application defaults to info
  • OPENAI_MODEL - OpenAI model name; the application defaults to gpt-4
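A minimal sketch of how this configuration could be read, using the variable names and defaults documented above (the function name is illustrative, not the tool's actual code):

```python
import os

def load_config() -> dict:
    """Read the tool's configuration from environment variables.

    OPENAI_API_KEY is mandatory; LOG_LEVEL and OPENAI_MODEL fall back
    to the documented defaults.
    """
    if "OPENAI_API_KEY" not in os.environ:
        raise RuntimeError("OPENAI_API_KEY is mandatory")
    return {
        "api_key": os.environ["OPENAI_API_KEY"],
        "log_level": os.environ.get("LOG_LEVEL", "info"),   # debug or info
        "model": os.environ.get("OPENAI_MODEL", "gpt-4"),
    }
```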

⚠️ Don't forget to read the Limitation section before you run a large conversion.

Get DataFlow Script Lines from API

You need to pass the following parameters:

  • source=api
  • rg - resource group name
  • dataFlowName - data flow name
  • factoryName - Azure data factory name
  • lakeHouseId - Existing target Microsoft Fabric lakehouse Id
  • lakeHouseName - Existing target Microsoft Fabric lakehouse name
  • workSpaceId - Existing target Microsoft Fabric workspace Id
  • subscriptionId - subscription id
```
python.exe main.py --kwargs source=api rg=<resource group> dataFlowName=<dataflow name> factoryName=<adf name> \
  lakeHouseId=<fabric lakehouse id> lakeHouseName=<fabric lakehouse name> workSpaceId=<fabric workspace id> \
  subscriptionId=<azure subscription id>
```

Get DataFlow Script Lines from local file

You need to pass the following parameters:

  • source=file
  • sourceFile - dataflow script code file path
  • dataFlowName - data flow name
  • lakeHouseId - Existing target Microsoft Fabric lakehouse Id
  • lakeHouseName - Existing target Microsoft Fabric lakehouse name
  • workSpaceId - Existing target Microsoft Fabric workspace Id
```
python.exe main.py --kwargs source=file sourceFile=<dataflow script code file path> dataFlowName=<dataflow name> \
  lakeHouseId=<fabric lakehouse id> lakeHouseName=<fabric lakehouse name> workSpaceId=<fabric workspace id>
```

Two output files will be generated:

  1. Notebook with dataflow name
  2. PySpark code in .py file

Limitation

Since we use the ChatCompletion API from OpenAI to generate the desired output, we need to consider the length of the input text against the token limit of the chosen model. For example, GPT-4 can handle up to 8,192 tokens per request (input and output combined), while GPT-3 can handle only up to 4,096 tokens. Text longer than the model's token limit will not fit and may be cut off or ignored.

The max_tokens parameter in the ChatCompletion API allows you to limit the length of the input or output generated by the model to a specified number of tokens. Tokens are chunks of text that language models read, and they can be as short as one character or as long as one word, depending on the language and context.

Setting a very low value for max_tokens can result in the response being cut off abruptly, potentially leading to an output that doesn't make sense or lacks context. The max_tokens parameter is a useful tool to control response length, but setting it too low can negatively impact the quality and coherence of the responses.

What does this mean for the user? If your generated code is longer than roughly 8K tokens, it will be truncated and the result will not be complete code.
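A rough pre-flight check is therefore useful before sending a large dataflow script to the API. The sketch below uses the common ~4-characters-per-token rule of thumb (an assumption; for exact counts, use OpenAI's tiktoken library):

```python
GPT4_TOKEN_LIMIT = 8192  # combined input + output tokens for gpt-4

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def fits_in_context(script: str, reserved_for_output: int = 4096,
                    limit: int = GPT4_TOKEN_LIMIT) -> bool:
    """True if the input likely leaves enough room for the generated notebook code."""
    return estimate_tokens(script) + reserved_for_output <= limit
```

Scripts that fail this check are candidates for splitting before conversion.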

OpenAI's gpt-4-32k model has been around for a while, but its rollout is extremely limited.

Future Scope of work

  1. Integration with Azure OpenAI
  2. How to extend this when the mapping dataflow script code exceeds 8,192 tokens

References

  1. OpenAI API
  2. AI Model tokens
  3. How to count tokens?
