sethiaarun / mapping-dataflow-to-fabric-with-openai

Convert Azure Mapping dataflow to Microsoft Fabric PySpark Notebook using OpenAI

Introduction

This tool uses the OpenAI API to convert Azure Mapping Dataflow script code into [Microsoft Fabric PySpark](https://learn.microsoft.com/en-us/fabric/data-engineering/how-to-use-notebook) notebook code.

The tool either calls the ADF REST API to retrieve the dataflow script code or reads it from a local file, then uses OpenAI to convert the script into a PySpark notebook. A few input parameters must be passed depending on the source of the Mapping Dataflow. Target Fabric resources such as the workspace ID, lakehouse name, and lakehouse ID can also be passed; these are written into the notebook metadata.
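The API-based flow above can be sketched as follows. The endpoint shape matches the documented ADF "Data Flows - Get" REST call; the helper name and the exact response field holding the script lines are assumptions for illustration, not this tool's actual code.

```python
# Sketch of building the ADF "Data Flows - Get" request URL.
# 2018-06-01 is the documented api-version for Data Factory dataflows.
API_VERSION = "2018-06-01"

def dataflow_url(subscription_id: str, rg: str, factory: str, name: str) -> str:
    """Return the management-plane URL for fetching a dataflow definition.

    The JSON response's properties.typeProperties contains the dataflow
    script lines, which are then sent to OpenAI for conversion (assumed
    field location; verify against the ADF REST reference).
    """
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}/resourceGroups/{rg}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/dataFlows/{name}?api-version={API_VERSION}"
    )
```

The actual call must carry an Azure AD bearer token for the `https://management.azure.com` audience.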

The tool has not been tested with all transformations supported by Azure Mapping Dataflow.

Before sending dataflow scripts to the API, review OpenAI Privacy.

💁 You can also try converting an Azure Mapping Dataflow to a Fabric notebook using Scala combinator custom parsers: see mapping-data-flow-to-spark

Design

(Design: PlantUML sequence diagram — PlantUmlSequeneDiagram.png)

Installation

  • Python > 3.10.11
  • pip install -r requirements.txt

Usages

Set the following environment variables:

Mandatory

  • OPENAI_API_KEY - your OpenAI API key

Optional

  • LOG_LEVEL - debug or info; the application defaults to info
  • OPENAI_MODEL - OpenAI model name; the application defaults to gpt-4
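A minimal sketch of how this configuration could be read, using the variable names and defaults documented above (the function name is illustrative, not the tool's actual code):

```python
import os

def load_config() -> dict:
    """Read the tool's configuration from environment variables.

    OPENAI_API_KEY is mandatory; LOG_LEVEL and OPENAI_MODEL fall back
    to the documented defaults.
    """
    if "OPENAI_API_KEY" not in os.environ:
        raise RuntimeError("OPENAI_API_KEY is mandatory")
    return {
        "api_key": os.environ["OPENAI_API_KEY"],
        "log_level": os.environ.get("LOG_LEVEL", "info"),   # debug or info
        "model": os.environ.get("OPENAI_MODEL", "gpt-4"),
    }
```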

⚠️ Don't forget to read the Limitation section before you run a large conversion.

Get DataFlow Script Lines from API

You need to pass the following parameters:

  • source=api
  • rg - resource group name
  • dataFlowName - data flow name
  • factoryName - Azure data factory name
  • lakeHouseId - Existing target Microsoft Fabric lakehouse Id
  • lakeHouseName - Existing target Microsoft Fabric lakehouse name
  • workSpaceId - Existing target Microsoft Fabric workspace Id
  • subscriptionId - subscription id
```
python.exe main.py --kwargs source=api rg=<resource group> dataFlowName=<dataflow name> factoryName=<adf name> \
  lakeHouseId=<fabric lakehouse id> lakeHouseName=<fabric lakehouse name> workSpaceId=<fabric workspace id> \
  subscriptionId=<azure subscription id>
```

Get DataFlow Script Lines from local file

You need to pass the following parameters:

  • source=file
  • sourceFile - dataflow script code file path
  • dataFlowName - data flow name
  • lakeHouseId - Existing target Microsoft Fabric lakehouse Id
  • lakeHouseName - Existing target Microsoft Fabric lakehouse name
  • workSpaceId - Existing target Microsoft Fabric workspace Id
```
python.exe main.py --kwargs source=file sourceFile=<dataflow script code file path> dataFlowName=<dataflow name> \
  lakeHouseId=<fabric lakehouse id> lakeHouseName=<fabric lakehouse name> workSpaceId=<fabric workspace id>
```

Two output files will be generated:

  1. Notebook with dataflow name
  2. PySpark code in .py file

Limitation

Since we use the ChatCompletion API from OpenAI to generate the desired output, we need to consider the length of the input text against the token limit of the chosen model. For example, GPT-4 can handle up to 8,192 tokens per request (input and output combined), while GPT-3 can handle only up to 4,096 tokens. Text longer than the model's token limit will not fit and may be cut off or ignored.

The max_tokens parameter in the ChatCompletion API allows you to limit the length of the input or output generated by the model to a specified number of tokens. Tokens are chunks of text that language models read, and they can be as short as one character or as long as one word, depending on the language and context.

Setting a very low value for max_tokens can result in the response being cut off abruptly, potentially leading to an output that doesn't make sense or lacks context. The max_tokens parameter is a useful tool to control response length, but setting it too low can negatively impact the quality and coherence of the responses.

What does this mean for the user? If your generated code is longer than roughly 8K tokens, it will be truncated and the result will not be complete code.
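A rough pre-flight check is therefore useful before sending a large dataflow script to the API. The sketch below uses the common ~4-characters-per-token rule of thumb (an assumption; for exact counts, use OpenAI's tiktoken library):

```python
GPT4_TOKEN_LIMIT = 8192  # combined input + output tokens for gpt-4

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def fits_in_context(script: str, reserved_for_output: int = 4096,
                    limit: int = GPT4_TOKEN_LIMIT) -> bool:
    """True if the input likely leaves enough room for the generated notebook code."""
    return estimate_tokens(script) + reserved_for_output <= limit
```

Scripts that fail this check are candidates for splitting before conversion.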

OpenAI's gpt-4-32k model has been around for a while, but its rollout is extremely limited.

Future Scope of work

  1. Integration with Azure OpenAI
  2. How to extend this when the mapping dataflow script code exceeds 8,192 tokens

References

  1. OpenAI API
  2. AI Model tokens
  3. How to count tokens?
