Application that uses the CDF APIs to replicate data to and from Microsoft Fabric.
The replicator consists of three services:
- Time series replicator - Copies time series data from CDF to Fabric
- Data model replicator - Copies data model nodes and edges from CDF to Fabric
- Fabric data extractor - Copies time series, events, and files from Fabric to CDF
All three services run concurrently while the CDF Fabric Replicator program is running. The services maintain checkpoints of the latest data read in a state store in CDF's raw storage, so they can be stopped and restarted and will pick up where they left off.
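These checkpoints are plain rows in a RAW table, so you can inspect them directly. A minimal sketch using the Python SDK (client setup is covered below), assuming the hypothetical database and table names "replicator_state" and "checkpoints"; in practice, substitute whatever you configure in COGNITE_STATE_DB and COGNITE_STATE_TABLE:

from cognite.client import CogniteClient

client = CogniteClient()  # assumes a default client config is already set (see the client setup below)

# Print the checkpoint rows the replicator has written to RAW.
# "replicator_state" and "checkpoints" are hypothetical names; use the
# values of COGNITE_STATE_DB and COGNITE_STATE_TABLE from your setup.
for row in client.raw.rows.list(db_name="replicator_state", table_name="checkpoints", limit=25):
    print(row.key, row.columns)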
The time series replicator uses data point subscriptions to get updates on incoming time series data. Currently, the only way to create these subscriptions is through the Cognite SDK.
Here is an example of how to set up a subscription using the Python SDK:
First, install the SDK using pip:
pip install cognite-sdk
Next, set up the OAuth credentials the client will use for authentication. You can get these values from an administrator:
from cognite.client.credentials import OAuthClientCredentials
import os

oauth_creds = OAuthClientCredentials(
    token_url="https://login.microsoftonline.com/xyz/oauth2/v2.0/token",  # Auth token URL; replace "xyz" with the Azure tenant ID
    client_id="abcd",  # Client ID of the service principal for interacting with Cognite
    client_secret=os.environ["OAUTH_CLIENT_SECRET"],  # Secret for the service principal; store it in an environment variable as a best practice
    scopes=["https://greenfield.cognitedata.com/.default"],  # Scope, contains the cluster name for the CDF project
)
Create the Cognite Client using these credentials:
from cognite.client import CogniteClient, ClientConfig, global_config

cnf = ClientConfig(
    client_name="my-special-client",
    base_url="https://greenfield.cognitedata.com",  # Base URL for the CDF project, includes the cluster name
    project="project-name",  # CDF project name
    credentials=oauth_creds,  # OAuth credentials from earlier
)
global_config.default_client_config = cnf
client = CogniteClient()
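Before creating subscriptions, you can optionally sanity-check that the client authenticates. This step is not required by the replicator, just a quick way to catch credential mistakes early:

# Inspect the token to confirm the OAuth credentials work and to see
# which projects and capabilities the service principal has.
print(client.iam.token.inspect())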
Finally, create the subscriptions by referencing the external IDs of the time series to which you would like to subscribe:
from cognite.client.data_classes import DataPointSubscriptionWrite
sub = DataPointSubscriptionWrite(
    external_id="mySubscription",
    partition_count=1,
    time_series_ids=["myFirstTimeSeries", "mySecondTimeSeries"],
    name="My subscription",
)
created = client.time_series.subscriptions.create(sub)
The external ID of the subscription (in this case, "mySubscription") will be used in the configuration file for the replicator. For more specifics on the SDK, please refer to the SDK documentation.
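If you want to see the update stream the replicator consumes from such a subscription, the SDK can iterate it directly. A minimal sketch, assuming the "mySubscription" subscription created above already exists:

# Poll the subscription for batches of data point updates. This cursor-based
# polling is the same mechanism the time series replicator relies on to
# resume from its last checkpoint.
for batch in client.time_series.subscriptions.iterate_data("mySubscription"):
    print(f"{len(batch.updates)} updates, has_next={batch.has_next}")
    if not batch.has_next:
        break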
You can optionally copy the contents of .env.example to a .env file that will be used to set the values in the config YAML file:
- COGNITE_BASE_URL: The base URL of the Cognite project, e.g. https://<cluster_name>.cognitedata.com
- COGNITE_PROJECT: The project ID of the Cognite project.
- COGNITE_TOKEN_URL: The URL to obtain the authentication token for the Cognite project, e.g. https://login.microsoftonline.com/<tenant_id>/oauth2/v2.0/token
- COGNITE_CLIENT_ID: The client ID for authentication with the Cognite project.
- COGNITE_CLIENT_SECRET: The client secret for authentication with the Cognite project.
- COGNITE_STATE_DB: The database in CDF raw storage where the replicator state should be stored.
- COGNITE_STATE_TABLE: The table in CDF raw storage where the replicator state should be stored. The replicator will create the table if it does not exist.
- COGNITE_EXTRACTION_PIPELINE: The extraction pipeline in CDF for the replicator. Learn more about configuring extractors remotely.
- LAKEHOUSE_ABFSS_PREFIX: The prefix for the Azure Blob File System Storage (ABFSS) path. Should match the pattern abfss://<workspace_id>@msit-onelake.dfs.fabric.microsoft.com/<lakehouse_id>. Get this value by selecting "Properties" on your Lakehouse Tables location and copying "ABFS path".
- DPS_TABLE_NAME: The name of the table in Fabric where data point values and timestamps should be stored. The replicator will create the table if it does not exist.
- TS_TABLE_NAME: The name of the table in Fabric where time series metadata should be stored. The replicator will create the table if it does not exist.
- EXTRACTOR_EVENT_PATH: The ABFSS file path for the events table in a Fabric lakehouse.
- EXTRACTOR_FILE_PATH: The ABFSS file path for the files in a Fabric lakehouse.
- EXTRACTOR_RAW_TS_PATH: The ABFSS file path for the time series table in a Fabric lakehouse.
- EXTRACTOR_DATASET_ID: Specifies the ID of the extractor data set when the data lands in CDF.
- EXTRACTOR_TS_PREFIX: Specifies the prefix for the extractor time series when the data lands in CDF.
- TEST_CONFIG_PATH: Specifies the path to the test configuration file used to configure test versions of the replicator.
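Because the configuration file below substitutes these variables, a missing one can surface as a confusing error at startup. As an illustration (this script is not part of the repo), you could verify that the core variables are set before launching:

import os

# Core variables referenced by the example configuration below.
required = [
    "COGNITE_BASE_URL", "COGNITE_PROJECT", "COGNITE_TOKEN_URL",
    "COGNITE_CLIENT_ID", "COGNITE_CLIENT_SECRET",
    "COGNITE_STATE_DB", "COGNITE_STATE_TABLE",
    "LAKEHOUSE_ABFSS_PREFIX", "DPS_TABLE_NAME", "TS_TABLE_NAME",
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")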
The replicator reads its configuration from a YAML file specified in the run command. You can write your own YAML file based on example_config.yaml in the repo. That configuration file uses the environment variables from .env, but the values can also be hard-coded.
The subscriptions and data_modeling configurations are lists, so you can configure multiple data point subscriptions or data modeling spaces to replicate into Fabric.
logger:
  console:
    level: INFO

# Cognite project to stream your datapoints from
cognite:
  host: ${COGNITE_BASE_URL}
  project: ${COGNITE_PROJECT}
  idp-authentication:
    token-url: ${COGNITE_TOKEN_URL}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_BASE_URL}/.default
  extraction-pipeline:
    external-id: ts-sub

# Extractor config
extractor:
  state-store:
    raw:
      database: ${COGNITE_STATE_DB}
      table: ${COGNITE_STATE_TABLE}
  subscription-batch-size: 10000
  ingest-batch-size: 100000
  poll-time: 5

# Subscriptions to stream
subscriptions:
  - external_id: ts-subscription
    partitions:
      - 0
    lakehouse_abfss_path_dps: ${LAKEHOUSE_ABFSS_PREFIX}/Tables/${DPS_TABLE_NAME}
    lakehouse_abfss_path_ts: ${LAKEHOUSE_ABFSS_PREFIX}/Tables/${TS_TABLE_NAME}

# Sync data model
data_modeling:
  - space: cc_plant
    lakehouse_abfss_prefix: ${LAKEHOUSE_ABFSS_PREFIX}

source:
  abfss_prefix: ${LAKEHOUSE_ABFSS_PREFIX}
  event_path: ${EXTRACTOR_EVENT_PATH}
  file_path: ${EXTRACTOR_FILE_PATH}
  raw_time_series_path: ${EXTRACTOR_RAW_TS_PATH}
  data_set_id: ${EXTRACTOR_DATASET_ID}

destination:
  type: ${EXTRACTOR_DESTINATION_TYPE}
  time_series_prefix: ${EXTRACTOR_TS_PREFIX}
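The ${VAR} placeholders are resolved from environment variables when the configuration is loaded. To preview the resolved file, you can approximate the substitution with Python's standard library (for inspection only; the replicator performs its own resolution):

import os

# Expand ${VAR} references against the current environment so you can
# verify the resolved values before starting the replicator.
with open("config.yaml") as f:
    print(os.path.expandvars(f.read()))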
To run the cdf_fabric_replicator application, you can use Poetry, a dependency management and packaging tool for Python.
First, make sure you have Poetry installed on your system. If not, you can install it by following the instructions in the Poetry documentation.
Once Poetry is installed, navigate to the root directory of your project in your terminal.
Next, run the following command to install the project dependencies:
poetry install
Finally, run the replicator:
poetry run cdf_fabric_replicator config.yaml