Substra / substra

Low-level Python library used to interact with a Substra network

Home Page: https://docs.substra.org


BUG: Repeated dataset registration in subprocess mode causes issues in interactive session

jeandut opened this issue · comments

What are you trying to do?

I am iteratively debugging code with substrafl in subprocess mode in a notebook, which means I run the same lines repeatedly, notably the client instantiation and the dataset registration.
In subprocess mode, the client's organization number is incremented each time a client is created, so with 2 clients, repeatedly running the line that instantiates them creates MyOrg1, MyOrg2, MyOrg3, MyOrg4, etc.
By contrast, when registering the datasets in the in-RAM db, the first registration prevails: registering the same data twice assigns it to the same orgs (in this case MyOrg1 and MyOrg2).
Therefore, on the second run I end up sending tasks as MyOrg3 and MyOrg4 on datasets held by MyOrg1 and MyOrg2, raising this error:

InvalidRequest: The task worker must be the organization that contains the data: MyOrg1MSP
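
A minimal illustration of the incrementing behaviour (a sketch; the printed ids are examples of what the local backend produces):

from substra import Client

# First run of the cell: two new organizations are created.
c1 = Client(backend_type="subprocess")
c2 = Client(backend_type="subprocess")
print(c1.organization_info().organization_id)  # e.g. MyOrg1MSP
print(c2.organization_info().organization_id)  # e.g. MyOrg2MSP

# Re-running the same cell creates two more organizations, while the
# datasets registered earlier still belong to the first two.
c3 = Client(backend_type="subprocess")
print(c3.organization_info().organization_id)  # e.g. MyOrg3MSP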

The code run multiple times would be:

    # Assumes n_clients, df (a pandas DataFrame), data_path and strategy are
    # defined earlier; the imports below are what the snippet needs.
    import os
    from pathlib import Path

    from substra import Client
    from substra.sdk.schemas import DataSampleSpec, DatasetSpec, Permissions
    from substrafl.nodes import TestDataNode, TrainDataNode

    clients = []
    for i in range(n_clients):
        clients.append(Client(backend_type="subprocess"))
    clients = {c.organization_info().organization_id: c for c in clients}
    # Store organization IDs
    ORGS_ID = list(clients.keys())
    ALGO_ORG_ID = ORGS_ID[0]  # Algo provider is defined as the first organization.
    DATA_PROVIDER_ORGS_ID = ORGS_ID

    if data_path is None:
        (Path.cwd() / "tmp").mkdir(exist_ok=True)
        data_path = Path.cwd() / "tmp" / "data_eca"
    else:
        data_path = Path(data_path)

    data_path.mkdir(exist_ok=True)
    n_per_client = int(len(df.index) / n_clients)
    dfs = []
    for i in range(n_clients):
        os.makedirs(data_path / f"center{i}", exist_ok=True)
        # min (not max) caps each slice at the end of the frame
        cdf = df.iloc[
            range(i * n_per_client, min((i + 1) * n_per_client, len(df.index)))
        ]
        cdf.to_csv(data_path / f"center{i}" / "data.csv", index=False)
        dfs.append(cdf)

    assets_directory = Path("scripts") / "substra_assets"
    dataset_keys = {}
    datasample_keys = {}
    for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID):
        client = clients[org_id]
        if len(client.list_dataset()) > 0:
            # Datasets already registered for this org, skip it
            continue
        permissions_dataset = Permissions(public=False, authorized_ids=[ALGO_ORG_ID])

        # DatasetSpec is the specification of a dataset. It makes sure every field
        # is well defined, and that our dataset is ready to be registered.
        # The real dataset object is created in the add_dataset method.

        dataset = DatasetSpec(
            name=f"data{i}",
            type="csv",
            data_opener=assets_directory / "csv_opener.py",
            description=assets_directory / "description.md",
            permissions=permissions_dataset,
            logs_permission=permissions_dataset,
        )
        dataset_keys[org_id] = client.add_dataset(dataset)
        assert dataset_keys[org_id], "Missing dataset key"

        # Add the training data on each organization.
        data_sample = DataSampleSpec(
            data_manager_keys=[dataset_keys[org_id]],
            path=data_path / f"center{i}",
        )
        datasample_keys[org_id] = client.add_data_sample(data_sample)

    train_data_nodes = []
    test_data_nodes = []
    for org_id in DATA_PROVIDER_ORGS_ID:
        # Create the Train Data Node (or training task) and save it in a list
        train_data_node = TrainDataNode(
            organization_id=org_id,
            data_manager_key=dataset_keys[org_id],
            data_sample_keys=[datasample_keys[org_id]],
        )
        train_data_nodes.append(train_data_node)

        # Create the Test Data Node (or testing task) and save it in a list
        test_data_node = TestDataNode(
            organization_id=org_id,
            data_manager_key=dataset_keys[org_id],
            test_data_sample_keys=[datasample_keys[org_id]],
            metric_keys=[],
        )
        test_data_nodes.append(test_data_node)
    # Code running the strategy; it is abruptly stopped with assert False or Ctrl-C.
    # The second time this cell is run, it raises the error below.
    execute_experiment(train_data_nodes, strategy)

Issue Description (what is happening?)

InvalidRequest: The task worker must be the organization that contains the data: MyOrg1MSP
is raised incorrectly.

Expected Behavior (what should happen?)

No exception should be raised; datasets should match clients.
Alternatively, some code emptying the db, called on garbage collection, should work:

# We need to avoid persistence of the db in between runs; this is an obscure hack, but it works.
database = first_client._backend._db._db._data
if len(database.keys()) > 1:
    for k in list(database.keys()):
        database.pop(k)
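
The same hack can be wrapped in a helper for reuse between notebook runs (a sketch; reset_local_db is a hypothetical name, and _backend._db._db._data is a private internal of the local backend that may change between substra versions):

def reset_local_db(client):
    # Clear the shared in-memory database of the subprocess backend.
    # Relies on private substra internals, not a public API.
    database = client._backend._db._db._data
    for key in list(database.keys()):
        database.pop(key)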

Reproducible Example

No response

Operating system

MacOS

Python version

3.9.16

Installed Substra versions

substra==0.43.0
substrafl==0.35.1

Installed versions of dependencies

No response

Logs / Stacktrace

No response

The issue is due to the usage of client.list_dataset. We added a few lines to the documentation to explain the behaviour and the impact of permissions on listing assets: Substra/substra-documentation#322

To fix your script, you need to use client.list_dataset(filters={"owner": [client_org_id]}). With this filter, you will get the expected behaviour.
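
Applied to the snippet above, the check in the registration loop becomes (a sketch; org_id is the organization id the loop iterates over):

        # Only count datasets owned by this client's own organization, so
        # assets visible from other organizations don't short-circuit the loop.
        if len(client.list_dataset(filters={"owner": [org_id]})) > 0:
            # Datasets already registered for this org, skip it
            continue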

Thanks for sharing this and for helping us describe our concepts better!

I got hit quite hard by this issue again... lost 2 days of debugging. I don't know if this should be "fixed" (that is, changed) or not, but I think something should be done, at least in the doc. Cheers.