BUG: Repeated dataset registration in subprocess mode causes issues in interactive session
jeandut opened this issue · comments
What are you trying to do?
I am iteratively debugging code with SubstraFL in subprocess mode in a notebook, which means I run the same lines repeatedly, notably the client instantiation and dataset registration.
In subprocess mode, organization IDs are incremented by 1 each time a client is created; therefore, with 2 clients, repeatedly running the line that instantiates clients creates MyOrg1, MyOrg2, MyOrg3, MyOrg4, etc.
By contrast, when registering datasets in the in-RAM DB, the first registration prevails: registering the same data twice keeps it attached to the original organizations (in this case MyOrg1 and MyOrg2).
Therefore, when running the cell twice, I end up sending tasks from MyOrg3 and MyOrg4 on datasets held by MyOrg1 and MyOrg2, which raises this error:
`InvalidRequest: The task worker must be the organization that contains the data: MyOrg1MSP`
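The mechanism can be reproduced without Substra itself. Below is a toy model (all names hypothetical, purely illustrative) of an in-memory backend that hands out a fresh org ID for every client ever created, while dataset ownership is first-write-wins; re-running the "cell" drifts the org IDs away from the data owners exactly as described above:

```python
# Toy model of the subprocess backend's in-memory DB (illustrative only;
# ToyBackend, new_client_org, register_dataset are NOT Substra APIs).
class ToyBackend:
    def __init__(self):
        self.org_counter = 0      # incremented for every client ever created
        self.dataset_owner = {}   # dataset name -> owning org (first write wins)

    def new_client_org(self):
        self.org_counter += 1
        return f"MyOrg{self.org_counter}"

    def register_dataset(self, name, org):
        # First registration prevails, as in the in-RAM DB.
        return self.dataset_owner.setdefault(name, org)

backend = ToyBackend()

# First notebook run: clients MyOrg1/MyOrg2 own their datasets.
run1 = [backend.new_client_org() for _ in range(2)]
owners1 = [backend.register_dataset(f"data{i}", org) for i, org in enumerate(run1)]

# Re-running the same cell: new clients MyOrg3/MyOrg4, but ownership is unchanged.
run2 = [backend.new_client_org() for _ in range(2)]
owners2 = [backend.register_dataset(f"data{i}", org) for i, org in enumerate(run2)]

print(run2)     # ['MyOrg3', 'MyOrg4']
print(owners2)  # ['MyOrg1', 'MyOrg2'] -> tasks from MyOrg3/4 on data owned by MyOrg1/2
```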
The code run multiple times would be:
```python
clients = []
for i in range(n_clients):
    clients.append(Client(backend_type="subprocess"))
clients = {c.organization_info().organization_id: c for c in clients}

# Store organization IDs
ORGS_ID = list(clients.keys())
ALGO_ORG_ID = ORGS_ID[0]  # The algo provider is defined as the first organization.
DATA_PROVIDER_ORGS_ID = ORGS_ID

if data_path is None:
    (Path.cwd() / "tmp").mkdir(exist_ok=True)
    data_path = Path.cwd() / "tmp" / "data_eca"
else:
    data_path = Path(data_path)
data_path.mkdir(exist_ok=True)

n_per_client = int(len(df.index) / n_clients)
dfs = []
for i in range(n_clients):
    os.makedirs(data_path / f"center{i}", exist_ok=True)
    cdf = df.iloc[
        range(i * n_per_client, min((i + 1) * n_per_client, len(df.index)))
    ]
    cdf.to_csv(data_path / f"center{i}" / "data.csv", index=False)
    dfs.append(cdf)

assets_directory = Path("scripts") / "substra_assets"

dataset_keys = {}
datasample_keys = {}
for i, org_id in enumerate(DATA_PROVIDER_ORGS_ID):
    client = clients[org_id]
    if len(client.list_dataset()) > 0:
        # Datasets found: skip registration.
        continue
    permissions_dataset = Permissions(public=False, authorized_ids=[ALGO_ORG_ID])
    # DatasetSpec is the specification of a dataset. It makes sure every field
    # is well defined, and that our dataset is ready to be registered.
    # The real dataset object is created in the add_dataset method.
    dataset = DatasetSpec(
        name=f"data{i}",
        type="csv",
        data_opener=assets_directory / "csv_opener.py",
        description=assets_directory / "description.md",
        permissions=permissions_dataset,
        logs_permission=permissions_dataset,
    )
    dataset_keys[org_id] = client.add_dataset(dataset)
    assert dataset_keys[org_id], "Missing dataset key"
    # Add the training data on each organization.
    data_sample = DataSampleSpec(
        data_manager_keys=[dataset_keys[org_id]],
        path=data_path / f"center{i}",
    )
    datasample_keys[org_id] = client.add_data_sample(data_sample)

train_data_nodes = []
test_data_nodes = []
for org_id in DATA_PROVIDER_ORGS_ID:
    # Create the train data node (or training task) and save it in a list.
    train_data_node = TrainDataNode(
        organization_id=org_id,
        data_manager_key=dataset_keys[org_id],
        data_sample_keys=[datasample_keys[org_id]],
    )
    train_data_nodes.append(train_data_node)
    # Create the test data node (or testing task) and save it in a list.
    test_data_node = TestDataNode(
        organization_id=org_id,
        data_manager_key=dataset_keys[org_id],
        test_data_sample_keys=[datasample_keys[org_id]],
        metric_keys=[],
    )
    test_data_nodes.append(test_data_node)

# Code running the strategy; it is abruptly stopped (assert False or Ctrl-C).
# The second time this cell is run, it raises the error above.
execute_experiment(train_data_nodes, strategy)
```
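As an aside, the per-client split in the script used `max((i + 1) * n_per_client, len(df.index))`, which makes every center's slice run to the end of the DataFrame; it should be `min`. A standalone sketch (plain Python indices, no pandas, hypothetical helper name) shows the intended behaviour:

```python
def split_indices(n_rows, n_clients):
    """Contiguous, near-equal, non-overlapping chunks of row indices.

    `min` caps the last chunk at n_rows; with `max` (the original bug),
    every chunk would instead extend to the end of the DataFrame.
    """
    n_per_client = n_rows // n_clients
    return [
        list(range(i * n_per_client, min((i + 1) * n_per_client, n_rows)))
        for i in range(n_clients)
    ]

chunks = split_indices(10, 3)  # n_per_client = 3
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]] -- row 9 dropped by floor division
```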
Issue Description (what is happening?)
`InvalidRequest: The task worker must be the organization that contains the data: MyOrg1MSP`
is raised incorrectly.
Expected Behavior (what should happen?)
No exception should be raised; the datasets should match the clients.
Some code emptying the DB, called on garbage collection, should work:

```python
# We need to avoid persistence of the DB between runs; this is an obscure hack, but it works.
database = first_client._backend._db._db._data
if len(database.keys()) > 1:
    for k in list(database.keys()):
        database.pop(k)
```
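Another workaround, instead of clearing the private DB, is to make the cell idempotent: only instantiate clients on the first run and reuse the cached ones afterwards, so no new organizations are created. A minimal sketch of the guard pattern (with a counting stand-in for `Client(backend_type="subprocess")`; `make_client` and `get_clients` are hypothetical names, not Substra APIs):

```python
creation_count = 0

def make_client():
    """Stand-in for Client(backend_type="subprocess"); counts instantiations."""
    global creation_count
    creation_count += 1
    return f"client{creation_count}"

def get_clients(cache, n):
    # Create clients only on the first call; re-runs reuse the cached ones,
    # so no new organizations (MyOrg3, MyOrg4, ...) are created.
    if not cache:
        cache.extend(make_client() for _ in range(n))
    return cache

client_cache = []
first = get_clients(client_cache, 2)
second = get_clients(client_cache, 2)  # simulated re-run of the notebook cell
print(second, creation_count)  # ['client1', 'client2'] 2
```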
Reproducible Example
No response
Operating system
MacOS
Python version
3.9.16
Installed Substra versions
substra==0.43.0
substrafl==0.35.1
Installed versions of dependencies
No response
Logs / Stacktrace
No response
The issue is due to the usage of `client.list_dataset`. We added a few lines to the documentation to explain the behaviour and the impact of the permissions on listing assets: Substra/substra-documentation#322
To fix your script, you need to use `client.list_dataset(filters={"owner": [client_org_id]})`. Using this filter, you will get the expected behaviour.
Thanks for sharing this and for helping us describe our concepts better!
I got hit quite hard by this issue again... I lost 2 days of debugging. I don't know if this should be "fixed" (i.e., changed) or not, but I think something should be done, at least in the docs. Cheers.