Substra / substra

Low-level Python library used to interact with a Substra network

Home Page: https://docs.substra.org

[Feature Request] Passing list of data hashes instead of sets to traintuples

jeandut opened this issue · comments

One use case that Substra does not support today is using a single, static algo.py to run either one training epoch over a dataset or several epochs, without any modification to the algo or the Dockerfile.

The limitation comes from a uniqueness check on the data sample hashes passed to a traintuple: they must all be distinct (i.e. the list can be cast into a set).
If the hashes given to the traintuple are not unique, Substra raises:
error is: Key: 'inputTraintuple.DataSampleKeys' Error:Field validation for 'DataSampleKeys' failed on the 'unique' tag"}
and the traintuple cannot be processed.
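
For illustration, the rejected registration boils down to a check of this kind (a simplified sketch in Python, not the actual backend validation):

requested_keys = ["A", "B", "C", "D"] * 2  # each key appears twice

# The backend enforces something equivalent to this "unique" constraint:
if len(requested_keys) != len(set(requested_keys)):
    raise ValueError("Field validation for 'DataSampleKeys' failed on the 'unique' tag")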

For instance, consider the case where you want your traintuple to operate on the following hashes: ["A", "B", "C", "D"].

One might want to use a single algorithm similar to the following:
algo.py:

import json
import substratools as tools
from networks import MyNetwork

class ComputeUpdates(tools.Algo):
    def train(self, X, y, models, rank):
        # assuming the opener created a [n_hashes, d] numpy array X and a [n_hashes, d_target] array y
        my_model = MyNetwork()
        for i in range(X.shape[0]):
            my_model.update_weights(X[i], y[i])
        return my_model

    def predict(self, X, model):
        # dummy predictions; prediction is not the focus of this issue
        predictions = 0
        return predictions

    def load_model(self, path):
        # json.load / json.dump work on file objects, not on paths
        with open(path) as f:
            return json.load(f)

    def save_model(self, model, path):
        with open(path, "w") as f:
            json.dump(model, f)


if __name__ == '__main__':
    tools.algo.execute(ComputeUpdates())

So, to do one epoch, one would register a traintuple using this algorithm and the data sample hashes s = ["A", "B", "C", "D"], and it would work.

Now, to do N epochs instead of one without modifying the algo, the obvious solution would be to pass s*N instead of s when registering the traintuple.
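
To make the intent concrete, here is a small NumPy sketch (purely illustrative; the sample values and the build_X helper are made up and only mimic what an opener could return) of why passing the duplicated list would amount to N epochs inside a single train() call:

import numpy as np

# Hypothetical per-sample features, keyed by data sample hash (illustrative only).
samples = {"A": [0.1, 0.2], "B": [0.3, 0.4], "C": [0.5, 0.6], "D": [0.7, 0.8]}

def build_X(hashes):
    # Mimics an opener that stacks one row per requested hash, duplicates included.
    return np.array([samples[h] for h in hashes])

s = ["A", "B", "C", "D"]
N = 3
X_one_epoch = build_X(s)      # shape (4, 2): train() loops over the data once
X_n_epochs = build_X(s * N)   # shape (12, 2): the same loop now makes N passes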

However, this is not possible: Substra raises the above error because each hash is present N times, so the list cannot be cast into a set.

This feature would be very valuable for my workflow!

It would also make it possible to support more complicated plans with a fractional number of epochs, by passing only the hashes of the samples we intend to see (some samples multiple times, some just once).

Thanks again!

My question is very naive, but wouldn't it be equivalent to creating N traintuples each taking as input the full set of data samples s and the trained model from the previous step?

Yes, it would be mathematically equivalent in this case, but it would add lag because a Docker container is spawned for each traintuple, and the orchestration logic would be a bit more complicated.

In order to reduce the lag to very little, you could use a compute plan to create all the traintuples at once. The Docker images built for the compute plan are not removed until the end of the plan, so spawning a new container is very fast (no need to rebuild between traintuples).
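
As a rough sketch of that workaround (the spec layout and the add_compute_plan call below are assumptions about the client API, to be checked against the documentation):

N = 5  # number of epochs
s = ["A", "B", "C", "D"]  # unique data sample keys

ALGO_KEY = "<registered algo key>"             # placeholder
DATA_MANAGER_KEY = "<registered dataset key>"  # placeholder

# Build N chained traintuples: each one trains on the full sample set and
# takes the model produced by the previous traintuple as its input model.
# NOTE: the dictionary keys below are assumed, check the compute plan spec.
traintuples = []
for i in range(N):
    traintuples.append({
        "traintuple_id": f"epoch_{i}",
        "algo_key": ALGO_KEY,
        "data_manager_key": DATA_MANAGER_KEY,
        "train_data_sample_keys": s,  # unique keys, so the validation passes
        "in_models_ids": [] if i == 0 else [f"epoch_{i - 1}"],
    })

# The whole plan would then be registered in one call, along the lines of:
# client.add_compute_plan({"traintuples": traintuples})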

Closing as stale