OpenMined / SyferText

A privacy-preserving NLP framework

Add functionality for building vocab.

Nilanshrajput opened this issue

Build vocab for a given dataset.

- Build the vocab from an iterator over strings of data (see the sketch after this list).
- Add functions to get all the vectors loaded from the dataset (important for assigning weights to the embedding layer).
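
A minimal sketch of what such a builder could look like, assuming a plain whitespace tokenizer; `build_vocab` and the special tokens are illustrative names, not SyferText API:

```python
from collections import Counter
from typing import Dict, Iterable

def build_vocab(texts: Iterable[str], min_freq: int = 1) -> Dict[str, int]:
    """Build a token -> index mapping from an iterator over raw strings.

    Tokenization here is a simple whitespace split for illustration;
    SyferText would use its own tokenizer.
    """
    counter = Counter()
    for text in texts:
        counter.update(text.split())
    # Reserve index 0 for padding and 1 for out-of-vocabulary tokens.
    vocab = {"<pad>": 0, "<unk>": 1}
    for token, freq in counter.most_common():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab
```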

I am not sure I understand the issue @Nilanshrajput. What do you mean by:

Usually we have a separate vocab for the data and for the labels, and we should be able to build vocabs for them separately.

Actually, that example is wrong; a label vocab doesn't mean anything (it's just a tag map).
The issue is about building a Vocab over the given corpus (word embeddings of a custom size):
in an nn.Embedding(input_dim, embedding_dim) layer you can replace the initial weights with pre-trained vectors. It is also good to set input_dim to the number of unique tokens in the corpus instead of some arbitrary number.
To get these values we need to iterate over the complete dataset and build the vocab.
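
For example, with PyTorch (the sizes and placeholder vectors below are illustrative; in practice they come from the vocab built over the corpus and the vectors loaded from the dataset):

```python
import torch
import torch.nn as nn

vocab_size = 20_000      # number of unique tokens found in the dataset
embedding_dim = 300
pretrained = torch.randn(vocab_size, embedding_dim)  # placeholder vectors

# input_dim matches the number of unique tokens, and the initial
# weights are replaced with the pre-trained vectors.
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
with torch.no_grad():
    embedding.weight.copy_(pretrained)

# PyTorch also offers a one-step constructor for the same thing:
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
```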

@Nilanshrajput Good point. This is closely related to what @sachin-101 is working on: building a cross-worker vocabulary and giving each token an index in it. Basically, we consider token indices to be private. Thus, the embedding layer should be located on the same worker as the dataset.

That raises the following question: is the vocabulary itself private? I don't think there is one correct answer here, so we should handle both cases: a private vocab and a public vocab.

I agree. The first step, then, is to send a computation object that takes a handle to the dataset object and outputs its set of unique tokens, which we can then feed into PSI (as @sachin-101 is doing) when multiple workers are involved.
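
The token-extraction step could look something like the sketch below; how the computation actually gets shipped to the worker depends on PySyft and is not shown here:

```python
from typing import Iterable, Set

def unique_tokens(dataset: Iterable[str]) -> Set[str]:
    """Runs on the data owner's worker: returns the set of unique tokens.

    This token set is what would feed a PSI protocol when several
    workers each hold a shard of the corpus.
    """
    tokens: Set[str] = set()
    for text in dataset:
        tokens.update(text.split())
    return tokens
```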

I think the data owner should decide whether their vocab itself is private, via a property on the future dataset object that we will create (private_vocab: True/False).
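
Something like the following, where `TextDataset` is an invented name; this only sketches how the proposed private_vocab flag could sit on the dataset object:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextDataset:
    """Hypothetical dataset wrapper; `private_vocab` is the proposed flag."""
    texts: List[str]
    private_vocab: bool = True  # the data owner decides; defaults to private

    def vocab_is_shareable(self) -> bool:
        # A public vocab could be sent to the model owner directly;
        # a private one would only participate via PSI / encrypted ops.
        return not self.private_vocab
```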

@Nilanshrajput I will close this, since @sachin-101's work on PSI addresses part of this issue. Also, I think this is a family of issues rather than a single issue. It will become clearer when we start implementing different model types.