OpenMined / SyferText

A privacy preserving NLP framework

Get TokenPointer by iterating on a DocPointer

sachin-101 opened this issue

At times, when we are dealing only with public datasets that reside on different workers (i.e. a Federated Learning setup), I guess we should allow the user to retrieve pointers to tokens (TokenPointer) by iterating on the DocPointer.
The TokenPointer can then allow the user to retrieve the text of the corresponding Token.
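
To make the proposal concrete, here is a rough sketch of what iterating on a DocPointer could look like. Everything in it is a toy stand-in: `ToyWorker`, `get_text()` and the constructor arguments are hypothetical and are not the SyferText or PySyft API.

```python
class ToyWorker:
    """Hypothetical stand-in for a remote worker that holds tokenized documents."""

    def __init__(self):
        self._docs = {}

    def store(self, doc_id, text):
        # Whitespace tokenization keeps the sketch small.
        self._docs[doc_id] = text.split()

    def num_tokens(self, doc_id):
        return len(self._docs[doc_id])

    def token_text(self, doc_id, index):
        return self._docs[doc_id][index]


class TokenPointer:
    """Refers to one token on the worker without holding its text locally."""

    def __init__(self, worker, doc_id, index):
        self.worker = worker
        self.doc_id = doc_id
        self.index = index

    def get_text(self):
        # Hypothetical method: fetches the token text from the worker.
        return self.worker.token_text(self.doc_id, self.index)


class DocPointer:
    """Refers to a whole document on the worker."""

    def __init__(self, worker, doc_id):
        self.worker = worker
        self.doc_id = doc_id

    def __len__(self):
        return self.worker.num_tokens(self.doc_id)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return TokenPointer(self.worker, self.doc_id, index)

    def __iter__(self):
        # Yield one TokenPointer per remote token.
        for index in range(len(self)):
            yield self[index]


worker = ToyWorker()
worker.store("doc-1", "public datasets can be shared")
doc_ptr = DocPointer(worker, "doc-1")
texts = [token_ptr.get_text() for token_ptr in doc_ptr]
```

In this sketch only `get_text()` moves raw text off the worker; the iteration itself only hands out pointers.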

Can I work on this issue?

Yes go ahead @codeboy5 👍

I think we should wait for @AlanAboudib's views on it. 🙂

I agree, but we are already getting a token's text from a Token Pointer in #78. It makes sense to iterate on them as well. I guess we should wait for it to get merged before anyone works on this.

@sachin-101 I don't think it's really needed currently. Right now you can get a pointer to a tensor of all token vectors. I don't see a scenario where someone benefits from seeing the text, even for a public dataset; you can work with the vectors directly.
It's not really an essential feature. We should wait to implement this until permissions are defined in PySyft.

We should not implement a feature with a potential privacy vulnerability when it's not really required or important. We can do this after public datasets and permissions are defined; until then we work as if all datasets are private. Also, if a dataset is public, you can probably get it directly on your own system and do anything with it.

@sachin-101 @bicycleman15 @Nilanshrajput I have actually been thinking about this issue. I can think of two different views:

1- Nilansh's view. We let the API enforce privacy as far as possible as long as no permission layer is available.
2- (PySyft philosophy) We implement functionalities that break privacy but, at the same time, give us a better idea of what our needs in terms of permissions are. This feedback could be valuable to the Core, Security or PyGrid teams once they start working on permissions.

I am actually more inclined to option 1) after reading this thread. This will give us quicker access to working on real use cases with partners without having many privacy-leaking features. It also lets us focus more on the private scenarios. Eager to hear your thoughts about this.

Yup, I also thought that if we implement as much privacy as we can for now, we would be able to partner on real-world cases. Otherwise, if we implement this now, we will have to update it again later if we get a real use case while the permission layers are still not developed.

@AlanAboudib @Nilanshrajput @bicycleman15 In our current scenario, we have already implemented __getitem__ on DocPointer, which allows us to get a pointer to a Token.
Now, the benefits of implementing TokenPointer are huge, because it has enabled us to modify and add a lot of features that are easy to think of and add now but would become very cumbersome later on.

And from a privacy point of view, the only problem (I believe) with getting a TokenPointer is that its string() method allows us to retrieve a pointer to the String on the remote machine, on which we can call .get() and get the private string.
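
A toy illustration of that leak path, assuming a plain dict plays the role of the remote machine. `StringPointer` and the constructor arguments below are hypothetical; only the `string()` then `.get()` chain comes from the description above.

```python
class StringPointer:
    """Hypothetical pointer to a string that lives on the remote machine."""

    def __init__(self, remote_store, key):
        self._remote_store = remote_store
        self._key = key

    def get(self):
        # Copies the remote string to the caller: this is the step that leaks.
        return self._remote_store[self._key]


class TokenPointer:
    """Hypothetical pointer to a single remote token."""

    def __init__(self, remote_store, key):
        self._remote_store = remote_store
        self._key = key

    def string(self):
        # Returns another pointer, so nothing has left the remote machine yet.
        return StringPointer(self._remote_store, self._key)


# The dict stands in for storage on the remote worker.
remote_store = {"token-0": "confidential"}
token_ptr = TokenPointer(remote_store, "token-0")

string_ptr = token_ptr.string()  # still only a pointer
leaked = string_ptr.get()        # the private text is now local
```

So the TokenPointer itself is harmless in this sketch; the exposure happens only at the final `.get()`.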

So what I suggest is that we develop keeping both Federated Learning and Private NLP in mind. But when collaborating or working on a real use case, we comment out or remove those specific functions which leak data. Only those which leak data at the very end, e.g. the string() method in TokenPointer.

And when the permissions come, just a few little tweaks and we are done.

@sachin-101 commenting out methods sounds very bad and is not really practical.
One thing: for most use cases, features related to TokenPointer and even SpanPointer are not essential. What you really need for any NLP task is just DocPointer, and our priority should be to give users that much functionality now, without any big privacy vulnerability that we know of, so that SyferText can be used easily. Not having these small functionalities should not get in the way of the bigger picture here.

The point is, we should develop SyferText keeping in mind the first priority, which is privacy.
Later we can include features which might be useful at some level for public datasets.

@AlanAboudib @bicycleman15 @sachin-101

Let's put it this way. If we know why we need SpanPointer and TokenPointer, i.e. there is a concrete use case that we can think of, we keep them. If we don't, we remove them. In this sense I agree with @Nilanshrajput, but I am also willing to listen to @sachin-101's and @bicycleman15's thoughts on why they built those features.

I am strongly in favor of building features only if we have a strong idea why we are building them.

@AlanAboudib SpanPointer has all the same functionalities as a DocPointer: getting SMPC-encrypted vectors, getting encrypted token vectors, and any other functionality you can think of for DocPointer.

(Although I am in favour of having a parent class for Doc and Span, and one parent for DocPointer and SpanPointer as well.)
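
A rough sketch of the shared-parent idea, just to show the shape. `TokenContainer` and its methods are made-up names for illustration and do not exist in SyferText; the same pattern could in principle be repeated for DocPointer and SpanPointer.

```python
from abc import ABC, abstractmethod


class TokenContainer(ABC):
    """Hypothetical common parent: anything that exposes a sequence of tokens."""

    @abstractmethod
    def tokens(self):
        """Return the list of tokens this object covers."""

    def __len__(self):
        return len(self.tokens())

    def __getitem__(self, index):
        return self.tokens()[index]


class Doc(TokenContainer):
    def __init__(self, words):
        self._words = list(words)

    def tokens(self):
        return self._words

    def span(self, start, end):
        return Span(self, start, end)


class Span(TokenContainer):
    """A view over part of a Doc; inherits the shared behaviour unchanged."""

    def __init__(self, doc, start, end):
        self.doc = doc
        self.start = start
        self.end = end

    def tokens(self):
        return self.doc.tokens()[self.start:self.end]


doc = Doc("the cat sat on the mat".split())
span = doc.span(1, 3)
```

Anything written once on the parent (length, indexing, and in a fuller version vector retrieval) then works on both the whole and the part.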

@AlanAboudib Regarding the TokenPointer, I literally can't think of any specific use case. It was built for consistency and was suggested by you. 🙃

In an FL setup, we can actually call .get() on DocPointer rather than calling .get() on each TokenPointer.
@bicycleman15 any suggestions?

@sachin-101 We know the use case for getting vectors from a DocPointer, but I don't see any use for getting the vectors of a Span in any NLP task that I can think of.

Span is just a section of Doc. What you can do on the whole, you can do on a part. It's more of a design principle.

Yup, I got that, but the thing is you don't use vectors from spans directly for most use cases, even if you can! I'm talking about practical use cases!

There might be a case when someone needs sentence embeddings as well. In that case, Span Pointer can be useful.

A simple application of sentence embeddings can be seen in deep averaging networks where people usually use sentence embeddings in sentiment analysis and question answering. In fact sentence embeddings are used to build universal sentence encoders.
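
For reference, the averaging step that deep averaging networks start from can be sketched in a few lines. The vectors and span boundaries below are invented for illustration; in the private setting the vectors would be encrypted or remote, which this plain-Python sketch does not model.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty span")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical token vectors for a document with four tokens.
token_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [3.0, 1.0],
]

# Hypothetical sentence boundaries as (start, end) token offsets,
# i.e. what a Span would carry.
sentence_spans = [(0, 2), (2, 4)]

# One embedding per sentence: the mean of the token vectors in its span.
sentence_embeddings = [
    average_vectors(token_vectors[start:end]) for start, end in sentence_spans
]
```

A deep averaging network would then pass each averaged vector through feed-forward layers; the point here is only that the average is taken per span, not per document.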

Of course, the same thing can be done with a Doc Pointer too, but when someone uses SyferText they will usually add a sentencizer to the pipeline when they want to work with sentences. Naturally they will get back Spans, or Span Pointers when the dataset is remote. Converting them to Doc Pointers would confuse the user: locally they get Spans, but when working remotely they would get back Doc Pointers (when they were expecting Span Pointers because they got Spans locally). That is why we have kept Span Pointers. Those are my thoughts @AlanAboudib @sachin-101 @Nilanshrajput 🙂

@bicycleman15 I think you are right here. Although a user could hack around with the data to pass a sentence to a Doc, it would be better if we provide that functionality. But there should be no text attributes in SpanPointer. And I am still in favour of removing TokenPointer!

Regarding Token Pointers, we built them for consistency purposes only. I don't think they have any use case 😅. So yeah, I think they could be removed.

Great, so we shall remove the TokenPointer from #78.