Get TokenPointer by iterating on a DocPointer

Question

sachin-101 opened this issue 4 years ago

Sometimes we deal only with public datasets that reside on different workers (i.e., a Federated Learning setup). In that case, I think we should let the user retrieve pointers to tokens (TokenPointer) by iterating over the DocPointer.
The TokenPointer could then let the user retrieve the text of the corresponding Token.
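
For illustration, here is a minimal sketch of the proposed behavior. It uses toy stand-in classes (`ToyWorker`, `ToyDocPointer`, `ToyTokenPointer`, and `get_text()` are all hypothetical names), not the actual SyferText or PySyft API, and the "remote" worker is just a local object:

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class ToyWorker:
    """Hypothetical stand-in for a remote worker that owns tokenized documents."""
    name: str
    docs: Dict[int, List[str]]


@dataclass
class ToyTokenPointer:
    """Hypothetical pointer to a single token of a document held by a worker."""
    worker: ToyWorker
    doc_id: int
    position: int

    def get_text(self) -> str:
        # Pulls the raw token text to the caller, so this is only
        # appropriate when the dataset is public.
        return self.worker.docs[self.doc_id][self.position]


@dataclass
class ToyDocPointer:
    """Hypothetical pointer to a whole document held by a worker."""
    worker: ToyWorker
    doc_id: int

    def __len__(self) -> int:
        return len(self.worker.docs[self.doc_id])

    def __getitem__(self, position: int) -> ToyTokenPointer:
        if not 0 <= position < len(self):
            raise IndexError(position)
        return ToyTokenPointer(self.worker, self.doc_id, position)

    def __iter__(self) -> Iterator[ToyTokenPointer]:
        # Iterating yields one token pointer per token; no text moves yet.
        for position in range(len(self)):
            yield self[position]


bob = ToyWorker("bob", {0: ["private", "nlp", "rocks"]})
doc_ptr = ToyDocPointer(bob, 0)
texts = [token_ptr.get_text() for token_ptr in doc_ptr]
print(texts)
```

Iteration itself only creates pointers; text crosses over to the caller only when `get_text()` is called on a pointer.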

Saksham · Answer 1 · Mon May 04 2020 14:48:18 GMT+0800 (China Standard Time)

Can I work on this issue?

Jatin Prakash · Answer 2 · Mon May 04 2020 15:19:39 GMT+0800 (China Standard Time)

Yes go ahead @codeboy5 👍

Sachin Kumar · Answer 3 · Mon May 04 2020 15:21:18 GMT+0800 (China Standard Time)

I think we should wait for @AlanAboudib's views on it. 🙂

Jatin Prakash · Answer 4 · Mon May 04 2020 15:53:21 GMT+0800 (China Standard Time)

I agree, but we are already getting a token's text from a TokenPointer in #78. It makes sense to iterate over them as well. I guess we should wait for it to get merged first before anyone works on this.

Nilansh Rajput · Answer 5 · Mon May 04 2020 16:40:07 GMT+0800 (China Standard Time)

@sachin-101 I don't think it's really needed currently. You can already get a pointer to a tensor of all token vectors. I don't see a scenario where someone benefits from seeing the text, even for a public dataset; you can work with the vectors directly.
It's not really an essential feature. We should wait to implement this until permissions are defined in PySyft.

Nilansh Rajput · Answer 6 · Mon May 04 2020 16:44:12 GMT+0800 (China Standard Time)

We should not implement a feature with a potential privacy vulnerability when it's not really required or important. We can do this after public datasets and permissions are defined; until then, we work as if all datasets are private. Also, if a dataset is public, you can probably get it directly on your own system and do anything with it.

Alan Aboudib · Answer 7 · Mon May 04 2020 20:34:05 GMT+0800 (China Standard Time)

@sachin-101 @bicycleman15 @Nilanshrajput I have actually been thinking about this issue. I can think of two different views:

1- Nilansh's view. We let the API enforce privacy as much as possible as long as no permission layer is available.
2- (PySyft philosophy) We implement functionalities that break privacy but, at the same time, give us a better idea of what our needs are in terms of permissions. This feedback could be valuable to the Core, Security or PyGrid teams once they start working on permissions.

I am actually more inclined to option 1) after reading this thread. This will give us quicker access to working on real use cases with partners without having many privacy-leaking features. It also lets us focus more on the private scenarios. Eager to hear your thoughts about this.

Nilansh Rajput · Answer 8 · Mon May 04 2020 21:01:55 GMT+0800 (China Standard Time)

Yup, I also thought that if we move towards implementing as much privacy as we can for now, we would be able to partner on real-world cases. Otherwise, if we implement that now, we will have to update it again in the future if a real use case comes up and the permission layers are still not developed.

Sachin Kumar · Answer 9 · Mon May 04 2020 21:48:39 GMT+0800 (China Standard Time)

@AlanAboudib @Nilanshrajput @bicycleman15 In our current scenario, we have already implemented __getitem__ on DocPointer which allows us to get a pointer for a Token.
Now, the benefits of implementing TokenPointer are huge, because it has enabled us to modify and add a lot of features that are easy to think of and add now but would become very cumbersome later on.

And from a privacy point of view, the only problem (I believe) with getting a TokenPointer is that its string() method allows us to retrieve a pointer to the String on the remote machine, on which we can call .get() to get the private string.
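
To make the leak concrete, here is a toy sketch of that call chain. The classes below are hypothetical stand-ins (a plain dict plays the remote machine); only the `string()` and `.get()` method names come from the description above, and nothing here is the real SyferText or PySyft implementation:

```python
from typing import Dict


class ToyStringPointer:
    """Hypothetical pointer to a string stored on a remote machine."""

    def __init__(self, remote_store: Dict[str, str], key: str) -> None:
        self._remote_store = remote_store
        self._key = key

    def get(self) -> str:
        # This is the step that leaks: the private string is
        # copied from the remote store to the caller.
        return self._remote_store[self._key]


class ToyLeakyTokenPointer:
    """Hypothetical token pointer that exposes a string() method."""

    def __init__(self, remote_store: Dict[str, str], key: str) -> None:
        self._remote_store = remote_store
        self._key = key

    def string(self) -> ToyStringPointer:
        # Returns only a pointer, but that pointer can be resolved with get().
        return ToyStringPointer(self._remote_store, self._key)


# A plain dict stands in for the remote worker's private object store.
remote_store = {"token-0": "secret"}
token_ptr = ToyLeakyTokenPointer(remote_store, "token-0")

leaked = token_ptr.string().get()
print(leaked)
```

Without a permission check somewhere in this chain, holding the token pointer is enough to recover the text.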

So what I suggest is that we develop keeping both Federated Learning and Private NLP in mind. But when collaborating or working on a real use case, ~~comment out~~ remove those specific functions which leak data, and only those which leak data at the very end, e.g. the string() method in TokenPointer.

And when permissions arrive, it takes just a few little tweaks and we are done.

Nilansh Rajput · Answer 10 · Mon May 04 2020 22:49:41 GMT+0800 (China Standard Time)

@sachin-101 commenting out methods sounds very bad and is not really practical.
One thing: for most use cases, features related to TokenPointer and even SpanPointer are not essential. What you really need for any NLP task is just DocPointer. Our priority should be to give users that much functionality now, without any big privacy vulnerability that we know of, so SyferText can be used easily. Not having these small functionalities should not hamper the bigger picture here.

Nilansh Rajput · Answer 11 · Mon May 04 2020 22:59:20 GMT+0800 (China Standard Time)

The point is, we should develop SyferText keeping in mind the first priority, which is privacy.
Later we can include features which might be useful at some level for public datasets.

@AlanAboudib @bicycleman15 @sachin-101

Alan Aboudib · Answer 12 · Tue May 05 2020 20:11:37 GMT+0800 (China Standard Time)

Let's put it this way. If we know why we need SpanPointer and TokenPointer, i.e. a concrete use case that we can think of, we keep them. If we don't, we remove them. In this sense I agree with @Nilanshrajput, but I am also willing to listen to @sachin-101's and @bicycleman15's thoughts on why they built those features.

I am strongly in favor of building features only if we have a strong idea why we are building them.

Sachin Kumar · Answer 13 · Tue May 05 2020 20:22:22 GMT+0800 (China Standard Time)

@AlanAboudib SpanPointer has all the same functionalities as a DocPointer, namely getting SMPC-encrypted vectors and getting encrypted token vectors, plus any other functionality you can think of for DocPointer.

(Although I am in favour of having a parent class for Doc and Span, and one parent for DocPointer and SpanPointer as well.)
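
A minimal sketch of that shared-parent idea, using toy classes whose names and methods are hypothetical and not taken from SyferText:

```python
from typing import List, Sequence


class ToyTokenContainer:
    """Hypothetical shared parent: anything holding a sequence of token vectors."""

    def __init__(self, vectors: Sequence[Sequence[float]]) -> None:
        self._vectors = [list(v) for v in vectors]

    def __len__(self) -> int:
        return len(self._vectors)

    def get_token_vectors(self) -> List[List[float]]:
        # Return copies so callers cannot mutate the container's state.
        return [list(v) for v in self._vectors]


class ToySpan(ToyTokenContainer):
    """A section of a document; inherits everything from the parent."""


class ToyDoc(ToyTokenContainer):
    """A whole document; the only extra ability is slicing out a span."""

    def span(self, start: int, end: int) -> ToySpan:
        return ToySpan(self._vectors[start:end])


doc = ToyDoc([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
span = doc.span(1, 3)
print(len(doc), len(span))
print(span.get_token_vectors())
```

The same arrangement could be mirrored on the pointer side, with one parent for DocPointer and SpanPointer.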

Sachin Kumar · Answer 14 · Tue May 05 2020 20:25:07 GMT+0800 (China Standard Time)

@AlanAboudib Regarding the TokenPointer, I literally can't think of any specific use case. It was built for consistency and was suggested by you. 🙃

In an FL setup, we can actually call .get() on DocPointer rather than calling .get() on each TokenPointer.
@bicycleman15 any suggestions?

Nilansh Rajput · Answer 15 · Tue May 05 2020 20:32:29 GMT+0800 (China Standard Time)

@sachin-101 We know the use case for getting vectors from a DocPointer, but I don't see any use for getting the vectors of a Span in any NLP task that I can think of.

Sachin Kumar · Answer 16 · Tue May 05 2020 20:35:26 GMT+0800 (China Standard Time)

Span is just a section of Doc. What you can do on the whole, you can do on a part. It's more of a design principle.

Nilansh Rajput · Answer 17 · Tue May 05 2020 20:38:52 GMT+0800 (China Standard Time)

Yup, I got that, but the thing is you don't use vectors from spans directly for most use cases, even if you can! I'm talking about practical use cases!

Jatin Prakash · Answer 18 · Tue May 05 2020 21:48:28 GMT+0800 (China Standard Time)

There might be cases when someone needs sentence embeddings as well. In that case, SpanPointer can be useful.

Jatin Prakash · Answer 19 · Tue May 05 2020 21:52:15 GMT+0800 (China Standard Time)

A simple application of sentence embeddings can be seen in deep averaging networks, where people usually use sentence embeddings for sentiment analysis and question answering. In fact, sentence embeddings are used to build universal sentence encoders.
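
As a rough illustration of the averaging step behind such sentence embeddings, here is a small sketch. Plain Python lists stand in for the token vectors, the function name `average_embedding` is made up for this example, and it covers only the averaging, not the feed-forward layers of a full deep averaging network:

```python
from typing import List, Sequence


def average_embedding(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the element-wise mean of a sentence's token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vector) != dim for vector in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vector[i] for vector in token_vectors) / count for i in range(dim)]


# Three 2-dimensional token vectors making up one sentence (toy values).
sentence_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
embedding = average_embedding(sentence_vectors)
print(embedding)  # [3.0, 4.0]
```

With a Span or SpanPointer per sentence, this kind of average could be taken over exactly the tokens of that sentence.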

Jatin Prakash · Answer 20 · Tue May 05 2020 21:59:26 GMT+0800 (China Standard Time)

Of course, the same thing can be done with DocPointer too, but when someone uses SyferText, they will usually add a sentencizer to the pipeline when they want to work with sentences. Naturally, they will get back Spans, or SpanPointers when the dataset is remote. Converting them to DocPointers would confuse the user: locally they get Spans, but when working remotely they get back DocPointers (when they were expecting SpanPointers because they got Spans locally). That is why we have kept SpanPointers. Those are my thoughts @AlanAboudib @sachin-101 @Nilanshrajput 🙂

Nilansh Rajput · Answer 21 · Tue May 05 2020 22:02:02 GMT+0800 (China Standard Time)

@bicycleman15 I think you are right here. Although a user could hack around with the data to pass a sentence to a Doc, it would be better if we provide that functionality. But there should be no text attributes in SpanPointer. And I am still in favour of removing TokenPointer!

Jatin Prakash · Answer 22 · Tue May 05 2020 22:02:39 GMT+0800 (China Standard Time)

Regarding TokenPointers, we built them for consistency purposes only. I don't think they have any use case 😅. So yeah, I think they could be removed.

Sachin Kumar · Answer 23 · Mon May 11 2020 00:29:35 GMT+0800 (China Standard Time)

Great, so we shall remove the TokenPointer from #78.