metagraph-dev / metagraph

Multi-target API for graph analytics with Dask

Home Page: https://metagraph.readthedocs.io/en/latest/


Add `NodeEmbeddings` abstract type

eriknw opened this issue · comments

This is like a NodeMap where the value is a Vector. Just as a NodeMap can be converted to a Vector, a NodeEmbedding can be converted to a Matrix.
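To make the NodeMap analogy concrete, here is a minimal sketch in plain Python with NumPy (not the metagraph API; the names and the dict representation are illustrative only) of a node-to-vector mapping being stacked into a matrix:

```python
# Hypothetical sketch: a NodeEmbeddings maps each node id to a
# fixed-length vector; converting it to a Matrix stacks those
# vectors as rows, just as NodeMap -> Vector fixes a node order.
import numpy as np

node_embeddings = {
    0: np.array([0.1, 0.2, 0.3]),
    1: np.array([0.4, 0.5, 0.6]),
    2: np.array([0.7, 0.8, 0.9]),
}

# Fix a node order, then stack the per-node vectors into an
# (num_nodes x embedding_size) matrix.
node_order = sorted(node_embeddings)
matrix = np.stack([node_embeddings[n] for n in node_order])
print(matrix.shape)  # (3, 3)
```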

I'm unsure about the name. NodesEmbedding? NodeEmbeddings? NodeMapToVectors? NodeMapOfVectors? NodeToVectors?

I'm going to bring up some thoughts and questions I had about the Embedding type given what we discussed in our last meeting.

Embedding Implementation

In our last meeting, we talked about how it's useful to separate out the training and inference phases of the embedding algorithm. Here was my thought on how we could accomplish this in an Embedding abstract type:

  • It should have a required __call__ method.
  • It should have a required input_type property.
  • It should have a required return_type property. This should be a Matrix concrete type.
  • The __call__ method would take a tuple of inputs, which can be graphs, nodes, edges, tuples, etc. (tuples are useful since graph_sage takes a graph plus a node and returns that node’s embedding within the graph), and return a matrix of size input_count x embedding_size.
  • Example:
        embedding = res.algos.embedding.graph_sage(..., embedding_size=500, ...)
        matrix = embedding( ((graph_1, node_1), ..., (graph_N, node_N)) )  # matrix has shape N x 500
  • Example:
        embedding = res.algos.embedding.node2vec(..., embedding_size=200, ...)
        matrix = embedding( (node_1, ..., node_N) )  # matrix has shape N x 200

This would sufficiently separate the training and inference phases of the embedding. The training happens when the embedding algorithm is called, but the inference happens when the returned Embedding's __call__ method is used.
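The proposal above can be sketched in plain Python. This is not the metagraph API; `Node2VecEmbedding`, its stored vectors, and the use of a bare numpy array as the "Matrix concrete type" are illustrative assumptions to show the training/inference split:

```python
# Sketch of the proposed Embedding abstract type: training produces
# an Embedding object, and inference happens when __call__ is used.
from abc import ABC, abstractmethod

import numpy as np


class Embedding(ABC):
    @property
    @abstractmethod
    def input_type(self):
        """Type of each element in the inputs tuple."""

    @property
    @abstractmethod
    def return_type(self):
        """Concrete Matrix type returned by __call__."""

    @abstractmethod
    def __call__(self, inputs):
        """Map a tuple of inputs to a len(inputs) x embedding_size matrix."""


class Node2VecEmbedding(Embedding):
    """Illustrative concrete embedding holding vectors learned in training."""

    def __init__(self, vectors):
        self._vectors = vectors  # node id -> 1-D numpy array

    @property
    def input_type(self):
        return int  # node ids

    @property
    def return_type(self):
        return np.ndarray

    def __call__(self, inputs):
        # Inference only: look up trained vectors and stack them as rows.
        return np.stack([self._vectors[node] for node in inputs])


# "Training" happens here (stubbed with constant vectors) ...
emb = Node2VecEmbedding({n: np.full(4, float(n)) for n in range(10)})
# ... and inference happens here, via __call__.
matrix = emb((1, 3, 5))
print(matrix.shape)  # (3, 4)
```

The `input_type`/`return_type` properties are what would let the resolver validate or translate inputs before inference.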

What are your thoughts on this proposal?

Further Thoughts

It may be the case that the input_type and return_type properties aren't strictly necessary, but they seem useful for validating inputs. Perhaps we can go without them. If we do, does it make sense to have an embedding type in metagraph? Would simply returning a callable be sufficient?

Perhaps they become useful if we want to embed graphs, nodes, etc. of a different type; the resolver's translator could use these types to convert the inputs first. This might require a "perform embedding inference" algorithm that is called through the resolver instead of simply using a __call__ method. Does this route sound reasonable? It would certainly make performing the inference more verbose.

Example:

    embedding = res.algos.embedding.node2vec(..., embedding_size=200, ...)
    matrix = res.algos.embedding.apply_embedding(embedding, (node_1, ..., node_N))  # matrix has shape N x 200

Embedding Translation

When it comes to the embedding abstract/concrete type, it’s not exactly clear what it means to translate from one to another since the embedding will be a callable. Should we forbid translations on embedding?

Motivation Behind Embedding Type

Regarding whether or not an embedding type is motivated, it seems useful to me because we could run the embedding algorithm on the GPU, get a callable that returns a GPU matrix, use metagraph to translate the GPU matrix to PUMA, and then do further work with those matrices on PUMA. Even though we don't have many algorithms that take matrices right now, it still seems useful for expert users who will take these matrices into some other library (e.g. PyTorch, TensorFlow) and do something with them there.

Is this sufficiently motivating? Are there any concerns I did not mention?