gonum / graph

Graph packages for the Go language [DEPRECATED]

graph/simple: node retrieval from edges is subtle

kortschak opened this issue · comments

Currently, when a concrete graph that stores nodes returns a graph.Edge, the edge holds its own copies of the edge end points. In the cases where they are used in tests this is not an issue, because one concrete.Node is indistinguishable from another if they are converted from the same int. However, we say that we will look after graph.Node interface values too (otherwise why is graph.Node not just an int?).

The consequence of these things is that if I add two nodes and set an edge between them, it is possible to get back nodes that are not the nodes stored in the graph's node list.

For example:

type node struct {
    id int
    v  string
}

func (n node) ID() int { return n.id }

func main() {
    g := concrete.NewDirectedGraph()
    g.AddNode(node{id: 1, v: "foo"})
    g.AddNode(node{id: 2, v: "bar"})
    g.AddDirectedEdge(concrete.Edge{F: concrete.Node(1), T: concrete.Node(2)}, 0)
    fmt.Println(g.EdgeTo(concrete.Node(1), concrete.Node(2)))
}

prints

{1 2}

when I would expect it to print

{{1 foo} {2 bar}}
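The behaviour can be reproduced without the concrete package. The following is a minimal sketch, assuming a hypothetical toyGraph that, like the graph described above, stores whatever edge value it is handed; intNode and edge stand in for concrete.Node and concrete.Edge and are not the real types.

```go
package main

import "fmt"

// Node and Edge mirror the shape of the interfaces discussed in this issue.
type Node interface{ ID() int }

type Edge interface {
	From() Node
	To() Node
}

// intNode stands in for concrete.Node: a node that is nothing but an ID.
type intNode int

func (n intNode) ID() int { return int(n) }

// node is the client's richer node type from the example above.
type node struct {
	id int
	v  string
}

func (n node) ID() int { return n.id }

// edge stands in for concrete.Edge.
type edge struct{ F, T Node }

func (e edge) From() Node { return e.F }
func (e edge) To() Node   { return e.T }

// toyGraph is a hypothetical graph that stores the edge value it is handed.
type toyGraph struct {
	nodes map[int]Node
	edges map[[2]int]Edge
}

func newToyGraph() *toyGraph {
	return &toyGraph{nodes: map[int]Node{}, edges: map[[2]int]Edge{}}
}

func (g *toyGraph) AddNode(n Node) { g.nodes[n.ID()] = n }

// SetEdge keeps e as given, including e's own copies of its end points.
func (g *toyGraph) SetEdge(e Edge) {
	g.edges[[2]int{e.From().ID(), e.To().ID()}] = e
}

func (g *toyGraph) Edge(u, v Node) Edge { return g.edges[[2]int{u.ID(), v.ID()}] }

func main() {
	g := newToyGraph()
	g.AddNode(node{id: 1, v: "foo"})
	g.AddNode(node{id: 2, v: "bar"})
	g.SetEdge(edge{F: intNode(1), T: intNode(2)})

	e := g.Edge(intNode(1), intNode(2))
	// The edge still holds intNode(1), not the node{1, "foo"} stored in g.
	fmt.Println(e.From() == g.nodes[1]) // false
}
```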

It's not obvious how to fix this while retaining the simplicity of the graph.Edge interface, at least when using the current approach to edge handling. The issue is that there is no way for the graph to be able to mutate a graph.Edge that is guaranteed to work; either the edge needs to be stored as a pair of node IDs and a weight (maybe) and then the edge is constructed when g.Edge* is called (we can't do this at the moment) or we take the edge we are handed, mutate it to ensure it is holding the nodes that the graph has (again we can't do this at the moment) and store it. Of these two impossible options, I prefer the first.
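The first option, storing only node IDs and a weight and constructing the edge on retrieval, could look roughly like this. It is a sketch under assumptions: idGraph and its method signatures are hypothetical, and SetEdge takes IDs directly rather than a graph.Edge, which is exactly the part the current interface does not allow.

```go
package main

import "fmt"

type Node interface{ ID() int }

type node struct {
	id int
	v  string
}

func (n node) ID() int { return n.id }

// edge is the value constructed on demand; it is never stored.
type edge struct {
	F, T Node
	W    float64
}

// idGraph is a hypothetical directed graph that stores edges as
// from-ID -> to-ID -> weight and nothing else.
type idGraph struct {
	nodes   map[int]Node
	weights map[int]map[int]float64
}

func newIDGraph() *idGraph {
	return &idGraph{nodes: map[int]Node{}, weights: map[int]map[int]float64{}}
}

func (g *idGraph) AddNode(n Node) { g.nodes[n.ID()] = n }

// SetEdge records only the IDs and the weight. This sketch assumes
// that both nodes have already been added and does not check it.
func (g *idGraph) SetEdge(from, to int, w float64) {
	if g.weights[from] == nil {
		g.weights[from] = map[int]float64{}
	}
	g.weights[from][to] = w
}

// Edge builds an edge from the graph's own node values, so the
// end points are always the nodes held in the node list.
func (g *idGraph) Edge(from, to int) (edge, bool) {
	w, ok := g.weights[from][to]
	if !ok {
		return edge{}, false
	}
	return edge{F: g.nodes[from], T: g.nodes[to], W: w}, true
}

func main() {
	g := newIDGraph()
	g.AddNode(node{id: 1, v: "foo"})
	g.AddNode(node{id: 2, v: "bar"})
	g.SetEdge(1, 2, 0)

	e, _ := g.Edge(1, 2)
	fmt.Println(e.F, e.T)
}
```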

One possible approach is to say that the concrete graphs we provide always return a concrete.Edge (of whatever variety, depending on other issues, see #53). This also solves another issue that I have been mulling over but not yet filed. If clients want other edge types to be returned they must define their own type.

Alternatively, we add SetFrom and SetTo methods to graph.Edge (and not just mutable edges - whatever that might be - all edges). While this adds further methods to the API, I think it is probably the better answer, though it opens the possibility for clients to do bad things like changing the From or To values of an edge without telling the graph. Because of this, it should be documented that an edge must not be modified outside the graph's API if it is held by a graph.

After thinking about this further, I think the additional methods approach is the correct one.

This also allows us to make semantically useful edge returns (the other issue alluded to above); this means, for example, that when a client calls ug.EdgeBetween(u, v) they can be given an edge for which e.From() == u and e.To() == v always hold. The advantage of this is that the client can then assume that To is the other node. There has been some suggestion of adding In() []Edge and Out() []Edge methods to the graph interfaces. Ensuring that the edges were always facing the correct direction would ease client code (this would not affect directed graphs, which already hold the correct direction by definition).

The PR to fix #53 would then also see the addition of SetWeight(float64).

An alternative to adding SetFrom/To would be to use an approach similar to the one now used in mat64 with the implicit transpose; we could add a Reversed edge type that reverses the orientation. This prevents mutation of the internally stored Edge but gives the effect we want. A ReverseOf function would perform the equivalent of the T method in mat64. So:

// Reversed wraps an Edge and presents it with reversed orientation.
type Reversed struct{ Edge }

// From returns the To node of the reversed Edge.
func (e Reversed) From() Node { return e.Edge.To() }

// To returns the From node of the reversed Edge.
func (e Reversed) To() Node { return e.Edge.From() }

// Unreverse performs an implicit reversal by returning the Edge field.
func (e Reversed) Unreverse() Edge { return e.Edge }

// Unreverser is a type that can undo an implicit edge reversal.
type Unreverser interface {
    // Unreverse returns the underlying Edge
    // stored for the implicit reversal.
    Unreverse() Edge
}

// ReverseOf performs an implicit reversal of e.
func ReverseOf(e Edge) Edge {
    if e, ok := e.(Unreverser); ok {
        return e.Unreverse()
    }
    return Reversed{e}
}

Note that this does not solve the primary problem, but only the edge orientation issue.
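For illustration, here is a self-contained usage sketch of the Reversed type. The Node and Edge interfaces are cut down to what the example needs, and intNode and edge are hypothetical stand-ins; only Reversed, Unreverser and ReverseOf follow the listing above.

```go
package main

import "fmt"

type Node interface{ ID() int }

type Edge interface {
	From() Node
	To() Node
}

type intNode int

func (n intNode) ID() int { return int(n) }

type edge struct{ F, T Node }

func (e edge) From() Node { return e.F }
func (e edge) To() Node   { return e.T }

// Reversed, Unreverser and ReverseOf are as in the listing above.
type Reversed struct{ Edge }

func (e Reversed) From() Node      { return e.Edge.To() }
func (e Reversed) To() Node        { return e.Edge.From() }
func (e Reversed) Unreverse() Edge { return e.Edge }

type Unreverser interface{ Unreverse() Edge }

func ReverseOf(e Edge) Edge {
	if e, ok := e.(Unreverser); ok {
		return e.Unreverse()
	}
	return Reversed{e}
}

func main() {
	e := edge{F: intNode(1), T: intNode(2)}

	r := ReverseOf(e)
	fmt.Println(r.From().ID(), r.To().ID()) // 2 1

	// Reversing twice unwraps rather than nesting, so the
	// original edge value comes back unchanged.
	fmt.Println(ReverseOf(r) == Edge(e)) // true
}
```

The stored edge e is never mutated; only the view handed to the client changes.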

An alternative to adding mutation methods to graph.Edge is to change the interface:

type Edge interface {
    FromID() int
    ToID() int
    Weight() float64
}

and add Node(int) Node to graph.Graph.

This, in conjunction with the Reversed type above, covers all the issues, and I think does so reasonably efficiently (many times, an Edge is only used to get a node ID anyway - I'll collect stats on uses - it seems that most cases either use the node to get an ID or pass it as a graph.Node that will be used as a lookup for an ID, where a simple.Node or equivalent could replace the graph.Node itself).
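A rough sketch of how client code might resolve end points under this interface follows. The Edge interface and the Node(int) method are as proposed above; the graph type, idEdge and the endpoints helper are hypothetical and exist only for the example.

```go
package main

import "fmt"

type Node interface{ ID() int }

// Edge is the ID-based interface proposed above.
type Edge interface {
	FromID() int
	ToID() int
	Weight() float64
}

type node struct {
	id int
	v  string
}

func (n node) ID() int { return n.id }

// idEdge is a hypothetical edge that carries only IDs and a weight.
type idEdge struct {
	from, to int
	w        float64
}

func (e idEdge) FromID() int     { return e.from }
func (e idEdge) ToID() int       { return e.to }
func (e idEdge) Weight() float64 { return e.w }

// graph is a hypothetical implementation used only for this example.
type graph struct {
	nodes map[int]Node
	edges map[[2]int]Edge
}

func newGraph() *graph {
	return &graph{nodes: map[int]Node{}, edges: map[[2]int]Edge{}}
}

func (g *graph) AddNode(n Node)     { g.nodes[n.ID()] = n }
func (g *graph) SetEdge(e Edge)     { g.edges[[2]int{e.FromID(), e.ToID()}] = e }
func (g *graph) Edge(u, v int) Edge { return g.edges[[2]int{u, v}] }

// Node returns the node held by the graph for id, or nil if there is none.
func (g *graph) Node(id int) Node { return g.nodes[id] }

// endpoints resolves an edge's end points to the nodes held by g.
func endpoints(g *graph, e Edge) (from, to Node) {
	return g.Node(e.FromID()), g.Node(e.ToID())
}

func main() {
	g := newGraph()
	g.AddNode(node{id: 1, v: "foo"})
	g.AddNode(node{id: 2, v: "bar"})
	g.SetEdge(idEdge{from: 1, to: 2})

	from, to := endpoints(g, g.Edge(1, 2))
	fmt.Println(from, to)
}
```

Because the edge never holds a node value, there is no second copy that can drift from the graph's node list.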

@vladimir-ch What do you think?

(Me doing some heavy context switching) This seems like it would require some radical changes to the package. Without looking at the source code, Has() would have to take int (which is probably a good thing), and missing edge nodes could not be added together with the edge. For consistency also the other query methods should probably take a node id. But it seems natural to use ints instead of Node interface. If the graph holds node from your example above, then adding an edge between two simple.Node is strange. Am I correct or do you have something else on your mind?

Thanks, you pointed to some things that I had ignored. I had not thought it necessary to change much else - I had thought the rest would continue to take graph.Node, though there is no real reason for this. I'd neglected to notice the SetEdge impact, this does have a fairly high impact on the API; what was formerly just g.SetEdge(edge{u, v, w}) becomes

if !g.Has(u) { // Possibly g.Has(u.ID())
    g.AddNode(u)
}
if !g.Has(v) { // Possibly g.Has(v.ID())
    g.AddNode(v)
}
g.SetEdge(edge{u.ID(), v.ID(), w})

Maybe it becomes reasonable to relax the panic on adding nodes to a graph that collide? It would no longer be an addition though, SetNode? The basis for the increased reasonableness is that there is now no longer a reference to the node in the edge except by ID.

The issue comes down to dealing with the problem at the centre of this issue though and what we need to do to fix that. I'm not yet completely happy with any of the solutions, though using ints may be the best so far.

See #45 for the issues around re-setting nodes. The issue that existed was that the adding of a node resulted in the replacement of the edge list for that node. This doesn't need to be the case; we could, if the node exists, just replace the node in the nodes map and leave the edges untouched - deletion would however delete the edges.
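The replace-in-place behaviour described here might be sketched as follows. This is an illustration only: the graph type, the SetNode name (floated earlier in the thread) and the ID-based SetEdge are assumptions, not the package's API.

```go
package main

import "fmt"

type Node interface{ ID() int }

type node struct {
	id int
	v  string
}

func (n node) ID() int { return n.id }

// graph is a hypothetical directed graph keyed by node ID.
type graph struct {
	nodes map[int]Node
	from  map[int]map[int]float64 // from ID -> to ID -> weight
}

func newGraph() *graph {
	return &graph{nodes: map[int]Node{}, from: map[int]map[int]float64{}}
}

// SetNode adds n, or replaces the stored node that has the same ID.
// The edge lists are keyed by ID and are deliberately left untouched.
func (g *graph) SetNode(n Node) { g.nodes[n.ID()] = n }

func (g *graph) SetEdge(u, v int, w float64) {
	if g.from[u] == nil {
		g.from[u] = map[int]float64{}
	}
	g.from[u][v] = w
}

// RemoveNode deletes the node together with every edge attached to it.
func (g *graph) RemoveNode(id int) {
	delete(g.nodes, id)
	delete(g.from, id)
	for _, to := range g.from {
		delete(to, id)
	}
}

func main() {
	g := newGraph()
	g.SetNode(node{id: 1, v: "foo"})
	g.SetNode(node{id: 2, v: "bar"})
	g.SetEdge(1, 2, 1)

	// Re-setting node 1 swaps the node value but keeps edge 1->2.
	g.SetNode(node{id: 1, v: "baz"})
	fmt.Println(g.nodes[1], len(g.from[1]))

	// Removing node 2 also removes edge 1->2.
	g.RemoveNode(2)
	fmt.Println(len(g.from[1]))
}
```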

I had thought the rest would continue to take graph.Node

It probably can. The issue I have is that for example g.Has(u) is nicer than g.Has(u.ID()), but at the same time g.Has(u.ID()) is more natural than g.Has(simple.Node{n}) if g holds nodes of non-simple.Node type.

I'd neglected to notice the SetEdge impact, this does have a fairly high impact on the API; what was formerly just g.SetEdge(edge{u, v, w}) becomes ...

Not being able to add nodes through SetEdge is probably not a big drawback. The graph package itself uses it only in spanning_tree.go, but it can be easily avoided because in both Prim and Kruskal all nodes from g will end up in dst. dstarlite and the generators add all necessary nodes in advance. On the other hand, tests seem to be relying on it extensively.

Maybe it becomes reasonable to relax the panic on adding nodes to a graph that collide? It would no longer be an addition though, SetNode? The basis for the increased reasonableness is that there is now no longer a reference to the node in the edge except by ID.

I would not relax it at the moment.

I'm not yet completely happy with any of the solutions, though using ints may be the best so far.

Agreed.

The issue that existed was that the adding of a node resulted in the replacement of the edge list for that node. This doesn't need to be the case; we could, if the node exists, just replace the node in the nodes map and leave the edges untouched.

Such behaviour seems very natural and reasonable.

deletion would however delete the edges.

That would be also a reasonable and non-surprising behavior.

The issue I have is that for example g.Has(u) is nicer than g.Has(u.ID()), but at the same time g.Has(u.ID()) is more natural than g.Has(simple.Node{n}) if g holds nodes of non-simple.Node type.

Agreed.

On the other hand, tests seem to be relying on it extensively.

The testing code is what I would consider idiomatic usage for client code. So that corpus carries some weight, but I agree, this can be held off by using a helper in the tests that does setEdgeIn(g Graph, u, v graph.Node, w float64) which does the node addition and uses a simple.Edge.
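Such a test helper might look like the sketch below. It assumes an ID-based Has and an edge type that carries only IDs and a weight; the graph type and the edge struct here are hypothetical stand-ins for the real graph and simple.Edge.

```go
package main

import "fmt"

type Node interface{ ID() int }

type node struct {
	id int
	v  string
}

func (n node) ID() int { return n.id }

// edge is a hypothetical ID-based stand-in for simple.Edge.
type edge struct {
	F, T int
	W    float64
}

// graph is a hypothetical graph whose SetEdge no longer adds missing nodes.
type graph struct {
	nodes map[int]Node
	edges map[[2]int]edge
}

func newGraph() *graph {
	return &graph{nodes: map[int]Node{}, edges: map[[2]int]edge{}}
}

func (g *graph) Has(id int) bool { _, ok := g.nodes[id]; return ok }

// AddNode panics on an ID collision, as the current API does.
func (g *graph) AddNode(n Node) {
	if g.Has(n.ID()) {
		panic(fmt.Sprintf("node ID collision: %d", n.ID()))
	}
	g.nodes[n.ID()] = n
}

func (g *graph) SetEdge(e edge) { g.edges[[2]int{e.F, e.T}] = e }

// setEdgeIn adds u and v if they are missing, then sets the edge u->v.
func setEdgeIn(g *graph, u, v Node, w float64) {
	if !g.Has(u.ID()) {
		g.AddNode(u)
	}
	if !g.Has(v.ID()) {
		g.AddNode(v)
	}
	g.SetEdge(edge{F: u.ID(), T: v.ID(), W: w})
}

func main() {
	g := newGraph()
	u, v := node{id: 1, v: "foo"}, node{id: 2, v: "bar"}

	setEdgeIn(g, u, v, 1)
	setEdgeIn(g, v, u, 2) // both nodes exist now, so AddNode is not called again

	fmt.Println(len(g.nodes), len(g.edges)) // 2 2
}
```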

commented

@kortschak, @vladimir-ch It seems like you've both been thinking about this issue for a while, and in doing so have managed to collect a few ideas on how to solve the issues at hand.

As a newcomer, with a potentially fresh set of eyes, I'll try to summarize my understanding of the issues, and their current solutions.

In the original example, information was lost, since the From and To nodes stored in edges were not the original graph.Node instances, but rather distinct graph.Node instances which just happen to use the same node ID. This is a serious issue, as the node ID should uniquely identify a specific graph.Node instance.

With the addition of the Node(id int) graph.Node method, the fix is trivial (although it still does not prevent misuse). Attaching a diff of the original example:

 type node struct {
     id int
     v  string
 }

 func (n node) ID() int { return n.id }

 func main() {
     g := simple.NewDirectedGraph(1, 1)
     g.AddNode(node{id: 1, v: "foo"})
     g.AddNode(node{id: 2, v: "bar"})
-     g.SetEdge(simple.Edge{F: simple.Node(1), T: simple.Node(2)})
+     g.SetEdge(simple.Edge{F: g.Node(1), T: g.Node(2)})
     fmt.Println(g.Edge(simple.Node(1), simple.Node(2)))
 }

Note, only during edge creation is it important to keep track of the original graph.Node instance. For retrieval purposes, even distinct graph.Node instances work correctly, since the only information ever used is the node ID. This is the case with the last line, g.Edge(simple.Node(1), simple.Node(2)). Two distinct graph.Node instances are created simply for retrieval of information, and the Edge method never uses the nodes directly, but only indirectly to gain access to the node IDs.

While I agree that it is more convenient to call g.Has(n) rather than g.Has(n.ID()), I do feel that the current API (for lack of better words) encourages potential misuse because of the subtle details of edge handling.

After having come across the gonum graph representation and having gained some familiarity with the various interfaces of the graph package, I felt thrilled to try it out. The interfaces felt very clean and clear while reading the API docs.

So, I tried to find an implementation of the graph interface to try it out, and came across the simple graph. Having since looked at its source code, much of the magic became clear, but I still remember my initial few gotchas with the API. To find a user of the API, I took a look at the gen package which generates random graphs, and as it makes heavy use of creating new nodes on the fly (e.g. simple.Node(1)), I came to think this was the idiomatic use case of the API.

However, it did feel strange to me to keep creating new nodes simply for retrieval operations, when all that was ever required for such retrievals was the node ID. And I had this feeling even before finding out about the subtleties of how nodes are handled in edges.

From where I stand, the API would signal much clearer intentions if node IDs were used throughout, and node instances (i.e. graph.Node) only when explicitly required. That is, node IDs would be used for retrieval operations, and node instances for creation operations, namely node additions to the graph and edge creation. I still haven't made up my mind if the API should prevent misuse by making it impossible, or simply document known gotchas.

One way to enforce correct behaviour is to check when new edges are added (through g.SetEdge) that their To and From nodes are not distinct graph.Node instances just happening to share an ID with an already existing node, but rather the same node. This would require that graph.Nodes are of Go comparable types.
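A sketch of such a check is given below, with assumptions stated up front: the graph type is hypothetical, SetEdge returns an error instead of panicking purely to keep the example short, and "the same node" is approximated by Go's == on interface values. That comparison panics at run time if both values share a dynamic type that is not comparable, which is why the comparable-types requirement matters.

```go
package main

import "fmt"

type Node interface{ ID() int }

type intNode int

func (n intNode) ID() int { return int(n) }

type node struct {
	id int
	v  string
}

func (n node) ID() int { return n.id }

type edge struct{ F, T Node }

// graph is a hypothetical graph that validates edge end points.
type graph struct {
	nodes map[int]Node
	edges map[[2]int]edge
}

func newGraph() *graph {
	return &graph{nodes: map[int]Node{}, edges: map[[2]int]edge{}}
}

func (g *graph) AddNode(n Node) { g.nodes[n.ID()] = n }

// SetEdge rejects an edge whose end points merely share an ID with a
// stored node. Node values must be of comparable types for != to work.
func (g *graph) SetEdge(e edge) error {
	for _, n := range []Node{e.F, e.T} {
		held, ok := g.nodes[n.ID()]
		if !ok {
			return fmt.Errorf("node %d is not in the graph", n.ID())
		}
		if held != n {
			return fmt.Errorf("node %d is not the node held by the graph", n.ID())
		}
	}
	g.edges[[2]int{e.F.ID(), e.T.ID()}] = e
	return nil
}

func main() {
	g := newGraph()
	g.AddNode(node{id: 1, v: "foo"})
	g.AddNode(node{id: 2, v: "bar"})

	// Same IDs, different node values: rejected.
	fmt.Println(g.SetEdge(edge{F: intNode(1), T: intNode(2)}))

	// End points taken from the graph itself: accepted.
	fmt.Println(g.SetEdge(edge{F: g.nodes[1], T: g.nodes[2]}))
}
```

For value types such as node, == compares field by field, so an equal copy passes the check; pointer-typed nodes would be compared by identity.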

To sum up, my gut feeling when using the API is that it would be much clearer (as in clearer intentions, not better readability) if node IDs were used wherever possible, and graph.Nodes only when required. This would make it clear that if the API requires a graph.Node from you, it should be the graph node of a given ID, not a graph node of said ID. Similarly, when the API only requires information about a node ID, the user would not have to create dummy node instances on the fly simply to make use of retrieval operations.

Well, food for thought.

I hope we can continue to discuss the pros and cons of the different approaches, and that we eventually end up with an even better graph API, which is consistent and reliable.

Good night guys,

Cheers /u

The other part of this issue is the behaviour on returning edges from undirected graphs. It would be very nice if the graph could return the edge in the order From -> To when asked for edges From(u). This allows the client to immediately know which node is being traversed to from the edge without comparison to the known starting node.
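One way this could work is sketched below, assuming a hypothetical undirected graph that stores each edge once per end point and flips a copy on the way out. The EdgesFrom name and the ID-only edge type are inventions for the example.

```go
package main

import (
	"fmt"
	"sort"
)

type edge struct{ F, T int }

// undirected is a hypothetical undirected graph. Each edge is stored in
// its original orientation under both of its end points.
type undirected struct {
	adj map[int]map[int]edge
}

func newUndirected() *undirected {
	return &undirected{adj: map[int]map[int]edge{}}
}

func (g *undirected) SetEdge(e edge) {
	for _, id := range []int{e.F, e.T} {
		if g.adj[id] == nil {
			g.adj[id] = map[int]edge{}
		}
	}
	g.adj[e.F][e.T] = e
	g.adj[e.T][e.F] = e
}

// EdgesFrom returns the edges incident to u, each oriented so that
// F == u. The loop variable is a copy, so stored edges are not mutated.
func (g *undirected) EdgesFrom(u int) []edge {
	var out []edge
	for _, e := range g.adj[u] {
		if e.F != u {
			e.F, e.T = e.T, e.F
		}
		out = append(out, e)
	}
	// Sort by far end point so the result order is deterministic.
	sort.Slice(out, func(i, j int) bool { return out[i].T < out[j].T })
	return out
}

func main() {
	g := newUndirected()
	g.SetEdge(edge{F: 1, T: 2})
	g.SetEdge(edge{F: 3, T: 1}) // stored as 3-1, returned as 1-3 when asked from 1

	fmt.Println(g.EdgesFrom(1)) // [{1 2} {1 3}]
}
```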

The disadvantage with using a method like Node(int) graph.Node to get the node and add it becomes clear when you see the common use of setting edges without knowing if the nodes exist. Currently a client can set an edge without knowing whether the nodes exist. This is a very nice feature, so any solution would hopefully retain this.

@mewmew Do you have any suggestions for this issue? I would like to get this clarified for API freeze which I guess is coming.

commented

@mewmew Do you have any suggestions for this issue? I would like to get this clarified for API freeze which I guess is coming.

@kortschak Hej Dan! I would like to investigate what the API may look like if we used node IDs for retrieval and graph.Nodes for creation, in an attempt to mitigate the original issue of this thread with subtle node retrieval.

I know it is a rather invasive change to the API, and would require a substantial update of all user code bases. Hopefully, it should be possible to automate this rewrite using gorename and gofmt -r.

The advantage of using node IDs throughout for retrieval is that it prevents the subtle issue of node retrieval by design.

The main issue with this approach seems to be the use case of implicit addition of nodes when edges are created between two nodes, for which one or two of the given node IDs do not yet exist in the graph. (Another issue may be with readability.)

Quick summary on this topic included below:

The issue I have is that for example g.Has(u) is nicer than g.Has(u.ID()), but at the same time g.Has(u.ID()) is more natural than g.Has(simple.Node{n}) if g holds nodes of non-simple.Node type.

Agreed.

Not being able to add nodes through SetEdge is probably not a big drawback. The graph package itself uses it only in spanning_tree.go, but it can be easily avoided because in both Prim and Kruskal all nodes from g will end up in dst. dstarlite and the generators add all necessary nodes in advance. On the other hand, tests seem to be relying on it extensively.

The testing code is what I would consider idiomatic usage for client code. So that corpus carries some weight, but I agree, this can be held off by using a helper in the tests that does setEdgeIn(g Graph, u, v graph.Node, w float64) which does the node addition and uses a simple.Edge.

I agree with both @vladimir-ch and @kortschak that using g.Has(simple.Node{n}) if g holds nodes of non-simple.Node type seems like a misuse of the API. Since the test code should be considered idiomatic for client code (this is how I learnt how to use the API, and ran into exactly the subtle issue highlighted in this thread), it should not teach users this misuse of the API.

If it is true that we may introduce helper functions in a few dedicated places (such as the test code) and then switch the API to using node IDs for retrieval throughout and graph.Nodes for creation, I would prefer this approach.

Could we try to create a node ID branch and just try out what the API may look like if we decided to update the API before the freeze? I would be happy to try and flesh out an initial implementation of this, if there is consensus that the effort to test such an API may be worthwhile.

Any input would be much appreciated. I look forward to having a stable graph representation in Go, and I would love for it to be as clean and solid as possible, and to remove potential misuse by design.

Cheers /u

I'm very happy for you to try out an alternative API in a branch; this is how the current API was designed.

commented

I'm very happy for you to try out an alternative API in a branch; this is how the current API was designed.

Glad to hear.

I have started this work in the nodeID branch at https://github.com/mewpull/graph/tree/nodeID

As outlined in this issue, the intention of the nodeID branch is to investigate what the graph API would look like if we were to use node IDs for retrieval operations and graph.Nodes for creation in graphs.

Note, commit mewpull@c7ef4c5 does not yet compile, but is in an intermediate state. The intention is to continue working on it within the next few weeks, and to open up a discussion of the API, specifically regarding whether we wish to introduce a dedicated graph.NodeID type or simply use int64 for node IDs. Both approaches have benefits and drawbacks. Let's discuss them in this issue.

Main changes to the API highlighted below.

// Node is a graph node.
type Node interface {
	// ID returns a graph-unique integer ID of the graph node.
	ID() NodeID
}

// NodeID is a graph-unique integer ID of a graph node.
type NodeID int

// Graph is a generalized graph.
type Graph interface {
	// Has reports whether the node exists within the graph.
	Has(id NodeID) bool

	// Nodes returns all the nodes in the graph.
	Nodes() []Node

	// From returns all nodes that can be reached directly
	// from the given node.
	From(id NodeID) []Node

	// HasEdgeBetween reports whether an edge exists between
	// nodes x and y without considering direction.
	HasEdgeBetween(x, y NodeID) bool

	// Edge returns the edge from u to v if such an edge
	// exists and nil otherwise. The node v must be directly
	// reachable from u as defined by the From method.
	Edge(u, v NodeID) Edge
}

// Undirected is an undirected graph.
type Undirected interface {
	Graph

	// EdgeBetween returns the edge between nodes x and y.
	EdgeBetween(x, y NodeID) Edge
}

// Directed is a directed graph.
type Directed interface {
	Graph

	// HasEdgeFromTo reports whether an edge exists
	// in the graph from u to v.
	HasEdgeFromTo(u, v NodeID) bool

	// To returns all nodes that can reach directly
	// to the given node.
	To(NodeID) []Node
}

// NodeAdder is an interface for adding arbitrary nodes to a graph.
type NodeAdder interface {
	// NewNodeID returns a new unique arbitrary ID.
	NewNodeID() NodeID

	// AddNode adds a node to the graph. AddNode panics if
	// the added node ID matches an existing node ID.
	AddNode(Node)
}

// NodeRemover is an interface for removing nodes from a graph.
type NodeRemover interface {
	// RemoveNode removes a node from the graph, as
	// well as any edges attached to it. If the node
	// is not in the graph it is a no-op.
	RemoveNode(NodeID)
}

The two methods RemoveNode(NodeID) and AddNode(Node) serve as canonical examples of when node IDs are used vs. nodes. The RemoveNode method is a retrieval or access operation, which locates a node and performs an action on it. To locate this node only one piece of information is required, namely the node ID. The AddNode method, on the other hand, is used for node creation, and thus requires full access to the node and its associated data, as is captured by the Node type.
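To make the distinction concrete, here is a minimal sketch of a type satisfying NodeAdder and NodeRemover as listed above. The nodeSet type and its NewNodeID strategy are assumptions made for the example, and edges are left out, so RemoveNode only drops the node.

```go
package main

import "fmt"

// NodeID, Node, NodeAdder and NodeRemover follow the listing above.
type NodeID int

type Node interface{ ID() NodeID }

type NodeAdder interface {
	NewNodeID() NodeID
	AddNode(Node)
}

type NodeRemover interface {
	RemoveNode(NodeID)
}

type node struct {
	id NodeID
	v  string
}

func (n node) ID() NodeID { return n.id }

// nodeSet is a hypothetical node store without edges.
type nodeSet struct {
	nodes map[NodeID]Node
	next  NodeID
}

// Compile-time checks that nodeSet satisfies both interfaces.
var (
	_ NodeAdder   = (*nodeSet)(nil)
	_ NodeRemover = (*nodeSet)(nil)
)

func newNodeSet() *nodeSet { return &nodeSet{nodes: map[NodeID]Node{}} }

// Has is a retrieval operation: the ID alone is enough.
func (s *nodeSet) Has(id NodeID) bool { _, ok := s.nodes[id]; return ok }

// NewNodeID returns an ID that is not currently in use. Calling it twice
// without adding a node in between returns the same ID.
func (s *nodeSet) NewNodeID() NodeID {
	for s.Has(s.next) {
		s.next++
	}
	return s.next
}

// AddNode is a creation operation: it needs the full node value.
func (s *nodeSet) AddNode(n Node) {
	if s.Has(n.ID()) {
		panic(fmt.Sprintf("node ID collision: %d", n.ID()))
	}
	s.nodes[n.ID()] = n
}

// RemoveNode is a no-op if the node is not present.
func (s *nodeSet) RemoveNode(id NodeID) { delete(s.nodes, id) }

func main() {
	s := newNodeSet()
	s.AddNode(node{id: s.NewNodeID(), v: "foo"})
	s.AddNode(node{id: s.NewNodeID(), v: "bar"})
	s.RemoveNode(0)
	fmt.Println(s.Has(0), s.Has(1)) // false true
}
```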

I would like to open up for a discussion on if this is a good distinction of when to use node IDs vs. nodes, and if the API makes sense. I would be more than happy to continue working on the nodeID branch, after getting some initial feedback.

So, what do you think?

Cheers /u

My main question is why it is necessary to have a NodeID type? This limits some things that are currently possible.

commented

My main question is why it is necessary to have a NodeID type?

The intention was to help prevent misuse of the API. The original rationale was something along the lines of the only creation of node IDs should be through graph.NewNodeID, since that is the only way to ensure unique IDs. Giving the node ID a dedicated type was intended to limit the risk of non-unique IDs being created.

Thinking more carefully about it, I would agree that it may be over cautious and also not the right approach.

I'll update the API to use int64 instead of NodeID throughout. I still think the node ID should be int64 rather than int as that would make it possible to have more than 2^32 nodes, a use-case that may not be common but definitely something we should support.

int -> int64 maybe seems reasonable. It is conceivable (though not implemented) that people using 32 bit architectures will use an out of memory backing store that has space for more than 1<<31 elements.

The main issue with a new named type for node ID is that at some stage I suspect that edges will get IDs (I think this may be the only sensible way to handle multiple edge mutation). This brings with it the possibility that edges in one graph can be nodes in another. This kind of edge annotation is used elsewhere and allows very rich data modelling.

commented

The main issue with a new named type for node ID is that at some stage I suspect that edges will get IDs (I think this may be the only sensible way to handle multiple edge mutation). This brings with it the possibility that edges in one graph can be nodes in another. This kind of edge annotation is used elsewhere and allows very rich data modelling.

Yes, while playing with the new API for node IDs, it also appeared natural to me that an edge ID would make sense at some point, as that would unify edge retrieval with node retrieval operations.

commented

Now the NodeID has been updated to int64 in mewpull@7dbf5e3

@mewmew Please try to port this work over to gonum/gonum where I intend to continue graph development.

I've just had a look at some of the changes here and, from the brief survey, I think you are doing some unsafe things with integer conversion. We use intsets.Sparse for a lot of things, which takes int. If we are using int64 for IDs we cannot use that, and we cannot just int(id) since that gives us silent collisions. We can fork intsets, but that gives us an additional maintenance burden that I don't really want. The other alternatives are to generate an int64 version of intsets, or revert to map[int64]V for our set handling.
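The collision is easy to demonstrate. In this sketch, int32 stands in for int on a platform where int is 32 bits wide (on 64-bit platforms int(id) happens to be lossless), and int64Set is a hypothetical map-backed set of the map[int64]V kind mentioned above.

```go
package main

import "fmt"

// narrow mimics int(id) on a platform where int is 32 bits wide.
// The conversion silently drops the high 32 bits.
func narrow(id int64) int32 { return int32(id) }

// int64Set is a hypothetical map-backed set that keeps the full ID.
type int64Set map[int64]struct{}

func (s int64Set) Add(id int64)      { s[id] = struct{}{} }
func (s int64Set) Has(id int64) bool { _, ok := s[id]; return ok }

func main() {
	a, b := int64(1), int64(1)<<32+1

	// Two distinct IDs collapse to the same value after narrowing.
	fmt.Println(narrow(a) == narrow(b)) // true

	// Keyed by int64, the set keeps them apart.
	s := int64Set{}
	s.Add(a)
	fmt.Println(s.Has(a), s.Has(b)) // true false
}
```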

commented

The other alternatives are to generate an int64 version of intsets, or revert to map[int64]V for our set handling.

Yes, I did think of these issues when writing the code (sorry for not documenting or pointing them out in comments). The path forwards would definitely be to generate an int64 version of intsets (or use map[int64]V for the time being).

I'll take a look at porting the work over to gonum/gonum and will open up a new PR on that repo. Have a busy couple of weeks ahead, but after these weeks, I should have a lot of time on my hands for hacking on and playing with these things : )

Before you start the port, please open an issue for the changes in gonum/graph. The changes are fairly wide-reaching depending on what is done and the order/priority needs to be thought through carefully.

commented

Before you start the port, please open an issue for the changes in gonum/graph. The changes are fairly wide-reaching depending on what is done and the order/priority needs to be thought through carefully.

Sure. I've created gonum/gonum#31 so that we may discuss this in further detail, and reach a consensus on what the graph API should look like.