statnet / network

Classes for Relational Data

Should accessing non-existent attributes throw an error?

mbojan opened this issue · comments

At this moment one gets vector of NAs silently:

data(emon)
emon$Texas %v% "no.such.attribute"
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

I would prefer an explicit error to a silent failure.

I agree that silent NAs aren't the way to go.

An alternative to throwing an error would be to follow the behavior of other recursive R objects and return NULL.

c(atomic_a = 1, atomic_b = 2)["no_such_name"]
#> <NA> 
#>   NA
list(recursive_a = 1:10, recursive_b = c("a", "b", "c"))$no_such_name
#> NULL
data.frame(recursive_a = 1:3, recursive_b = c("a", "b", "c"))$no_such_name
#> NULL

My humble suggestion would be to consider what idiomatic code that leverages statnet should look like. Here are some potential patterns as food for thought:

suppressPackageStartupMessages(library(network))

data(emon)

do_something_with_attr1 <- function(x, attr_name, action = identity) {
  if (attr_name %in% list.vertex.attributes(x)) {
    action(x %v% attr_name)
  } else {
    stop("`attr_name` doesn't exist.")
  }
}

do_something_with_attr2 <- function(x, attr_name, action = identity) {
  if (attr_name %in% list.vertex.attributes(x)) {
    action(x %v% attr_name)
  } else {
    NULL
  }
}

do_somthing_else <- function() "did something else"



tryCatch(
  do_something_with_attr1(emon$Texas, "fake_attr"),
  error = function(e) do_somthing_else()
)
#> [1] "did something else"

if (is.null(result <- do_something_with_attr2(emon$Texas, "fake_attr"))) {
  do_somthing_else()
} else {
  result
}
#> [1] "did something else"


`%||%` <- function(lhs, rhs) if (length(lhs)) lhs else rhs # somewhat common idiom

do_something_with_attr2(emon$Texas, "fake_attr") %||% do_somthing_else()
#> [1] "did something else"

More variants from around CRAN:

data.frame(x=1:5)$y
#> NULL
tibble::tibble(x=1:5)$y
#> Warning: Unknown or uninitialised column: `y`.
#> NULL
dplyr::pull(data.frame(x=1:5), "y")
#> Error: Can't extract columns that don't exist.
#> x Column `y` doesn't exist.

I guess we could have get.*.attribute(net, "no.such.attribute") return NULL with a warning.

Perhaps it would make sense to have sugar functions has_[vertex/edge/network]_attribute(net, "attrname") to return FALSE or TRUE if attribute "attrname" is in fact defined in net.
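A minimal sketch of what such helpers could look like, assuming they do nothing more than wrap the existing list.*.attributes() functions. The names has_vertex_attribute(), has_edge_attribute() and has_network_attribute() are hypothetical and not part of network:

```r
library(network)

# Hypothetical helpers (not part of 'network'): each one only checks whether
# `attrname` appears in the output of the corresponding list.*.attributes().
has_vertex_attribute <- function(net, attrname) {
  attrname %in% list.vertex.attributes(net)
}
has_edge_attribute <- function(net, attrname) {
  attrname %in% list.edge.attributes(net)
}
has_network_attribute <- function(net, attrname) {
  attrname %in% list.network.attributes(net)
}

# Small example network with one vertex attribute set on all vertices.
net <- network.initialize(3)
net %v% "zoo" <- c(1, 2, 3)

has_vertex_attribute(net, "zoo")
has_vertex_attribute(net, "no.such.attribute")
has_network_attribute(net, "directed")
```

Because the helpers return a single logical, they can be used directly in if() conditions before calling %v% or get.vertex.attribute().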

As a side note I think this should throw an error instead of a vacuous mixing matrix:

data(emon)
network::mixingmatrix(emon$MtSi, "no.such.attribute")
#>        To
#> From    Total
#>   Total     0

Thanks @knapply , something to think about along the way.

Core behaviors should only be changed if there is an extremely compelling reason to do so - such changes break code, and when you are talking about behaviors that have been in place for many, many years, that's a lot of code. I am not seeing any compelling arguments here.

As a user who doesn't expect this behavior, and would at least like an option to change it, would you be opposed to leaving the defaults in place and adding an argument?

@mbojan what is a "sugar" function?

@martinamorris so long as we design 'em in a way that doesn't cause problems down the line or introduce performance penalties, I'm not inherently opposed to the idea of adding some kind of "strict" option that turns on checking. (We actually have something like that with edge additions. It's too expensive to check compliance w/nominal network properties by default, but it can be enforced if one wants to do so.) It wouldn't be possible to do that with the shortcut operators, but could be done with the get.* commands.
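One way the opt-in "strict" idea could look, sketched as a wrapper rather than as a change to get.vertex.attribute() itself. The function name get_vertex_attribute_checked() and its strict argument are assumptions made for illustration only:

```r
library(network)

# Hypothetical wrapper (not part of 'network'): the default keeps the
# long-standing behaviour; strict = TRUE adds an explicit presence check.
get_vertex_attribute_checked <- function(x, attrname, strict = FALSE) {
  if (strict && !(attrname %in% list.vertex.attributes(x))) {
    stop("vertex attribute '", attrname, "' is not set on any vertex")
  }
  get.vertex.attribute(x, attrname)
}

net <- network.initialize(3)
net %v% "zoo" <- c(1, 2, 3)

# Existing attribute: returned as usual.
get_vertex_attribute_checked(net, "zoo")

# Missing attribute with strict = TRUE: the error is caught and its
# message kept so the example can run to completion.
res <- tryCatch(
  get_vertex_attribute_checked(net, "no.such.attribute", strict = TRUE),
  error = function(e) conditionMessage(e)
)
res
```

With strict = FALSE the extra check is skipped entirely, so callers that do not opt in pay no additional cost.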

The thing, however, is that every change to those core functions adds more opportunities for things to go wrong, for CRAN to decide to become retroactively angry about something, etc. Especially for packages like network, the goal should be to keep the package essentially frozen as much as possible. Developers should be able to count on the idea that it rarely ever changes, and code they write today using that package should still work flawlessly in 10-20 years. (Stuff that needs to be more fluid should then go in packages that sit on top of the core libraries - users can then decide how much they want to make use of whatever the currently popular schemes are, versus building directly on the more foundational code.) Adding complexity tends to undermine that goal. Sometimes it is necessary, but it should be done sparingly...and reluctantly. ;-)

@mbojan what is a "sugar" function?

A non-essential addition that spares the user a line or two of an often-needed computation, e.g. attrname %in% list.vertex.attributes(net) in this case.

I see @CarterButts 's point here. It is a bit of a game of Jenga -- change one thing and everything might collapse... Still, I think it is worthwhile to make changes so that there is more unequivocal feedback to the user in cases like the OP. I ran into this problem when writing tests for mixingmatrix() -- made a typo in an attribute name and could not figure out why I was getting crazy results, but no errors...

Perhaps we could leave the value returned as is and add a warning if the attribute is not found? Needs some testing though.

Seems like there was a warning if a user tried accessing a non-existent attribute, but it is commented-out:

#if(!(attrname %in% list.vertex.attributes(x)))

Quick investigation shows:

13974b93 (Skye Bender-deMoll 2013-04-24 03:10:32 +0000 1106)   #if(!(attrname %in% list.vertex.attributes(x))) 
13974b93 (Skye Bender-deMoll 2013-04-24 03:10:32 +0000 1107)   #  warning(paste('attribute', attrname,'is not specified for these vertices'))

@skyebend , do you remember why this was commented-out?

My guess is that when we tried implementing the warning behavior we discovered that something else was depending on the 'silently returning NAs behavior' and it started throwing way too many errors. Worth trying again tho.

Running the 'ergm' tests against a version of 'network' with the vertex attribute presence check uncommented does indeed give some warnings. All of them are about vertex.names not being present in the object (which BTW seems not to be enforced, contrary to the JoSS paper, as @knapply noticed in #43). I will identify the places in the 'ergm' code where it actually happens. I still think this is a worthwhile modification (along with an analogous one for edge and network attributes).

Test log
	==> devtools::test()

	Loading ergm
	Loading required package: network
	network: Classes for Relational Data
	Version 1.17.0-585 created on 2020-10-08.
	copyright (c) 2005, Carter T. Butts, University of California-Irvine
			    Mark S. Handcock, University of California -- Los Angeles
			    David R. Hunter, Penn State University
			    Martina Morris, University of Washington
			    Skye Bender-deMoll, University of Washington
	 For citation information, type citation("network").
	 Type help("network-package") to get started.


	ergm: version 3.11.0-5793, created on 2020-10-08
	Copyright (c) 2020, Mark S. Handcock, University of California -- Los Angeles
			    David R. Hunter, Penn State University
			    Carter T. Butts, University of California -- Irvine
			    Steven M. Goodreau, University of Washington
			    Pavel N. Krivitsky, UNSW Sydney
			    Martina Morris, University of Washington
			    with contributions from
			    Li Wang
			    Kirk Li, University of Washington
			    Skye Bender-deMoll, University of Washington
			    Chad Klumb
			    Michał Bojanowski, Kozminski University
			    Ben Bolker
	Based on "statnet" project software (statnet.org).
	For license and citation information see statnet.org/attribution
	or type citation("ergm").

	NOTE: Versions before 3.6.1 had a bug in the implementation of the bd()
	constraint which distorted the sampled distribution somewhat. In
	addition, Sampson's Monks datasets had mislabeled vertices. See the
	NEWS and the documentation for more details.

	NOTE: Some common term arguments pertaining to vertex attribute and
	level selection have changed in 3.10.0. See terms help for more
	details. Use ‘options(ergm.term=list(version="3.9.4"))’ to use old
	behavior.

	Testing ergm
	✓ |  OK F W S | Context
	✓ |   4   2   | bd [4.2 s]
	──────────────────────────────────────────────────────────────────────────────────────────────
	test-bd.R:27: warning: Bounded degree (bd()) maximum constraint for undirected networks
	Vector(s)  do not vary but equal mu0; they have been ignored for the purposes of testing.

	test-bd.R:36: warning: Bounded degree (bd()) constraints for directed networks
	Vector(s)  do not vary but equal mu0; they have been ignored for the purposes of testing.
	──────────────────────────────────────────────────────────────────────────────────────────────
	✓ |  42       | bipartite-missing-data [1.7 s]
	✓ |   1       | c-ergm_model [0.2 s]
	✓ |   1       | checkpointing [20.9 s]
	✓ |   8       | ergm-godfather [0.3 s]
	✓ |  40   4   | ergm-san [16.9 s]
	──────────────────────────────────────────────────────────────────────────────────────────────
	test-ergm-san.R:70: warning: san.ergm does not default to offsets in the ergm
	'san.ergm' is deprecated.
	See help("Deprecated")

	test-ergm-san.R:96: warning: SAN works with curved terms
	Model statistics 'offset(degree3)' are linear combinations of some set of preceding statistics at the current stage of the estimation. This may indicate that the model is nonidentifiable.

	test-ergm-san.R:102: warning: SAN works with curved terms
	Model statistics 'offset(degree3)' are linear combinations of some set of preceding statistics at the current stage of the estimation. This may indicate that the model is nonidentifiable.

	test-ergm-san.R:102: warning: SAN works with curved terms
	Model statistics 'offset(degree3)' and 'gwesp' are linear combinations of some set of preceding statistics at the current stage of the estimation. This may indicate that the model is nonidentifiable.
	──────────────────────────────────────────────────────────────────────────────────────────────
	[1] 350  50 250
	Structural check:
	Mean degree: 1.4 .
	Average degree among nodes with degree 2 or higher: 2.25 .
	✓ |   2     1 | mple-target [0.2 s]
	──────────────────────────────────────────────────────────────────────────────────────────────
	test-mple-target.R:30: skip: simulating from the MPLE target statistics fit
	Reason: empty test
	──────────────────────────────────────────────────────────────────────────────────────────────
	✓ |   7       | nonident-test [5.6 s]
	⠏ |   0       | nonunique-namesSample statistics summary:

	Iterations = 1:1024
	Thinning interval = 1 
	Number of chains = 1 
	Sample size per chain = 1024 

	1. Empirical mean and standard deviation for each variable,
	   plus standard error of the mean:

		   Mean    SD Naive SE Time-series SE
	edgecov.a 3.190 2.282  0.07130         0.9579
	edgecov.a 1.024 1.762  0.05505         0.6171

	2. Quantiles for each variable:

		  2.5% 25% 50% 75% 97.5%
	edgecov.a   -2   2   4   5     6
	edgecov.a   -3   0   1   2     4


	Sample statistics cross-correlations:
		  edgecov.a edgecov.a
	edgecov.a 1.0000000 0.8544418
	edgecov.a 0.8544418 1.0000000

	Sample statistics auto-correlation:
	Chain 1 
	      edgecov.a edgecov.a
	Lag 0 1.0000000 1.0000000
	Lag 1 0.9889689 0.9841949
	Lag 2 0.9786889 0.9693349
	Lag 3 0.9685966 0.9551049
	Lag 4 0.9590676 0.9424500
	Lag 5 0.9495386 0.9291651

	Sample statistics burn-in diagnostic (Geweke):
	Chain 1 

	Fraction in 1st window = 0.1
	Fraction in 2nd window = 0.5 

	edgecov.a edgecov.a 
	   -1.250    -1.491 

	Individual P-values (lower = worse):
	edgecov.a edgecov.a 
	0.2113476 0.1359306 
	Joint P-value (lower = worse):  0.8684058 .

	MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
	✓ |   0       | nonunique-names [1.0 s]
	x |  25 1 8   | predict.ergm [1.2 s]
	──────────────────────────────────────────────────────────────────────────────────────────────
	test-predict.ergm.R:176: warning: it works for offsets and non-finite offset coefs
	attribute vertex.names is not specified for these vertices

	test-predict.ergm.R:176: warning: it works for offsets and non-finite offset coefs
	attribute vertex.names is not specified for these vertices

	test-predict.ergm.R:176: warning: it works for offsets and non-finite offset coefs
	attribute vertex.names is not specified for these vertices

	test-predict.ergm.R:176: warning: it works for offsets and non-finite offset coefs
	attribute vertex.names is not specified for these vertices

	test-predict.ergm.R:176: warning: it works for offsets and non-finite offset coefs
	attribute vertex.names is not specified for these vertices

	test-predict.ergm.R:176: warning: it works for offsets and non-finite offset coefs
	attribute vertex.names is not specified for these vertices

	test-predict.ergm.R:176: warning: it works for offsets and non-finite offset coefs
	attribute vertex.names is not specified for these vertices

	test-predict.ergm.R:176: warning: it works for offsets and non-finite offset coefs
	attribute vertex.names is not specified for these vertices

	test-predict.ergm.R:183: failure: it works for offsets and non-finite offset coefs
	`p <- predict(fit)` produced warnings.
	──────────────────────────────────────────────────────────────────────────────────────────────
	✓ |   2       | term-b12factor
	✓ |   3       | term-edgecov
	✓ |  11   28   | term-mm [0.7 s]
	──────────────────────────────────────────────────────────────────────────────────────────────
	test-term-mm.R:62: warning: Undirected mm() summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:62: warning: Undirected mm() summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:62: warning: Undirected mm() summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:62: warning: Undirected mm() summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:91: warning: Undirected mm() summary with level2 filter
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:91: warning: Undirected mm() summary with level2 filter
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:91: warning: Undirected mm() summary with level2 filter
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:91: warning: Undirected mm() summary with level2 filter
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:102: warning: Undirected mm() marginal summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:102: warning: Undirected mm() marginal summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:102: warning: Undirected mm() marginal summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:102: warning: Undirected mm() marginal summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:107: warning: Undirected mm() marginal summary with fixed levels set
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:107: warning: Undirected mm() marginal summary with fixed levels set
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:107: warning: Undirected mm() marginal summary with fixed levels set
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:107: warning: Undirected mm() marginal summary with fixed levels set
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:112: warning: Undirected mm() summary with fixed levels set
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:112: warning: Undirected mm() summary with fixed levels set
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:112: warning: Undirected mm() summary with fixed levels set
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:112: warning: Undirected mm() summary with fixed levels set
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:117: warning: Undirected valued mm() sum summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:117: warning: Undirected valued mm() sum summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:117: warning: Undirected valued mm() sum summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:117: warning: Undirected valued mm() sum summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:124: warning: Undirected valued mm() nonzero summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:124: warning: Undirected valued mm() nonzero summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:124: warning: Undirected valued mm() nonzero summary
	attribute vertex.names is not specified for these vertices

	test-term-mm.R:124: warning: Undirected valued mm() nonzero summary
	attribute vertex.names is not specified for these vertices
	──────────────────────────────────────────────────────────────────────────────────────────────
	✓ |   8     1 | term-options [7.3 s]
	──────────────────────────────────────────────────────────────────────────────────────────────
	test-term-options.R:31: skip: gof() of a model that had term options
	Reason: empty test
	──────────────────────────────────────────────────────────────────────────────────────────────
	✓ |  14       | valued-terms [10.2 s]

	══ Results ═══════════════════════════════════════════════════════════════════════════════════
	Duration: 71.8 s

	OK:       168
	Failed:   1
	Warnings: 42
	Skipped:  2
Session log
> sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.3 (2020-10-10)
 os       Ubuntu 18.04.5 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language en_US
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Warsaw
 date     2020-10-11

─ Packages ─────────────────────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
 lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
 network       1.16.1  2020-10-07 [1] CRAN (R 4.0.2)
 pillar        1.4.6   2020-07-10 [1] CRAN (R 4.0.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 rlang         0.4.8   2020-10-08 [1] CRAN (R 4.0.2)
 rstudioapi    0.11    2020-02-07 [1] CRAN (R 4.0.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
 tibble        3.0.3   2020-07-10 [1] CRAN (R 4.0.2)
 vctrs         0.3.4   2020-08-29 [1] CRAN (R 4.0.2)
 withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)

[1] /home/mbojan/R/library/4.0
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library


Updated network.vertex.names(); its behavior is the same as before. All related warnings in the 'ergm' tests went away.

Close as soon as #48 is merged.

@krivit wrote:

@mbojan , that's not quite what I meant. The implementation of has.*.attribute() functions merely calls the corresponding list.*.attributes(), which are all generics. This means that if the list.*.attributes.<CLASS>() is implemented, then has.*.attribute() should work fine for <CLASS>. Thus, it makes sense to call the implementation of has.*.attribute.network() has.*.attribute.default().

Ah, OK. Perhaps it will be cleaner still to have, say, has.vertex.attribute.network() call the S3 method list.vertex.attributes.network() directly so that the S3 dispatch is done only once and in the beginning?
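The dispatch pattern under discussion can be sketched in base R with toy stand-ins. The generics list_vertex_attributes() and has_vertex_attribute() and the class "toynet" below are hypothetical and are not the network API; the point is only that a default method built on the listing generic works for any class that implements that generic:

```r
# Toy stand-ins (hypothetical names, not the 'network' API).
list_vertex_attributes <- function(x, ...) UseMethod("list_vertex_attributes")
has_vertex_attribute <- function(x, attrname, ...) UseMethod("has_vertex_attribute")

# Default method: relies only on the list_vertex_attributes() generic, so any
# class that implements that generic gets has_vertex_attribute() for free.
has_vertex_attribute.default <- function(x, attrname, ...) {
  attrname %in% list_vertex_attributes(x)
}

# A toy class that stores one named attribute list per vertex.
list_vertex_attributes.toynet <- function(x, ...) {
  unique(unlist(lapply(x$val, names)))
}

tn <- structure(
  list(val = list(list(na = FALSE), list(na = FALSE, zoo = 3))),
  class = "toynet"
)

has_vertex_attribute(tn, "zoo")
has_vertex_attribute(tn, "no.such.attribute")
```

In this layout there are two dispatches per call (one for each generic). Calling list_vertex_attributes.toynet() directly from a class-specific method would save one of them, at the price of writing a separate has_* method for every class.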

@mbojan i've just pulled the new version, and it is generating some unexpected messages. see https://github.com/statnet/WHAMP2/issues/28#issuecomment-707379121

Hmm, interesting. I ran all the ergm tests and they were silent. In ergm there is

$ git grep -n get.edge.attribute
R/InitErgmTerm.R:2401:    if( is.null(a$attrname) || is.null(get.edge.attribute(a$x,a$attrname))){

which could probably just use the freshly baked has.edge.attribute() and the warning should go away... I will also make the warning text more informative.

@martinamorris , can you please try re-running with options(warn=3) and then post the result of running traceback() after the error?
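A base-R sketch of why promoting warnings to errors helps locate their source. It uses options(warn = 2), the documented threshold at which warnings are turned into errors, and a stand-in function, noisy_lookup(), which is hypothetical and not the actual network code:

```r
# Stand-in for a function that warns (hypothetical; not the 'network' code).
noisy_lookup <- function(attrname) {
  warning("attribute ", attrname, " is not specified for these vertices")
  NA
}

old <- options(warn = 2)  # from here on, warnings are raised as errors
res <- tryCatch(
  noisy_lookup("vertex.names"),
  error = function(e) conditionMessage(e)
)
options(old)              # restore the previous setting

res
```

In an interactive session, without the tryCatch(), execution stops at the first warning and traceback() then shows the call stack that led to it.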

@martinamorris I believe I identified all the places where the warnings are triggered. I'm looking into ironing them out. The warnings should be harmless, although annoying, because the values returned by get.[vertex/edge/network].attribute() have not been changed.

my use case was indeed an ergm.ego fit.

Did adding the additional check to see if attributes exist when querying add much time overhead in situations where we do lots of reads?

Not sure if this is related to @skyebend's comment, but with the last commit, my gof ~ model command took so long i cancelled it. It was putting out all of the incorrect warnings noted above, which may have been slowing it down. but it usually took a minute to run this command, and this time I cut it off after 10 min.

Did adding the additional check to see if attributes exist when querying add much time overhead in situations where we do lots of reads?

Thanks @skyebend , may be. I have to check. I'd be surprised though.

@martinamorris , it is also possible the warnings buffer is getting a beating. Until I solve and test this, and unless you need as.data.frame.network(), I think it may make sense for you not to use network from master (perhaps my PR was too soon), but from the v1.16.1 branch Carter created for the patch release. Basically remotes::install_github("statnet/network", ref="v1.16.1").

If all of this turns out (even more) problematic I will revert the merge that introduced the warning and keep on tinkering on a branch off master.

Joining this conversation just now, after spending some time testing the new code with ergm 3.11 RC. My observations:

  • I think that, in the end, warning about nonexistent attributes is a poor compromise. Even though it doesn't "break" code in the form of causing errors, it creates enough spurious warnings that a lot of code downstream has to be modified, which is functionally the same (especially since some tests for CRAN rely on having a specific set of warnings).
  • For most applications, an attribute not existing and attribute existing but being set to NULL are equivalent conditions. The most common pattern for accessing a network attribute is probably something like the following:
    a <- x%n%"a"
    if(is.null(a)) <handle missing attribute>
    else <use a>
    Making the user either check in advance or suppress warnings makes things more onerous in most cases.
  • has.network.attribute(x,a) is far more verbose than !is.null(x%n%a). If we want to encourage people to check, it shouldn't be, so perhaps we should have an operation such as x%has_n%a (with the underscore possibly being a dot or just nothing) as a shortcut for it.
  • IMO, returning an NA vector for nonexistent vertex attributes, as opposed to NULL, was a mistake, one which may be too late to fix. That said, precisely because this is a "weird" behaviour, I suspect that nobody actually relied on it downstream, so perhaps we should change it after doing a lot of reverse-dependency checking and make nonexistent attributes consistently return NULL. (I am completely certain that this change would require fewer total changes to downstream code than the warnings change.)

Ah, OK. Perhaps it will be cleaner still to have, say, has.vertex.attribute.network() call the S3 method list.vertex.attributes.network() directly so that the S3 dispatch is done only once and in the beginning?

I don't think those dispatches slow things down that much in practice, though there's only one way to find out (benchmarking).

The new warnings may also have broken EpiModel (EpiModel/EpiModel#456 (comment))

Also, they appear to break vignette building for ergm 3.11 RC.

@krivit @chad-klumb and others, let's git revert the changes introduced by the commits merged with f457d80 (please no force-pushing!). I'll work off the main line. I can do it tomorrow afternoon (CET) unless somebody beats me to it.

These #41 (comment) are all good points. I think we will be better-off in the long term.

@krivit Re: returning NAs instead of NULLs, here's one reason we do it that way:

g<-emon[[1]]
set.vertex.attribute(g,attrname="zoo",value=3,v=3)
g%v%"zoo"
[1] NA NA 3 NA NA NA NA NA NA NA NA NA NA NA

The thing that must be remembered is that vertex attributes do not "belong" to the network...they belong to the specific vertices in question. In the above example, vertex 3 has the attribute "zoo," while the other vertices do not. By design, every vertex has its own vector of attributes, and every edge has its own vector of attributes; there is no requirement that different elements of the object contain the same attributes (other than the required ones). So when you write g%v%"zoo", you are really asking for the access method to go ask each vertex if it has the attribute "zoo," and if so to return its value. If the attribute isn't present, it is unobserved and hence missing. NULLs cannot be used in that way, because you can't put NULLs in a standard vector (they are proxies for actual null pointers in the backend). We could deal with those cases by returning lists when one or more attributes are not present and using NULLs for the missing elements, but I don't think anyone wants to do that (at least not for the shortcut methods).
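The per-vertex storage described here can be illustrated with a toy model in base R. This is an assumption made for illustration, not the real network internals; get_attr() is a hypothetical helper:

```r
# Toy model (not the real 'network' internals): each vertex carries its own
# named list of attributes, and the lists need not share the same names.
vertices <- list(
  list(na = FALSE),
  list(na = FALSE),
  list(na = FALSE, zoo = 3)
)

# Ask every vertex for `attrname`; a vertex without it contributes NA.
get_attr <- function(vertices, attrname) {
  vapply(
    vertices,
    function(v) {
      if (is.null(v[[attrname]])) NA_real_ else as.numeric(v[[attrname]])
    },
    numeric(1)
  )
}

get_attr(vertices, "zoo")
get_attr(vertices, "no.such.attribute")
```

In this model an attribute that no vertex holds is simply the limiting case of an attribute held by only some vertices, which is why the accessor returns a vector of NAs rather than NULL.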

This is also related to the question of performance costs on attribute checks. I do think some benchmarks are going to be important here, because it can be expensive to check for variable presence - the cost for vertex attributes is order Na, where N is the number of vertices and a the maximum number of vertex attributes on any given vertex, and for edge attributes the cost is order Mb, where M is the number of edges and b is the maximum number of attributes on any given edge. In sparse graphs, M is order N, so this is not so bad...but it can get expensive in dense graphs. Likewise, both operations are in actual practice pretty cheap if you are doing them only rarely (e.g., when running summary()), but not so cheap that you want to do them very often.

One thing that has been contemplated before is creating an attribute registry, that maintains an index of what attributes are defined and who has them. This has never seemed worthwhile, because it adds a lot of code to solve what has historically been a non-problem (and will add at least N+M to the storage cost of the network object). If we became committed to checking attributes very frequently (especially for edges), I think we'd have to revisit that idea. But I don't think there is at present a strong argument for doing more frequent checks in the first place; if you are actually querying an attribute with get.*.attribute, you are already doing a check (so that's already priced in), and otherwise there aren't many cases where they are needed....so this is quite hypothetical.

I would observe that this thread began with an aesthetic objection to a long-standard behavior that was not AFAIK shown to cause any actual problems, and then led to the expenditure of some non-zero level of effort, culminating in the creation of new problems that did not exist when the thread began. Perhaps it would be more productive to focus attention instead on things that are definitely broken and/or in need of an upgrade?

@CarterButts , thanks for the explanation. Regarding attribute registry, if all we want to track is whether a particular vertex or edge attribute is present, an O(M) representation is to count the number of vertices/edges that have that attribute, being incremented and decremented as they are set and deleted. Then, a look-up is simply checking if a particular usage count is not 0.

@krivit True, if all we want to know is whether something has the attribute - that would be the most minimal way. But it does still require all creation/set methods to respect the registry. It also means that vertex/edge addition/deletion methods must also update the registry. So it could certainly be done, but it would involve a lot of reengineering. (Also, existing network objects would no longer be up to spec.) Seems like something to have in back of mind if we ever get to the point where we really need to be doing a lot of attribute checking (to the point where we're taking a performance hit).
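The usage-count idea can be sketched on a toy structure in base R. Everything below (new_toynet(), set_attr(), delete_attr(), has_attr() and the registry field) is hypothetical and not part of network; it only shows that every set and delete operation has to keep the counts in step:

```r
# Toy model (hypothetical, not 'network' internals): per-vertex attribute
# lists plus a registry of usage counts keyed by attribute name.
new_toynet <- function(n) {
  list(val = replicate(n, list(), simplify = FALSE), registry = integer(0))
}

set_attr <- function(net, v, attrname, value) {
  if (is.null(net$val[[v]][[attrname]])) {
    # First time this vertex gets the attribute: increment the count.
    known <- attrname %in% names(net$registry)
    old <- if (known) net$registry[[attrname]] else 0L
    net$registry[attrname] <- old + 1L
  }
  net$val[[v]][[attrname]] <- value
  net
}

delete_attr <- function(net, v, attrname) {
  if (!is.null(net$val[[v]][[attrname]])) {
    net$val[[v]][[attrname]] <- NULL
    net$registry[attrname] <- net$registry[[attrname]] - 1L
  }
  net
}

# Look-up is a single count check, with no scan over the vertices.
has_attr <- function(net, attrname) {
  attrname %in% names(net$registry) && net$registry[[attrname]] > 0L
}

net <- new_toynet(3)
net <- set_attr(net, 3, "zoo", 3)
has_zoo_after_set <- has_attr(net, "zoo")
net <- delete_attr(net, 3, "zoo")
has_zoo_after_delete <- has_attr(net, "zoo")
c(has_zoo_after_set, has_zoo_after_delete)
```

The sketch leaves out vertex addition and deletion, which would also have to update the registry, as noted above.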

@CarterButts but we did learn a lot about how this all fits together! :-) Thanks @mbojan for trying it out. I apologize that I didn't leave clear enough comments before (or maybe we lost them in migration). If we roll back, lets leave an artifact in the code to remind us.

Thanks @CarterButts . That explains a lot.

I reverted the merge and added a cautionary note to future generations of Statnet developers.

(Above commit messages, apart from the last, are artifacts of me collecting all the changes I've made on a separate branch (https://github.com/mbojan/network/tree/i41-original))

Would we still like to have anything like has.[vertex|edge|network].attribute() and possibly a binary operator version such as %has_v% or similar for testing if an attribute is present for any vertex/edge or the network? You can react to this post with

  • 👍 for "Yes" or
  • 👎 for "No"

I will close this issue after that.

I'll comment rather than vote. This has been a master class in the complexity of simple changes, in this highly interdependent ecosystem.

Yes in principle -- it'd be nice to be warned when you've referenced a non-existent attribute. I don't see that as pandering or nannying. It's just a service.

But, I also agree with Carter's many points about stability of core tools, performance considerations and the importance of a consistent, principled spec at this level. It's also clear that this change has many implications, so would not be as quick to implement as we might have thought.

Increasingly, I'm thinking it might be best to develop utilities/enhancements like this in a separate package (say, network.utilities) that users could adopt, or not, as they chose. That would keep the core development pathway clean, and easier to maintain, and would actually be a better model for incorporating contributions from new folks, without jeopardizing the consistent functionality of the core tools. I don't know if that's possible in practice -- it clearly would be for, say, @knapply's contributions, I'm not sure whether the has.attribute() could be implemented that way.

@mbojan I definitely do like the idea of having methods to query the attribute sets to see if an attribute is defined somewhere. It's a natural thing to want to do, and currently it is done manually where it is needed. We don't need to change current code, but having some query methods in place means that we can integrate them where they make sense. They also provide a future path in case we wind up e.g. adding a registry or some such at a future time, and they provide users with a tool that they may find helpful (without introducing any other costs or dependencies). So that seems like a reasonable and appropriate addition to core functionality. It's worth thinking a bit on the name of the shortcut operator: one wants them to be very terse. Checking to see if something is in a set is usually done with %in%, so riffing off of that could make sense. %inv% seems like "inverse," though, and I get squirmy at putting underscores in things, e.g. %in_v%. (Back in my day, the underscore was an assignment operator in R/S, and I've never fully recovered. :-)) We have to live with whatever decision we make - forever - so it is probably worth contemplation....

@martinamorris I definitely second the approach of building tools/utilities packages that sit on the core libraries. That's what we did with networkDynamic, for instance, and I think it gives us a lot of flexibility to add more problem-specific functionality - and stuff that can evolve more rapidly as users' demands change - while being able to rely on the foundation staying in one place. Something like has.*.attribute seems like it is potentially low-level enough to live in the core (a good heuristic being that if we changed how attributes are handled internally, it would have to change, which is not true for some other enhancements), but some other utilities could easily live "on top" of the base package.

Also, collectively, I apologize for not pointing out earlier the specific detail about attributes being allocated to specific sub-objects, and the implications of that for this set of features. For whatever reason, it didn't dawn on me that this was a key point of confusion until later in the process! (But this is indeed part of why I am so conservative about tinkering with the network package, and why I get concerned when I see changes being made whose implications I cannot easily grok. Even the wise cannot see all ends. I'm not that wise, but I have been burned enough times to know that the ends I cannot see have a way of coming back to haunt us. :-))

%inva% vertex attributes?

I see "inv", and I think "inverse".

While I understand the relation to %in%, I prefer the possessive "has". How about one of:

  1. no "separator" -- vertex: %hasv%, edge: %hase% and network %hasn%
  2. with a dot -- vertex: %has.v%, edge: %has.e% and network %has.n%
  3. with an underscore -- vertex: %has_v%, edge: %has_e% and network %has_n%
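To make the options concrete, here is a minimal sketch of variant 2 (with a dot). These operators are hypothetical and defined locally; they are not part of network, and each one simply wraps the existing list.*.attributes() functions.

```r
library(network)

# Hypothetical operators (not exported by network): the left-hand side is a
# network object, the right-hand side is an attribute name.
`%has.v%` <- function(x, attrname) attrname %in% list.vertex.attributes(x)
`%has.e%` <- function(x, attrname) attrname %in% list.edge.attributes(x)
`%has.n%` <- function(x, attrname) attrname %in% list.network.attributes(x)

data(flo)
flonet <- network(flo)

flonet %has.v% "vertex.names"  # vertex-level query
flonet %has.e% "vertex.names"  # same name, edge-level query
flonet %has.n% "directed"      # network-level query
```

The other two spellings would differ only in the operator names.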

Other ideas?

so, after a bit of noodling around in network, this is what i see:

library(network)
data(flo)
flonet <- network(flo)
list.vertex.attributes(flonet)
#> [1] "na"           "vertex.names"
"vertex.names" %in% list.vertex.attributes(flonet)
#> [1] TRUE

which leaves me with the question, do we need more than this? i can imagine it might be nice to have something shorter, like

"vertex.names" %in% flonet

if it is possible to use the existing binary matching operator this way -- to have it look for a match to any of the attribute types -- vertex, edge or network -- and print out the results, that would be great.

if not, then is it possible to have a single operator like %has% -- again, not requiring the user to specify which type of attribute in the query?

if not, then i'd say we should just drop the binary operator idea, as we already have a function that does this.

I don't think there is a good way of using any existing operator in this way (e.g. base %in% is not generic).

I'm not sure how a single new operator like %has% could work. Which attribute names should it query? What if there is a vertex attribute "a" but I want to check if there is an edge attribute with the same name?

I'm not a big fan of using binary operators in these cases anyway... Still, the has.*.attribute() functions are IMHO worthwhile.

I guess in principle my question boils down to choosing one from:

  1. Add has.[vertex|edge|network].attribute() functions and corresponding binary operators.
  2. Add has.[vertex|edge|network].attribute() functions only.
  3. We are happy with how things are.

I agree the use case, and thus purpose, of this function should be spec'd out -- and then compared to existing functionality.

The use case I had in mind was really minimal, so the purpose of collapsing all of the attribute types (by not referring to them explicitly) was just to make it REALLY EASY to type the query. I had in mind that if the "attrname" queried showed up in any of list.vertex.attributes, list.edge.attributes or list.network.attributes, it would return the list(s) that the "attrname" was found in.
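A rough sketch of that "collapsed" query, assuming a hypothetical helper named attr_levels() (not part of network) that reports which of the three attribute lists contain the name:

```r
library(network)

# Hypothetical helper: return the attribute level(s) ("vertex", "edge",
# "network") at which `attrname` is listed; character(0) if it is nowhere.
attr_levels <- function(x, attrname) {
  found <- c(
    vertex  = attrname %in% list.vertex.attributes(x),
    edge    = attrname %in% list.edge.attributes(x),
    network = attrname %in% list.network.attributes(x)
  )
  names(found)[found]
}

data(flo)
flonet <- network(flo)

attr_levels(flonet, "vertex.names")
attr_levels(flonet, "no.such.attribute")
```

Because the result names the level(s), the same call also covers the case where a vertex attribute and an edge attribute share a name.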

I think exploration of the distribution of NA's at the v & e levels should just be done using existing technology (tm).

@martinamorris @CarterButts thanks for these points. My initial and only use case was to make testing for attribute presence more streamlined than with %in%.

if(has.network.attribute(net, "attrname")) {
  # do stuff
}

Perhaps, extending this a bit, for vertex and edge attributes we might want to check a particular vertex or edge, e.g.:

# test if vertex 1 has attribute "attrname" defined
if(has.vertex.attribute(net, "attrname", v = 1)) {
  # do stuff
}

It is definitely not hard with what is already in the package. My initial point was motivated not so much by this being hard, but rather by the fact that a common and often needed computation (testing for attribute presence) should perhaps be a dedicated function.
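For illustration only, the simplest form of these functions could be thin wrappers around what is already there. The definitions below are a local sketch of the proposal, not an existing network API, and they leave out the per-vertex `v` argument.

```r
library(network)

# Sketch of the proposed helpers, defined locally for illustration.
has.vertex.attribute <- function(x, attrname) {
  attrname %in% list.vertex.attributes(x)
}
has.edge.attribute <- function(x, attrname) {
  attrname %in% list.edge.attributes(x)
}
has.network.attribute <- function(x, attrname) {
  attrname %in% list.network.attributes(x)
}

data(flo)
net <- network(flo)

if (has.network.attribute(net, "directed")) {
  message("directed: ", net %n% "directed")
}

# Guard an access that would otherwise silently return NAs.
if (!has.vertex.attribute(net, "no.such.attribute")) {
  message("vertex attribute not found")
}
```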

Of the kinds of queries, there could be:

  • "any" - i.e., does any vertex/edge have the attribute
  • "all" - i.e., do all of the vertices/edges have the attribute
  • "which" - i.e., return a data structure indicating which vertices or edges have the attribute
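A possible sketch of the three query kinds for vertices, assuming a hypothetical helper vertex_has_attr() (not part of network). It relies on get.vertex.attribute() keeping NULL entries for vertices that lack the attribute when called with null.na = FALSE and unlist = FALSE.

```r
library(network)

# Hypothetical helper: one logical per vertex, TRUE where `attrname` is set.
vertex_has_attr <- function(x, attrname) {
  vals <- get.vertex.attribute(x, attrname, null.na = FALSE, unlist = FALSE)
  !vapply(vals, is.null, logical(1))
}

# Four vertices; "group" is set on vertices 1 and 2 only.
net <- network.initialize(4)
set.vertex.attribute(net, "group", c("a", "b"), v = 1:2)

defined <- vertex_has_attr(net, "group")
any(defined)    # "any": does any vertex have the attribute?
all(defined)    # "all": do all vertices have it?
which(defined)  # "which": ids of the vertices that have it
```

An edge version would follow the same pattern with get.edge.attribute().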

That seems useful, although it's something different from what I had in mind. Indeed, if we have an idea of the context(s) in which one might need such querying, it is worth implementing. I think this comes very close to a more general need of graph indexing/subscripting: accessing vertices/edges by their attributes, accessing vertices by (attributes of) edges they are incident on, accessing edges by (attributes of) vertices they are incident on, etc. Something like V() and E() in igraph.

Alas, we are drifting off the initial topic. Unless opposed, I will close this issue and we can open new ones.