peekxc / Mapper

R package for computing the Mapper construction from Topological Data Analysis

Home Page:http://peekxc.github.io/Mapper/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Ball Cover Functionality

20choudharysa opened this issue · comments

Hi again,

@corybrunson, @20demkowiczbr and I have been using the BallCover and we are having difficulty understanding its functionality. We created a data set containing three points (unit distance apart) and performed three mapper constructions with three different epsilon values (0.5, 1, 1.5). We are unsure as to why with an epsilon value of 1 and 1.5, the simplicial complex obtained is a 0-simplex.

Thanks

library(simplextree)
library(Mapper)

x <- rbind(c(1,1),c(2,1),c(3,1))
f_x <- matrix(x[ ,1])

m <- MapperRef$new(X = x)$
  use_filter(filter= f_x)$
  use_cover(cover = "ball", epsilon = 0.5)$
  use_clustering_algorithm(cl = "single")$
  construct_k_skeleton(k = 1L)
m$simplicial_complex
#> Simplex Tree with (3) (0)-simplices

m <- MapperRef$new(X = x)$
  use_filter(filter= f_x)$
  use_cover(cover = "ball", epsilon = 1)$
  use_clustering_algorithm(cl = "single")$
  construct_k_skeleton(k = 1L)
m$simplicial_complex
#> Simplex Tree with (1) (0)-simplices

m <- MapperRef$new(X = x)$
  use_filter(filter= f_x)$
  use_cover(cover = "ball", epsilon = 1.5)$
  use_clustering_algorithm(cl = "single")$
  construct_k_skeleton(k = 1L)
m$simplicial_complex
#> Simplex Tree with (1) (0)-simplices

Created on 2019-08-06 by the reprex package (v0.3.0)

The BallCover creates a ball of radius epsilon around every point. Then, all balls with a nonempty intersection are unioned. The set of (unioned) balls forms the set of open sets.

So when epsilon = 0.5, you get 3 open sets, since there are no intersections. When epsilon > 1 in your case, there's only one open set, which may or may not result in a single 0-simplex depending on the clustering algorithm.

I can add more documentation on that.

Is there a specific kind of functionality you're looking for? I can always add other covers.

Thank you! That definitely makes more sense.

@corybrunson, could you please chime in about what our original thoughts were?

@20choudharysa thanks for flagging me.

@peekxc thanks for the explanation. If i understand it correctly, the cover sets are then determined (not algorithmically, just mathematically) by single-linkage clustering on the lensed data, using the epsilon parameter. Is that right? This was one interpretation we thought of after generating some results.

Our expectations were different: We expected the set of balls, one for each data point, to be the cover. This would be like performing this construction by Dłotko, except without bothering to find a more practical subset of the original points whose balls still covered the whole data set.

As discussed there, a key feature of the resulting cover is that the sets overlap. As i understand BallCover(), though, the cover consists of non-overlapping sets. If that is indeed the case, then how is it possible for the mapper construction to produce simplices of dimension greater than zero? Also, if you have a favorite example using this cover, we'd be interested to see it! Thanks again.

@peekxc thanks for the explanation. If i understand it correctly, the cover sets are then determined (not algorithmically, just mathematically) by single-linkage clustering on the lensed data, using the epsilon parameter. Is that right? This was one interpretation we thought of after generating some results.

Yep, that seems right, the balls centered at the points in the same connected components of the SL/eps-neighborhood graph/VR-complex/etc. forms the sets

As discussed there, a key feature of the resulting cover is that the sets overlap. As i understand BallCover(), though, the cover consists of non-overlapping sets. If that is indeed the case, then how is it possible for the mapper construction to produce simplices of dimension greater than zero?

I originally hastily made this cover after reading some notes from Carlsson (somewhere...) on Mapper where I guess originally in Mappers design they tried a couple different types of covers, one of which included a ball cover. I had some issues however in implementation, at the time I wasn't sure how to determine where to place the eps-balls while guaranteeing their union forms a valid cover. I ended up just placing a ball at each point, which as you pointed out will only produce -simplices, and yes this seems not too desirable aside from the pure simplification (I haven't done too many experiments). My work focuses on the fixed interval type of covers, so that's where I've spent most of my time.

In retrospect, perhaps a better strategy would be to add an additional parameter which allows for overlap. I'm open to suggestions on this. One idea would be to iteratively choose landmark points (see example usage in the new vignette here) to form eps-balls around, continuing until every point is within eps or at least one landmark, or the user could specify the number of landmarks.

This would be like performing this construction by Dłotko, except without bothering to find a more practical subset of the original points whose balls still covered the whole data set.

Seems fairly straightforward. I'll look into this now.

Oh neat, i didn't know about that part of the origin story.

I very much like the idea of following Dłotko's procedure. Their generation of epsilon nets seems to me to be equivalent to the generation of landmark points, on the assumption that "minimizes" should be "maximizes" in your vignette (line 128 in the .Rmd file)? [Edit: I was ignoring the "argmax" part; i think it's clear what's meant from the equation, though the language seems a bit off.]

Their generation of epsilon nets seems to me to be equivalent to the generation of landmark points

Yeah actually I was just thinking that, trying to determine if they are indeed in the same.

I'm a little confused by what dist is in Algorithm 2 of Dłotko's paper, since C is a set. If it's the minimum from p to any points in the set, then it seems equivalent.

I used the argmax to summarize section 2.3 of this.

That's what i'm trained to think of as the distance from a point to a set. : ) So, yes, i believe they're equivalent (which of course makes sense given that they're trying to accomplish the same thing).

Gotcha I could reformat the Ball cover to do something like that (sorry I'm new to things like eps-nets). I'll need to piece together some things first, not sure if I'll end up needing the RANN package dependency for this still or not.

Edit: It might be a few days before I can get to this. Feel free to submit a PR @20choudharysa if you want

A good introduction to the use of \epsilon-nets is:
"Lectures on Discrete Geometry" by J. Matoušek, see the end of section 13.1.

@peekxc is the ball cover a priority for you? If not, since you've already implemented landmarking, this could be useful exercise on our end. Though it would be some weeks before we get to it.

@corybrunson It's currently 3rd on the priority list, I'm not sure when I can get to it. Shouldn't be too hard if euclidean distance is desired: could just augment landmark procedure to accept either an integer N or a distance epsilon to determine when to stop the calculation of new landmarks. Then the procedure could be used as-is I think, combined with e.g. the RANN package for fixed-radius searching.

My other two priorities:

  1. refactor how distance measures are used in the MapperRef pipeline to allow the use of arbitrary (even C++ compiled user-defined) metrics via proxy registry entries
  2. Really rethink if how to more efficiently streamline the 'neighborhood' design I have now for the covers (which determines which combinations of open sets to compare in computing the nerve). Ideally I'm thinking of a kind of generative, range-based solution like what this library does but I'm not sure how to handle generating and passing around the ranges.

@peekxc I'd be very interested in item 1 as well, especially for the landmark maxmin method. p-norms, or their pth powers would be great.

Just as an FYI-- @corybrunson and I have nearly finished an implementation of the BallCover as discussed above, including the optimization (selection of a "landmark" set) described in the Dłotko article. We'll make that available via PR soon!

@yaraskaf Sounds good, I look forward to it! Once you submit the PR, could you please mark it as editable by me that way I can make it play nice with the upcoming nerve computation? I've made some moderately-sized changes in the upcoming version on the nerve computation to make it more general and efficient to address (2) above, which will affect how the Ball Cover interacts with the Mapper instances.

I did end up modifying the BallCover on the upcoming version to use general metrics (and to actually do something useful, lol). It's not super efficient though; just uses maxmin landmark selection (in C++ for euclidean, otherwise in R), and then proxy's pairwise dist overload. Implementation looks something like (in the cover construction method):

  ## Get filter values 
  fv <- filter()
  
  ## Construct the balls
  eps_lm <- landmarks(fv, eps=self$epsilon, dist_method=self$method)
  
  ## Get distance from each point to landmarks 
  dist_to_lm <- proxy::dist(fv, fv[eps_lm,,drop=FALSE], method = self$method)
  pts_within_eps <- function(lm_dist){ which(lm_dist < self$epsilon) }
  
  ## Construct the index set + the preimages
  index_set <- as.character(eps_lm)
  sets <- structure(as.list(apply(dist_to_lm, 2, pts_within_eps)), names=index_set)
 
  #...

Not sure if this is similar to what you have or completely different?

@mpcasey2718 I can add any specific distance measures to the maxmin procedure. The upcoming version includes support for theoretically any distance measure (even user-supplied) in pure R. Doing the same in C++ requires specifically coding up the other distances needed, which maybe one day I'll code up the major ones in an efficient way like this so that I can do something like this. But first I gotta pass my classes :)

@peekxc Great, thanks for your response! I'll definitely make the PR editable by you.

The basic cover construction is pretty similar to the code you posted, but just to give you a little more detail on the functionality we implemented:

  1. We left the original ball cover as another cover choice "DisjointBallCover" so as not to lose the option to create a cover of non-intersecting sets.
  2. We added the choice to construct the cover using either a fixed epsilon (i.e. cover the space with balls of a fixed radius) or a fixed number of balls (i.e. use whatever epsilon is required to cover the space with the given number of sets). This way, if the user truly wants a ball around each point instead of the minimal set required to cover the space, they can create a cover with the number of balls equal to the number of data points.
  3. We included a few options for choice of the first point in the landmark set-- a random point, the point with the highest eccentricity, or a user specified point. This has some effect on the final cover, though I'm not sure how large that effect is.

Does that sound reasonable to you?

That sounds awesome!