alitouka / spark_dbscan

DBSCAN clustering algorithm on top of Apache Spark

Keep unique id of a point

lucaventurini opened this issue · comments

Let's say we want to cluster some objects on a subset of their features. We then transform these objects into Points, where the said subset will become the coordinates of the Points. We want to keep track of the remainder of the features not used for the algorithm. How do we proceed?
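For concreteness, the workflow described above can be sketched in plain Python (the record and field names are made up for illustration, not taken from spark_dbscan):

```python
# Hypothetical records: cluster on (x, y) only, keep the rest as metadata.
records = [
    {"id": 1, "x": 1.0, "y": 2.0, "label": "sensor-A"},
    {"id": 2, "x": 5.0, "y": 6.0, "label": "sensor-B"},
]

# Coordinates fed to the clustering algorithm...
points = [(r["x"], r["y"]) for r in records]

# ...and the remaining features we would like to re-attach afterwards,
# keyed by the unique id the issue asks to preserve.
metadata = {r["id"]: r["label"] for r in records}
```

The question is how to get from the cluster labels assigned to `points` back to the entries of `metadata` once the algorithm has run.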

A possible solution is to track each point by a unique identifier. The current source code already has such a field, but even if I force it to a value, it is overwritten at some point during pre-partitioning, and by the end of the algorithm all the identifiers have changed, so no join with the initial dataset is possible.

I think this is a critical issue, if confirmed. Joining the result of a clustering with some metadata is the most useful, if not the only, postprocessing step for making something of the results of DBSCAN (or any clustering algorithm).

At the moment there is no way to attach metadata to the points. The only available way is to perform a join at the end using the coordinates...
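The coordinate-based join suggested above can be sketched in plain Python (names are illustrative; this is not spark_dbscan's API). Clustering output is keyed by the coordinate tuple and looked up against the original records:

```python
# Hypothetical sketch: re-attach metadata to clustering output by
# joining on exact coordinates.

def join_on_coordinates(original, clustered):
    """original: list of (coords, metadata) pairs;
    clustered: list of (coords, cluster_id) pairs.
    Returns a list of (coords, metadata, cluster_id) triples."""
    labels = {tuple(coords): cid for coords, cid in clustered}
    return [(coords, meta, labels.get(tuple(coords)))
            for coords, meta in original]

original = [((1.0, 2.0), {"name": "a"}), ((5.0, 6.0), {"name": "b"})]
clustered = [((1.0, 2.0), 0), ((5.0, 6.0), 1)]
result = join_on_coordinates(original, clustered)
```

Note the fragility: this relies on exact floating-point equality and breaks down if two points share identical coordinates, which is exactly why a preserved unique id would be preferable.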

If you want to attach metadata, you need to create a proper field in the Point object and refactor the code, since Points are re-created several times in the algorithm...
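The refactor suggested here implies that every place the algorithm re-creates a Point must carry the metadata field along. A minimal illustration in Python (a stand-in class, not the actual spark_dbscan `Point`):

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class Point:
    coordinates: tuple
    cluster_id: int = -1
    metadata: dict = field(default_factory=dict)  # hypothetical extra field

# Wherever the algorithm rebuilds a point (e.g. to assign a cluster id),
# the metadata must be copied across, otherwise it is silently lost:
p = Point((1.0, 2.0), metadata={"name": "a"})
p2 = replace(p, cluster_id=0)  # metadata survives the re-creation
```

This copy-on-rebuild requirement is what makes the change a refactor rather than a one-line addition.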

I see your point, but a full join is an expensive operation that could be avoided if only the id were preserved.
So you confirm the id is rewritten during the runs on purpose?

Yes, the pointId you can see there is for internal processing only; no metadata storage is supported so far.