round indexes to improve clustering

Question

caseyjlaw opened this issue 6 years ago · comments

The time index of a candidate is scaled by dt. Candidates are sometimes found over a range of dt, and converting each index to the dt=1 scale for comparison introduces errors on the order of dt (e.g., for dt=8 the integration index is multiplied by 8 before being compared with candidates at dt=1). Because of this misalignment in time, clustering can miss candidates with large dt.
One possibility is to round indexes by factors of 2 or more (perhaps up to max dt) such that more candidates have the same time index during clustering. This should only be done to help clustering group these events; we don't want to save rounded indexes.
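
A minimal sketch of the rounding idea, assuming the time index is the integration index multiplied by dt and that rounding means snapping down to a multiple of the largest dt. The candidate values and variable names are hypothetical, not taken from rfpipe:

```python
import numpy as np

# Hypothetical example: the same event detected at dt=1 and at dt=8.
dt = np.array([1, 8])
integration = np.array([805, 100])  # index in units of each candidate's own dt

# Index on the dt=1 axis: [805, 800]. These are the values to save.
time_ind = integration * dt

# Rounded copy used only as clustering input: [800, 800].
max_dt = dt.max()
rounded = time_ind // max_dt * max_dt

print(time_ind, rounded)
```

The two detections differ by 5 integrations before rounding and coincide afterwards, while `time_ind` is left untouched for saving.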

Kshitij Aggarwal · Answer 1 · Mon Oct 29 2018 08:54:50 GMT+0800 (China Standard Time)

That is what I was doing in my original clustering tests as well:
time_ind = np.multiply(integration_ind,np.power(2,dt_ind))
It's better to cluster candidates by time of occurrence with respect to the lowest dt, which is integration*2^(dt_ind), since the integration index also scales with dt.
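
For illustration, here is that expression applied to made-up values, assuming dt_ind indexes dt factors of 1, 2, 4, 8 (powers of two, as the formula implies):

```python
import numpy as np

# Hypothetical candidates: one event detected at four different dt values.
integration_ind = np.array([800, 400, 200, 100])  # index at each candidate's own dt
dt_ind = np.array([0, 1, 2, 3])                   # dt factors 1, 2, 4, 8

# Scale every integration index onto the dt=1 time axis.
time_ind = np.multiply(integration_ind, np.power(2, dt_ind))

print(time_ind)  # [800 800 800 800]
```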

Casey Law · Answer 2 · Sat Dec 15 2018 02:19:19 GMT+0800 (China Standard Time)

Added a downsample argument to rfpipe.candidates.cluster_candidates. Default is to downsample by 2, which should help cluster things that fall at pixel boundaries (spatially, especially).

Casey Law · Answer 3 · Wed Dec 19 2018 04:43:19 GMT+0800 (China Standard Time)

@KshitijAggarwal I am not getting better clustering with this downsampling feature. I wonder if you have an opinion.
I am implementing a simple integer division of the data array going into hdbscan. So where it has a fit(data) method, I am instead doing fit(data//downsample). For downsample=2, it will do integer division, so [0,1,2,3] goes to [0,0,1,1].
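
A small numpy illustration of that integer division, with made-up index values. It also shows that adjacent values on either side of a bin boundary stay apart:

```python
import numpy as np

data = np.array([0, 1, 2, 3])
downsample = 2

binned = data // downsample
print(binned)  # [0 0 1 1]

# 0 and 1 now share a bin, but the adjacent values 1 and 2 still differ.
same_bin_01 = bool(binned[0] == binned[1])
same_bin_12 = bool(binned[1] == binned[2])
print(same_bin_01, same_bin_12)  # True False
```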
The problem is that I end up with many more clusters after downsampling. My naive assumption was that the values would be more similar and thus more clustered. The opposite seems to be true.
Any ideas why that happens?

Kshitij Aggarwal · Answer 4 · Wed Dec 19 2018 22:29:16 GMT+0800 (China Standard Time)

I am not sure. Are you downsampling in time, in frequency, or in sky position?
I would have expected it to generate fewer clusters (relative to a larger image) if the downsampling was done in sky position.
If it is done in time and frequency, then I am not sure. Maybe visualizing the clustering could help?

Casey Law · Answer 5 · Wed Dec 19 2018 22:34:19 GMT+0800 (China Standard Time)

I tried two things. First, downsampling all the input values (so pixel location, time, and DM) and second pixel location only. In both cases, I get way more clusters than without downsampling.
I suspect that downsampling produces more candidates that have identical locations, since the downsampling will make neighbors off by one unit (e.g., pixels 0 and 1) be equal. That makes the hamming metric equal to 0, whereas before the smallest distance was probably 1.
My guess is that the clustering algorithm treats those distance=0 pairs as more strongly clustered than pairs with distance=1, which might split the candidates up more than without downsampling.
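
The effect on the distances can be checked with a small sketch. It assumes candidates are rows of (x, y, t, dm) indexes, which is a hypothetical layout, and uses the normalized Hamming distance (the fraction of columns that differ, as in scipy.spatial.distance.hamming), so with four columns a single differing column gives 0.25 rather than 1:

```python
import numpy as np

def hamming(u, v):
    """Fraction of columns in which u and v differ."""
    return float(np.mean(np.asarray(u) != np.asarray(v)))

# Hypothetical candidates as (x, y, t, dm) index tuples.
a = np.array([10, 20, 5, 3])
b = np.array([11, 20, 5, 3])     # one pixel from a, same x bin after // 2
c = np.array([12, 20, 5, 3])     # one pixel from b, next x bin after // 2
far = np.array([500, 20, 5, 3])  # far from a in x

# Without downsampling, any difference in one column costs the same.
print(hamming(a, b), hamming(b, c), hamming(a, far))     # 0.25 0.25 0.25

# After downsampling, a and b become identical but b and c do not.
print(hamming(a // 2, b // 2), hamming(b // 2, c // 2))  # 0.0 0.25
```

Under these assumptions, downsampling turns some neighboring pairs into exact duplicates while leaving other equally close pairs at the same distance as a pair that is hundreds of pixels apart, which is consistent with the suspicion above.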

Kshitij Aggarwal · Answer 6 · Wed Dec 19 2018 22:46:18 GMT+0800 (China Standard Time)

I see. That could be a reason. Do you happen to have plots of this? It would be easier to interpret from there ... otherwise I can submit the pull request with the clustering_visualization function to generate those.

Casey Law · Answer 7 · Thu Dec 20 2018 06:03:46 GMT+0800 (China Standard Time)

Another idea is to play with the way the distances are calculated.
Currently we use metric='hamming', because it seemed to work. We could use a different metric or calculate the distance ourselves. It could be calculated with the hamming metric, but we can weight the columns to give more emphasis to some (e.g., x/y distance is weighted less than dm or t).
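
A rough sketch of what a column-weighted Hamming distance could look like, assuming columns ordered as (x, y, t, dm) and arbitrary example weights. The function name, column order, and weights are all hypothetical. scipy.spatial.distance.hamming also accepts a weight vector through its `w` argument.

```python
import numpy as np

def weighted_hamming(u, v, w):
    """Weighted fraction of differing columns: sum(w * (u != v)) / sum(w)."""
    u, v = np.asarray(u), np.asarray(v)
    w = np.asarray(w, dtype=float)
    return float(np.average(u != v, weights=w))

# Hypothetical weights: x and y count half as much as t and dm.
weights = [0.5, 0.5, 1.0, 1.0]

ref = [10, 20, 5, 3]
off_in_x = [11, 20, 5, 3]  # differs from ref only in x
off_in_t = [10, 20, 6, 3]  # differs from ref only in t

d_x = weighted_hamming(ref, off_in_x, weights)  # 0.5 / 3
d_t = weighted_hamming(ref, off_in_t, weights)  # 1.0 / 3
print(d_x, d_t)
```

With these weights a mismatch in pixel position costs half as much as a mismatch in time or DM. How such a function would be passed to the clustering step is left open here.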

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html#scipy.spatial.distance.hamming

Casey Law · Answer 8 · Thu Jan 17 2019 01:31:13 GMT+0800 (China Standard Time)

I'm going to close this, since it really is a question of using the best metric. Rounding is just a workaround and it seems like a bad way to do it.