wmayner / pyemd

Fast EMD for Python: a wrapper for Pele and Werman's C++ implementation of the Earth Mover's Distance metric

Do not understand output

josedvq opened this issue · comments

Hi, I'd appreciate some explanation of what the inputs/outputs actually mean. I understand the distance_matrix as the matrix of distances between the cluster centroids/representatives described in the paper, where element (i, j) is the distance between cluster i from first_histogram and cluster j from second_histogram. In that case there is of course no restriction on distance_matrix being symmetric. However, I see that in the examples the matrix is always symmetric. Is this just a coincidence, or is there something I'm missing?

Also, why do the following two examples produce different results? And why is the flow the same in both?

import numpy as np
from pyemd import emd_with_flow

first_histogram = np.array([4.0, 6.0])
second_histogram = np.array([5.0, 5.0])
distance_matrix = np.array([[0.5, 0.0], [0.0, 0.5]])
emd_with_flow(first_histogram, second_histogram, distance_matrix)
(0.0, [[4.0, 0.0], [1.0, 5.0]])

first_histogram = np.array([4.0, 6.0])
second_histogram = np.array([5.0, 5.0])
distance_matrix = np.array([[0.0, 0.5],[0.5, 0.0]])
emd_with_flow(first_histogram, second_histogram, distance_matrix)
(0.5, [[4.0, 0.0], [1.0, 5.0]])

Thanks for your time.

Hi,

It is not a coincidence that the distance matrix is symmetric. The distance matrix gives the ground distances between the bins, and the ground distance is assumed to be a metric. If it isn't, then the EMD will not be a metric either.

Moreover, I believe that the optimization routines that solve the min-cost-flow problem rely on the metric property. From the paper (emphasis mine):

A Monge sequence contains edges in the flow-network that can be pre-flowed (in the order of the sequence) without changing the min-cost solution [24, 16]. For example, if the ground-distance is a metric, zero-cost edges are Monge sequence [38]. Alon et al. [4] introduced an efficient algorithm which determines the longest Monge sequence.

So, if you don't pass a distance matrix that represents a metric (i.e., it is nonnegative, zero only when i == j, symmetric, and satisfies the triangle inequality), the output may not be valid.

This explains why your two examples give different results.

In the first example, the distance matrix is not a metric, since there are nonzero values on the diagonal: the distance from a bin to itself is 0.5. In the underlying C++ implementation from Pele & Werman, I believe the diagonal entries are assumed to be zero. Then, since the off-diagonal distances between the two bins are 0, the cost of moving any amount of mass is 0, and therefore so is the EMD (for any two signatures of equal sum, and also for two signatures of unequal sum if you pass extra_mass_penalty=0).

In the second example, the distance matrix is a metric, and the answer is correct. The minimum-cost solution is to move 1 unit of mass from the second bin to the first over a distance of 0.5, which costs 1 * 0.5 = 0.5.
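
To make that arithmetic concrete, here is a quick check (not part of the library, just a sketch using your second example) showing that the reported EMD equals the total cost of the returned flow, i.e., the elementwise product of the flow matrix and the distance matrix, summed:

import numpy as np
from pyemd import emd_with_flow

first_histogram = np.array([4.0, 6.0])
second_histogram = np.array([5.0, 5.0])
distance_matrix = np.array([[0.0, 0.5], [0.5, 0.0]])

emd_value, flow = emd_with_flow(first_histogram, second_histogram, distance_matrix)
# Total cost of the flow: sum over (i, j) of flow[i][j] * distance_matrix[i][j]
cost = np.sum(np.array(flow) * distance_matrix)
print(emd_value, cost)  # both should be 0.5: 1 unit of mass moved over distance 0.5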

Note that the metric requirement is mentioned in the documentation for the distance_matrix argument:

This defines the underlying metric, or ground distance, by giving the pairwise distances between the histogram bins. It must represent a metric; there is no warning if it doesn't.
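
Since there is no warning, a quick sanity check you could run yourself before calling emd might look like the following (this helper is not part of pyemd; it just tests the properties listed above):

import numpy as np

def looks_like_metric(d, tol=1e-9):
    """Rough check of the metric properties: zero diagonal, positive elsewhere,
    symmetric, and satisfying the triangle inequality."""
    d = np.asarray(d, dtype=float)
    off_diagonal = ~np.eye(d.shape[0], dtype=bool)
    zero_diagonal = np.allclose(np.diag(d), 0.0)
    positive_elsewhere = np.all(d[off_diagonal] > tol)
    symmetric = np.allclose(d, d.T)
    # Triangle inequality: d[i, k] <= d[i, j] + d[j, k] for all i, j, k.
    triangle = np.all(d[:, None, :] <= d[:, :, None] + d[None, :, :] + tol)
    return zero_diagonal and positive_elsewhere and symmetric and triangle

print(looks_like_metric([[0.5, 0.0], [0.0, 0.5]]))  # False: your first distance matrix
print(looks_like_metric([[0.0, 0.5], [0.5, 0.0]]))  # True: your second distance matrix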

Thanks for your detailed answer. I think I understand what is going on now. For the distance matrix to be symmetric with a zero diagonal, two assumptions must hold:

  • The distance measure is a metric
  • The histogram bin centers of P and Q are the same

This implementation seems to assume both. The paper only seems to be explicit about the first assumption; I do not see the restriction that the histogram bin centers be the same stated anywhere. Have you seen this anywhere? The general case where they differ is very interesting in computer vision.

The case where the histogram bin centers of P and Q are not the same is not more general (mathematically, at least). This is because one can simply consider the union of the two sets of bins as the region over which mass is distributed.

For example, suppose we have the following bins, signatures, and distances:

bins1 = [0, 2]
bins2 = [1, 3]
signature_over_bins1 = [1, 1]
signature_over_bins2 = [1, 1]
distances_from_bins1_to_bins2 = [
    [1, 3],
    [1, 1]
]

Then that is equivalent to the following:

s1_full = [1, 0, 1, 0]
s2_full = [0, 1, 0, 1]
distances_from_all_bins_to_all_bins = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0]
]

This equivalence is why the restriction you mention is not really a restriction. I believe that is why there's no mention of it in the paper; discussions of the EMD can assume that the signatures are distributions over the same bins without loss of generality. The main point is that the distance matrix is a concrete instantiation of the underlying metric. Here, the second distance matrix (but not the first) has the metric properties.

That said, the first representation is clearly more concise. In this implementation, there is no way to specify the bins separately from the mass distributions; as you say, the signatures are assumed to be distributions over the same bins, which is why the distance matrix is assumed to have zeros on the diagonal.

The upshot is that you can still use the library in your case—you just need to set up the appropriate distance matrix (as in the second representation above).
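
For instance, a minimal sketch of that setup with this library, using the toy numbers above (the variable names are just illustrative):

import numpy as np
from pyemd import emd

# Union of the two sets of bin centers: bins1 = [0, 2] and bins2 = [1, 3].
all_bins = np.array([0.0, 1.0, 2.0, 3.0])

# Each signature is embedded as a distribution over the union of bins,
# with zero mass at the bins it doesn't use.
s1_full = np.array([1.0, 0.0, 1.0, 0.0])
s2_full = np.array([0.0, 1.0, 0.0, 1.0])

# Pairwise ground distances between all bin centers (this is a metric).
distance_matrix = np.abs(all_bins[:, None] - all_bins[None, :])

# Move 1 unit of mass from bin 0 to bin 1 and 1 unit from bin 2 to bin 3,
# each over a distance of 1, so the EMD should be 2.0.
print(emd(s1_full, s2_full, distance_matrix))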

That makes complete sense. Thanks for taking the time.