danielgribel / SSC-IPA

Source code of "Semi-Supervised Clustering with Inaccurate Pairwise Annotations" (Gribel, Gendreau and Vidal, 2021)

SSC-IPA

Source code of "Semi-Supervised Clustering with Inaccurate Pairwise Annotations" (Gribel, Gendreau and Vidal, 2021).

Related Article

Semi-Supervised Clustering with Inaccurate Pairwise Annotations: https://arxiv.org/abs/2104.02146

Run

To run the SSC-IPA algorithm, open the Julia REPL and run the following commands:

julia> include("Optimizer.jl")

julia> in = Input(seed, max_it, supervision_flag, prior)

julia> main("dataset", "must_graph", "cannot_graph", in)

Example

julia> include("Optimizer.jl")

julia> in = Input(1234, 50, 1, 0.9)

julia> main("vertebral.data", "vertebral-must.link", "vertebral-cannot.link", in)

Parameters of Input

seed: Numerical seed for the random number generator.

max_it: Maximum number of iterations the algorithm will take.

supervision_flag: Determines if pairwise supervision is used (0: unsupervised algorithm, 1: semi-supervised algorithm).

prior: Prior estimate of the experts' accuracy (between 0 and 1; enter -1 for no prior).
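The value ranges listed above can be checked before building an Input. The sketch below is a hypothetical helper, not part of the SSC-IPA code. It assumes that max_it must be positive, which the README does not state, and it leaves seed unchecked.

```julia
# Hypothetical helper (not part of SSC-IPA): validates the documented ranges
# of the Input arguments before they are passed on.
function check_input_args(seed, max_it, supervision_flag, prior)
    # Assumption: a maximum iteration count only makes sense if positive.
    max_it > 0 || throw(ArgumentError("max_it must be positive"))
    # 0: unsupervised algorithm, 1: semi-supervised algorithm.
    supervision_flag in (0, 1) ||
        throw(ArgumentError("supervision_flag must be 0 or 1"))
    # The prior is either a value in [0, 1] or -1 for no prior.
    (prior == -1 || 0 <= prior <= 1) ||
        throw(ArgumentError("prior must be in [0, 1], or -1 for no prior"))
    return true
end
```

For instance, check_input_args(1234, 50, 1, 0.9) passes, whereas a supervision_flag of 2 raises an ArgumentError.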

Parameters of the main function

dataset: Dataset file. Important: You must provide a file with the .data extension along with a labels (ground-truth) file. The labels file must have the .label extension. Example: For a dataset named "vertebral.data", you must provide the "vertebral.label" file in the same folder.

must_graph: Must-link graph file.

cannot_graph: Cannot-link graph file.

Important: The dataset, labels, must-link graph, and cannot-link graph files must be within the /data folder inside the project.
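Since four files are expected in the same folder, a quick pre-flight check can catch a missing file before calling main. The sketch below is a hypothetical helper, not part of the SSC-IPA code. It assumes the data folder is reachable as "data" relative to the current working directory, and it derives the labels file name by swapping the .data extension for .label, as described above.

```julia
# Hypothetical helper (not part of SSC-IPA): returns the paths of the
# expected input files that are missing from the data folder.
function missing_inputs(dataset, must_graph, cannot_graph; data_dir = "data")
    endswith(dataset, ".data") ||
        throw(ArgumentError("dataset file must have the .data extension"))
    # "vertebral.data" -> "vertebral.label"
    labels = splitext(dataset)[1] * ".label"
    paths = [joinpath(data_dir, f)
             for f in (dataset, labels, must_graph, cannot_graph)]
    return filter(!isfile, paths)
end
```

An empty result means all four files were found. For the example above, the call would be missing_inputs("vertebral.data", "vertebral-must.link", "vertebral-cannot.link").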

Data format

Dataset files. The dataset file has n rows and d columns, where n is the number of data samples and d is the number of features. Each line contains the values of the d features of one data sample, where xij corresponds to the j-th feature of the i-th sample. Feature values are separated by a single space, as depicted in the scheme below:

x11 x12 x13 ... x1d
x21 x22 x23 ... x2d
... ... ... ... ...
xn1 xn2 xn3 ... xnd

Important: The dataset files must have the .data extension.
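As an illustration of the dataset format, the following sketch writes and reads a space-separated matrix with Julia's DelimitedFiles standard library. The function names save_dataset and load_dataset are hypothetical and are not part of the SSC-IPA code, which may parse its input differently.

```julia
using DelimitedFiles

# Hypothetical helper: writes an n x d matrix, one sample per line,
# with feature values separated by a single space.
save_dataset(path, X) = writedlm(path, X, ' ')

# Hypothetical helper: reads a whitespace-separated file into an
# n x d Float64 matrix (rows are samples, columns are features).
load_dataset(path) = readdlm(path, Float64)
```

After X = load_dataset("data/vertebral.data"), size(X) gives (n, d) and X[i, j] is the j-th feature of the i-th sample.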

Graph files. A graph file (must-link or cannot-link) has m rows and 3 columns, where m is the number of connections (links) in the graph. The first two columns represent the two data samples of an edge, and the third column represents the edge weight. The scheme below describes a graph file, where si and ti are two connected samples, and wi is the corresponding edge weight:

s1 t1 w1
s2 t2 w2
... ... ...
sm tm wm
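A graph file in this layout can be parsed as sketched below. The helper load_graph is hypothetical and not part of the SSC-IPA code. The README does not state whether sample indices start at 0 or 1, so the sketch returns them exactly as written in the file.

```julia
using DelimitedFiles

# Hypothetical helper: reads a must-link or cannot-link file with rows
# "s t w" and returns the edges and their weights.
function load_graph(path)
    G = readdlm(path, Float64)
    size(G, 2) == 3 ||
        throw(ArgumentError("expected 3 columns (s, t, w), got $(size(G, 2))"))
    # Sample indices are kept as found in the file (index base not assumed).
    edges = [(Int(G[i, 1]), Int(G[i, 2])) for i in 1:size(G, 1)]
    weights = G[:, 3]
    return edges, weights
end
```

Here edges[i] holds the pair (si, ti) and weights[i] holds wi.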

Labels files. A labels file gives the ground-truth cluster of each sample of the dataset, one label per line, where yi corresponds to the label of the i-th sample:

y1
y2
...
yn

Important: The labels files must have the .label extension.
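Because the labels file must pair with the dataset file, it can help to confirm that both have the same number of samples. The sketch below is a hypothetical helper, not part of the SSC-IPA code. It assumes one label per line, ignores blank lines, and keeps labels as strings because the README does not specify their type.

```julia
# Hypothetical helper: loads the labels that accompany a .data file and
# checks that there is exactly one label per data sample.
function load_labels(dataset_path)
    # "vertebral.data" -> "vertebral.label" in the same folder
    label_path = splitext(dataset_path)[1] * ".label"
    labels = [strip(l) for l in eachline(label_path) if !isempty(strip(l))]
    # Count the non-blank rows of the dataset file (one row per sample).
    n = count(l -> !isempty(strip(l)), eachline(dataset_path))
    length(labels) == n ||
        throw(ArgumentError("found $(length(labels)) labels for $n samples"))
    return labels
end
```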
