machine-learning clustering semi-supervised-learning sbm mixture-of-gaussians mixture-model graphs

SSC-IPA

Source code of "Semi-Supervised Clustering with Inaccurate Pairwise Annotations" (Gribel, Gendreau and Vidal, 2021).

Run

To run the SSC-IPA algorithm, open the Julia REPL and enter the following commands:

julia> include("Optimizer.jl")

julia> in = Input(seed, max_it, supervision_flag, prior)

julia> main("dataset", "must_graph", "cannot_graph", in)

Example

julia> include("Optimizer.jl")

julia> in = Input(1234, 50, 1, 0.9)

julia> main("vertebral.data", "vertebral-must.link", "vertebral-cannot.link", in)

Parameters of `Input`

seed: Numerical seed for the random number generator.

max_it: Maximum number of iterations the algorithm will take.

supervision_flag: Determines if pairwise supervision is used (0: unsupervised algorithm, 1: semi-supervised algorithm).

prior: Prior estimate of the experts' accuracy (between 0 and 1; enter -1 for no prior).
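
The ranges listed above can be checked before the values are passed to `Input`. The sketch below is a hypothetical helper, not part of the repository; `check_input_args` is an illustrative name, and the requirement that `max_it` be at least 1 is an assumption.

```julia
# Hypothetical helper (not part of SSC-IPA): validates the documented ranges
# of max_it, supervision_flag and prior before they are passed to `Input`.
function check_input_args(max_it, supervision_flag, prior)
    # Assumption: at least one iteration is required.
    max_it >= 1 || throw(ArgumentError("max_it must be at least 1"))
    # 0: unsupervised algorithm, 1: semi-supervised algorithm.
    supervision_flag in (0, 1) ||
        throw(ArgumentError("supervision_flag must be 0 or 1"))
    # The prior lies between 0 and 1, or equals -1 when no prior is used.
    (prior == -1 || 0 <= prior <= 1) ||
        throw(ArgumentError("prior must lie in [0, 1], or be -1 for no prior"))
    return true
end

# Values taken from the example above: max_it = 50, supervision_flag = 1, prior = 0.9
check_input_args(50, 1, 0.9)
```

Checking the values up front gives a readable error message instead of a failure inside the optimizer.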

Parameters of the `main` function

dataset: Dataset file. Important: You must provide a file with the .data extension along with a labels (ground-truth) file. The labels file must have the .label extension. Example: For a dataset named "vertebral.data", you must provide the "vertebral.label" file in the same folder.

must_graph: Must-link graph file.

cannot_graph: Cannot-link graph file.

Important: The dataset, labels, must-link graph, and cannot-link graph files must be within the /data folder inside the project.
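
A quick way to confirm that all four files are in place before calling `main` is sketched below. `missing_inputs` is a hypothetical helper, not part of the repository, and it assumes the `data` folder is resolved relative to the current working directory.

```julia
# Hypothetical helper (not part of SSC-IPA): returns the names of the required
# files that are not found in `data_dir`.
function missing_inputs(dataset, must_graph, cannot_graph; data_dir = "data")
    # The labels file shares the dataset's base name and has the .label extension.
    labels = first(splitext(dataset)) * ".label"
    required = [dataset, labels, must_graph, cannot_graph]
    return [f for f in required if !isfile(joinpath(data_dir, f))]
end

# An empty list means every required file was found.
println(missing_inputs("vertebral.data", "vertebral-must.link", "vertebral-cannot.link"))
```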

Data format

Dataset files. The dataset file has N rows and D columns, where N is the number of data samples and D is the number of features. Each line contains the D feature values of one data sample, where x_ij corresponds to the j-th feature of the i-th sample. Feature values are separated by a single space, as depicted in the scheme below:

x_11 x_12 x_13 ... x_1D
x_21 x_22 x_23 ... x_2D
...  ...  ...  ... ...
x_N1 x_N2 x_N3 ... x_ND

Important: The dataset files must have the .data extension.
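
As an illustration of this layout, the sketch below writes a small feature matrix in the space-separated format and reads it back with Julia's standard `DelimitedFiles` library. The matrix values and the file name `toy.data` are made up for the example; how SSC-IPA itself parses the file is not shown here.

```julia
using DelimitedFiles

# Toy data: N = 4 samples (rows) and D = 3 features (columns). Values are made up.
X = [5.1 3.5 1.4;
     4.9 3.0 1.4;
     6.2 2.9 4.3;
     5.9 3.0 5.1]

# Write one sample per line, with features separated by a single space.
path = joinpath(mktempdir(), "toy.data")
writedlm(path, X, ' ')

# Read the file back and recover the dimensions.
X_read = readdlm(path, ' ', Float64)
N, D = size(X_read)
println("N = ", N, ", D = ", D)
```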

Graph files. A graph file (must-link or cannot-link) has m rows and 3 columns, where m is the number of connections (links) in the graph. The first two columns identify the two data samples joined by an edge, whereas the third column gives the edge weight. The scheme below describes a graph file, where s_i and t_i are two connected samples, and w_i is the corresponding edge weight:

s_1 t_1 w_1
s_2 t_2 w_2
... ... ...
s_m t_m w_m
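
A graph file in this three-column layout could be produced as sketched below. Several details are assumptions rather than documented behavior: samples are referred to by 1-based row index in the dataset, columns are separated by a single space, and every edge has weight 1.0. `write_graph` is a hypothetical helper, not part of the repository.

```julia
using DelimitedFiles

# Hypothetical must-link annotations as (s, t, w) triples.
# Assumptions: 1-based sample indices and a unit weight on every edge.
must_links = [(1, 2, 1.0), (3, 4, 1.0)]

# Hypothetical helper: writes one edge per line as "s t w".
function write_graph(path, links)
    open(path, "w") do io
        for (s, t, w) in links
            println(io, s, ' ', t, ' ', w)
        end
    end
end

graph_path = joinpath(mktempdir(), "toy-must.link")
write_graph(graph_path, must_links)

# Read the file back as an m x 3 matrix.
G = readdlm(graph_path, ' ', Float64)
println(size(G))
```

The same helper can write a cannot-link file; only the list of triples changes.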

Labels files. A labels file lists the ground-truth cluster of each sample in the dataset, where y_i is the label of the i-th sample:

y_1
y_2
...
y_N

Important: The labels files must have the .label extension.
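
The naming rule and layout for labels files can be sketched as follows. The label values are made up, and the example assumes integer labels written one per line.

```julia
# Derive the labels file name from the dataset file name.
dataset_file = "vertebral.data"
labels_file = first(splitext(dataset_file)) * ".label"
println(labels_file)

# Toy ground-truth labels (made up), one per sample.
# Assumption: labels are integers, written one per line.
y = [1, 1, 2, 3]
labels_path = joinpath(mktempdir(), "toy.label")
open(labels_path, "w") do io
    foreach(label -> println(io, label), y)
end

# Read the labels back.
y_read = parse.(Int, readlines(labels_path))
```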


