ZeroRin / BertGCN

Labels for unlabeled data while training the model

padshahrohan opened this issue · comments

BertGCN, and GCNs in general, are well suited to semi-supervised learning. I have a two-class classification dataset with a lot of unlabeled data. In finetune_bert.py and train_bert_gcn.py, labels are converted to class IDs for PyTorch computation. Since the unlabeled data is also passed into the training dataset, what should be put in the labels for this data? Going with the current code, it will be assigned class 0 by the following lines:

# transform one-hot label to class ID for pytorch computation
y = th.LongTensor((y_train + y_val + y_test).argmax(axis=1))
label = {}
label['train'], label['val'], label['test'] = y[:nb_train], y[nb_train:nb_train + nb_val], y[-nb_test:]
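
To make the concern concrete, here is a minimal sketch (with a toy one-hot matrix, assuming unlabeled rows are all zeros): argmax maps an all-zero row to index 0, so unlabeled documents silently become class 0.

import numpy as np
import torch as th

# toy one-hot matrix: rows 0-1 are labeled, row 2 is unlabeled (all zeros)
y_all = np.array([[0, 1],
                  [1, 0],
                  [0, 0]])
y = th.LongTensor(y_all.argmax(axis=1))
print(y)  # tensor([1, 0, 0]) -- the unlabeled row got class 0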

Can PyTorch take one-hot labels instead of class IDs? Or can some other approach be used?
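
For reference, recent PyTorch versions (1.10+) do accept class probabilities, e.g. a float one-hot matrix, as cross_entropy targets, though that alone does not solve the unlabeled-data problem. A small sketch:

import torch as th
import torch.nn.functional as F

logits = th.randn(3, 2)                        # 3 documents, 2 classes
hard = th.tensor([1, 0, 0])                    # class-ID targets
soft = F.one_hot(hard, num_classes=2).float()  # one-hot probability targets

loss_hard = F.cross_entropy(logits, hard)      # works in all versions
loss_soft = F.cross_entropy(logits, soft)      # works in PyTorch >= 1.10
assert th.isclose(loss_hard, loss_soft)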

I suppose that marking only the labeled data as train and the rest as test should do the trick.

As per my understanding of semi-supervised models, we should be able to use unlabeled data to train the model, but I also know the model learns only from labeled data. So how will BertGCN use unlabeled data to train the model? That is primarily what I want to do.

The prediction for a node is based not only on itself but also on its neighbors (both labeled and unlabeled), so unlabeled data is also touched during training.
During training we use the training data, the training labels, and the test data; the only thing not included is the test labels. To my understanding this is similar to semi-supervised learning.
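
A toy illustration of that point (plain PyTorch, not the repo's DGL code, with made-up shapes): a single row-normalized propagation step mixes a labeled node's features with its unlabeled neighbors', so a loss on the labeled node alone still back-propagates into the unlabeled embeddings.

import torch as th
import torch.nn.functional as F

# 3 nodes: node 0 is labeled; nodes 1 and 2 are its unlabeled neighbors
A = th.tensor([[1., 1., 1.],
               [1., 1., 0.],
               [1., 0., 1.]])            # adjacency with self-loops
A_hat = A / A.sum(1, keepdim=True)       # row-normalized propagation matrix

X = th.randn(3, 4, requires_grad=True)   # node embeddings (e.g. from BERT)
W = th.randn(4, 2, requires_grad=True)   # GCN weight matrix

logits = A_hat @ X @ W                   # one graph-convolution step
loss = F.cross_entropy(logits[:1], th.tensor([1]))  # loss on the labeled node only
loss.backward()
print(X.grad[1:])                        # nonzero: gradients reach unlabeled nodes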

There are two parts to this: part 1 is training BERT (a pretrained model) on our data, and part 2 is using BERT's embeddings to train the GCN, with a linear interpolation of the two applied for prediction.
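
For reference, the interpolation in part 2 combines the two softmax outputs with a weight (called m in the repo); the exact form below is a sketch from my reading of the paper:

import torch as th
import torch.nn.functional as F

m = 0.7                          # interpolation weight, a tunable hyperparameter
logits_gcn = th.randn(5, 2)      # GCN head output for 5 documents, 2 classes
logits_bert = th.randn(5, 2)     # BERT classification head output

# final prediction: weighted combination of the two distributions
pred = m * F.softmax(logits_gcn, dim=1) + (1 - m) * F.softmax(logits_bert, dim=1)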

In the case of part 2 (the GCN), does this mean that input_ids and attention_mask will be created from the entire dataset (labeled and unlabeled) with the line below, where the text variable consists of all the data?

input_ids, attention_mask = encode_input(text, model.tokenizer)
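
For what it's worth, here is a sketch of what such a helper might do, assuming a HuggingFace tokenizer (the exact signature in the repo may differ):

def encode_input(text, tokenizer, max_length=128):
    # tokenize every document, labeled and unlabeled, in one pass
    enc = tokenizer(text, max_length=max_length, truncation=True,
                    padding='max_length', return_tensors='pt')
    return enc.input_ids, enc.attention_mask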

But label['train'] will contain only the labels of the labeled training data; the indexes for unlabeled data will be skipped, because the train_mask matrix masks only the training labels (no val, test, or unlabeled entries), and the same holds for label['val'] and label['test'] respectively.

Is my understanding correct that the model is trained on the entire input_ids, and that the label['train'], label['val'], and label['test'] matrices are only used for calculating loss and accuracy?

In the case of part 1, training BERT (a pretrained model) on our data, are you suggesting to train only with labeled data and put the unlabeled data in test?

Note that the embeddings of unlabeled data affect the prediction results for labeled data during graph convolution, which means that BERT will learn to extract embeddings from unlabeled data that help classify the labeled data.

So we will be using unlabeled data to train both the BERT and GCN models.

For BERT, though, the train accuracy will be calculated over both labeled and unlabeled data, since both are part of the training set, which might be wrong:

(input_ids, attention_mask, label) = [x.to(gpu) for x in batch]
optimizer.zero_grad()
y_pred = model(input_ids, attention_mask)
y_true = label.type(th.long)
loss = F.cross_entropy(y_pred, y_true)  # every batch element enters the loss, labeled or not
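
One possible fix, as a sketch rather than the repo's code: give unlabeled documents a sentinel label and let cross_entropy skip them via ignore_index, masking the accuracy computation the same way (UNLABELED is a name I'm introducing here):

import torch as th
import torch.nn.functional as F

UNLABELED = -100                         # sentinel; also cross_entropy's default ignore_index

y_pred = th.randn(4, 2)                  # logits for a batch of 4 documents
y_true = th.tensor([1, 0, UNLABELED, UNLABELED])  # last two are unlabeled

loss = F.cross_entropy(y_pred, y_true, ignore_index=UNLABELED)

mask = y_true != UNLABELED               # compute accuracy only on labeled rows
acc = (y_pred.argmax(1)[mask] == y_true[mask]).float().mean()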

For the GCN:

train_mask = g.ndata['train'][idx].type(th.BoolTensor)
y_pred = model(g, idx)[train_mask]                # predictions restricted to train nodes
y_true = g.ndata['label_train'][idx][train_mask]  # labels restricted the same way
loss = F.nll_loss(y_pred, y_true)

Can we fix it somehow?

Also, can you clarify one more question: is the label matrix used only for calculating loss and accuracy?

I guess I'm not familiar enough with semi-supervised learning; I am a bit confused. Are you trying to assign some kind of pseudo-labels to unlabeled data and compute a loss on them, or do you simply not want to calculate a loss on the unlabeled data?

I have a total of 8823 documents, as follows:

200 labeled documents
8623 unlabeled documents

I want to use the unlabeled data to train the model without assigning it any labels. So how do I use it when finetuning BERT and training BertGCN?

Finetuning BERT with unlabeled data is out of the scope of our work; our finetuning mainly focuses on learning the target classification task.

For training BertGCN, putting all unlabeled data in the test set should do the trick. Their embeddings will contribute to the classification of the labeled nodes, and the BERT model will learn to extract more helpful embeddings from the unlabeled data in the process.
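
A sketch of that setup with the numbers above, assuming the 200 labeled documents come first in the corpus:

import numpy as np

n_docs, nb_train = 8823, 200

train_mask = np.zeros(n_docs, dtype=bool)
train_mask[:nb_train] = True             # loss and accuracy only on labeled nodes
test_mask = ~train_mask                  # unlabeled docs stay in the graph, no loss

# during graph convolution the 8623 "test" nodes still exchange messages with
# the 200 labeled nodes, so their embeddings shape the training signal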