[RuntimeError] Retrain Error
jomariya23156 opened this issue
Hi, I got this error when I tried to retrain the model. What could be possible causes?
RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1
I used this code setting
address_parser = AddressParser(model_type="best", device=0)
lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])
I have converted my training data into a pickle file in the format shown in the documentation example: a list of tuples ( 'address text', [list of tags, one per word] ). I have also verified that the number of words in each tuple matches the number of elements in its tag list.
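For readers following along, here is a minimal sketch of the data layout described above: a list of `(address, tags)` tuples written to a pickle file. The sample addresses, the tag names, and the helper names `check_format` and `dump_and_reload` are illustrative assumptions, not taken from the actual dataset or from deepparse.

```python
import pickle

# Illustrative samples only: addresses and tag names are assumptions.
samples = [
    ("350 rue des Lilas", ["StreetNumber", "StreetName", "StreetName", "StreetName"]),
    ("12 main street", ["StreetNumber", "StreetName", "StreetName"]),
]


def check_format(data):
    """Return the indices of entries that are not (str, list-of-str) tuples."""
    bad = []
    for i, item in enumerate(data):
        ok = (
            isinstance(item, tuple)
            and len(item) == 2
            and isinstance(item[0], str)
            and isinstance(item[1], list)
            and all(isinstance(tag, str) for tag in item[1])
        )
        if not ok:
            bad.append(i)
    return bad


def dump_and_reload(data, path):
    """Pickle the list to `path`, then load it back to confirm the round trip."""
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        return pickle.load(f)
```

Checking the shape of each entry before pickling only rules out structural problems; it says nothing about whether the token and tag counts agree.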
There seems to be a size mismatch pertaining to the sequence lengths. Could you please share the stack trace associated with the error so I can get a better understanding of what happened?
Here it is.
Also, this is my pickle data file that I'm about to retrain and its .csv version before dumping it to a pickle file.
https://drive.google.com/file/d/1YHFSgQ2JpFL-mx_fhOa5Gwl6pEocBQa-/view?usp=sharing
Epoch: 1/15 Step: 13/3750 0.35% | |ETA: 9632.69s loss: 9.458249 accuracy: 70.056496
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-30-cd5dde7d9057> in <module>
----> 1 address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])
~\anaconda3\lib\site-packages\deepparse\parser\address_parser.py in retrain(self, dataset_container, train_ratio, batch_size, epochs, num_workers, learning_rate, callbacks, seed, logging_path)
296 batch_metrics=[accuracy])
297
--> 298 train_res = exp.train(train_generator,
299 valid_generator=valid_generator,
300 epochs=epochs,
~\anaconda3\lib\site-packages\poutyne\framework\experiment.py in train(self, train_generator, valid_generator, **kwargs)
475 List of dict containing the history of each epoch.
476 """
--> 477 return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
478
479 def train_dataset(self, train_dataset, valid_dataset=None, **kwargs) -> List[Dict]:
~\anaconda3\lib\site-packages\poutyne\framework\experiment.py in _train(self, training_func, callbacks, lr_schedulers, keep_only_last_best, save_every_epoch, disable_tensorboard, seed, *args, **kwargs)
616
617 try:
--> 618 return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
619 finally:
620 if tensorboard_writer is not None:
~\anaconda3\lib\site-packages\poutyne\framework\model.py in fit_generator(self, train_generator, valid_generator, epochs, steps_per_epoch, validation_steps, batches_per_step, initial_epoch, verbose, progress_options, callbacks)
546 self._fit_generator_n_batches_per_step(epoch_iterator, callback_list, batches_per_step)
547 else:
--> 548 self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
549
550 return epoch_iterator.epoch_logs
~\anaconda3\lib\site-packages\poutyne\framework\model.py in _fit_generator_one_batch_per_step(self, epoch_iterator, callback_list)
626 with self._set_training_mode(True):
627 for step, (x, y) in train_step_iterator:
--> 628 step.loss, step.metrics, _ = self._fit_batch(x, y, callback=callback_list, step=step.number)
629 step.size = self.get_batch_size(x, y)
630
~\anaconda3\lib\site-packages\poutyne\framework\model.py in _fit_batch(self, x, y, callback, step, return_pred)
649 self.optimizer.zero_grad()
650
--> 651 loss_tensor, metrics, pred_y = self._compute_loss_and_metrics(x,
652 y,
653 return_loss_tensor=True,
~\anaconda3\lib\site-packages\poutyne\framework\model.py in _compute_loss_and_metrics(self, x, y, return_loss_tensor, return_pred)
1225 loss = float(loss)
1226 with torch.no_grad():
-> 1227 metrics = self._compute_batch_metrics(pred_y, y)
1228 for epoch_metric in self.epoch_metrics:
1229 epoch_metric(pred_y, y)
~\anaconda3\lib\site-packages\poutyne\framework\model.py in _compute_batch_metrics(self, pred_y, y)
1233
1234 def _compute_batch_metrics(self, pred_y, y):
-> 1235 metrics = [metric(pred_y, y) for metric in self.batch_metrics]
1236 return self._compute_metric_array(metrics, self.unflatten_batch_metrics_names)
1237
~\anaconda3\lib\site-packages\poutyne\framework\model.py in <listcomp>(.0)
1233
1234 def _compute_batch_metrics(self, pred_y, y):
-> 1235 metrics = [metric(pred_y, y) for metric in self.batch_metrics]
1236 return self._compute_metric_array(metrics, self.unflatten_batch_metrics_names)
1237
~\anaconda3\lib\site-packages\deepparse\metrics\accuracy.py in accuracy(pred, ground_truth)
6 Accuracy per tag.
7 """
----> 8 return acc(pred.transpose(0, 1).transpose(-1, 1), ground_truth)
~\anaconda3\lib\site-packages\poutyne\framework\metrics\batch_metrics.py in acc(y_pred, y_true, ignore_index, reduction)
70 weights = (y_true != ignore_index).float()
71 num_labels = weights.sum()
---> 72 acc_pred = (y_pred == y_true).float() * weights
73
74 if reduction in ['mean', 'sum']:
~\anaconda3\lib\site-packages\torch\tensor.py in wrapped(*args, **kwargs)
26 def wrapped(*args, **kwargs):
27 try:
---> 28 return f(*args, **kwargs)
29 except TypeError:
30 return NotImplemented
RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1
It seems that some data points do not have the same number of tokens as tags in the ground truth. We split each address on the whitespace character, so you may want to take a look at that.
@davebulaval I store my labeled lists for retraining in the 'target' column of my DataFrame, and I ran these lines:
df['len_target'] = df['target'].apply(lambda x: len(x))
df['len_raw'] = df['raw_address'].apply(lambda x: len(x.split()))
np.sum(df['len_target'] != df['len_raw'])
The output is 0. This is to make sure that the number of raw string words (raw_address) split by whitespace is equal to the number of elements in the list for retraining the model.
Could you share your code with me, privately or not?
I've also tested on my side, and I get the same results. Maybe something buggy happened later on during the vectorizing (we also remove the "," since none were shown during training, which lowered the results; we are working on a more robust fix).
Here you go.
https://drive.google.com/file/d/1P7jC-vI335vFTuFzGGJDXzeX4Qv-5rpr/view?usp=sharing
The training data, the data preparation code, and the training process are all in this zip.
On my side, using the pickled data, I find length differences between some addresses and their ground truth:
training_container = PickleDatasetContainer('deepparse_retrain.pickle')
[(x, y, len(x.split(" ")), len(y), x.split(" ")) for x, y in training_container.data if len(x.split(" ")) != len(y)]
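The same check can be sketched without deepparse, on a plain list of `(address, tags)` tuples. The function name `find_mismatches` and the sample data are hypothetical; the point is that splitting on a single space, as in the check above, and splitting with no argument can disagree.

```python
def find_mismatches(data, sep=" "):
    """Return (index, address, n_tokens, n_tags) for entries whose counts differ.

    sep=" " splits on every single space; sep=None behaves like str.split()
    with no argument and collapses runs of whitespace.
    """
    mismatches = []
    for i, (address, tags) in enumerate(data):
        tokens = address.split(sep)
        if len(tokens) != len(tags):
            mismatches.append((i, address, len(tokens), len(tags)))
    return mismatches


# Hypothetical data: the second address contains a double space.
data = [
    ("12 main street", ["StreetNumber", "StreetName", "StreetName"]),
    ("12  main street", ["StreetNumber", "StreetName", "StreetName"]),
]

strict = find_mismatches(data)             # splits on " "
lenient = find_mismatches(data, sep=None)  # splits like str.split()
```

Here `strict` flags the second entry (4 tokens against 3 tags) while `lenient` is empty, which is how a check based on `.split()` can report zero mismatches on data that still fails during training.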
@davebulaval Ah, I got it. In some cases there are double whitespaces, so splitting on a single space with .split(" ") produces an extra empty token that I didn't label or handle for training. My own check used .split() with no argument, which collapses repeated whitespace, so it missed these cases. Or maybe something went wrong when I pickled it. Thanks a lot!
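One possible cleanup, sketched under the assumption that repeated whitespace is the only source of mismatch: collapse whitespace runs in each address before pickling, then keep only entries whose token count matches the tag count. The helper names `normalize_whitespace` and `clean_dataset` are hypothetical and not part of deepparse.

```python
def normalize_whitespace(address):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(address.split())


def clean_dataset(data):
    """Split (address, tags) tuples into cleaned entries and dropped entries.

    An entry is kept when, after normalization, splitting on a single space
    yields exactly one token per tag.
    """
    cleaned, dropped = [], []
    for address, tags in data:
        fixed = normalize_whitespace(address)
        if len(fixed.split(" ")) == len(tags):
            cleaned.append((fixed, tags))
        else:
            dropped.append((address, tags))
    return cleaned, dropped
```

Entries that still mismatch after normalization are returned separately rather than silently discarded, so they can be inspected and relabeled by hand.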
Could you send the cleaned addresses as a CSV afterwards, so we can improve the original dataset (which we want to release soon)?
Also, could you include your full name so you can be added as an author of this part of the dataset?
I'm afraid not. This is the dataset I got from the recent Shopee Code League 2021 competition (a SEA coding competition), and the competition has already concluded. So I'm done with this project and am currently working on another one. However, you can visit https://www.kaggle.com/c/scl-2021-ds/code, where I saw some people posting their cleaning code. Hope this helps.