GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/

[RuntimeError] Retrain Error

jomariya23156 opened this issue · comments

Hi, I got this error when I tried to retrain the model. What could the possible causes be?

RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1

I used this code setup:

address_parser = AddressParser(model_type="best", device=0)
# decay the learning rate by 10x after every epoch
lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
# 0.8 is the train/validation split ratio
address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])

I have transformed my training data into a pickle file with the format shown in the doc example: a list of tuples ('address text', [list of tags corresponding to each word]). Moreover, I have already made sure that the number of words in each tuple matches the number of elements in its corresponding tag list.
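
For context, building that pickle from a DataFrame looks roughly like this (a simplified sketch, not my exact preparation code; the address and tag names are just the illustrative ones from the deepparse docs, and the raw_address / target column names match the check I show further down):

import pickle
import pandas as pd

# one row per address: the raw string and its per-word tag list
df = pd.DataFrame({
    "raw_address": ["350 rue des Lilas Ouest Quebec Quebec G1L 1B6"],
    "target": [["StreetNumber", "StreetName", "StreetName", "StreetName", "Orientation",
                "Municipality", "Province", "PostalCode", "PostalCode"]],
})

# deepparse expects a list of ('address text', [one tag per whitespace-separated word]) tuples
data = list(zip(df["raw_address"], df["target"]))

with open("deepparse_retrain.pickle", "wb") as f:
    pickle.dump(data, f)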

There seems to be a size mismatch pertaining to the sequence lengths. Could you please share the stack trace associated with the error so I can get a better understanding of what happened?

Here it is.
Also, here is the pickle data file I'm retraining on, along with its .csv version from before it was dumped to a pickle.
https://drive.google.com/file/d/1YHFSgQ2JpFL-mx_fhOa5Gwl6pEocBQa-/view?usp=sharing

Epoch:  1/15 Step:   13/3750   0.35% |                    |ETA: 9632.69s loss: 9.458249 accuracy: 70.056496  
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-30-cd5dde7d9057> in <module>
----> 1 address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])

~\anaconda3\lib\site-packages\deepparse\parser\address_parser.py in retrain(self, dataset_container, train_ratio, batch_size, epochs, num_workers, learning_rate, callbacks, seed, logging_path)
    296                          batch_metrics=[accuracy])
    297 
--> 298         train_res = exp.train(train_generator,
    299                               valid_generator=valid_generator,
    300                               epochs=epochs,

~\anaconda3\lib\site-packages\poutyne\framework\experiment.py in train(self, train_generator, valid_generator, **kwargs)
    475             List of dict containing the history of each epoch.
    476         """
--> 477         return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
    478 
    479     def train_dataset(self, train_dataset, valid_dataset=None, **kwargs) -> List[Dict]:

~\anaconda3\lib\site-packages\poutyne\framework\experiment.py in _train(self, training_func, callbacks, lr_schedulers, keep_only_last_best, save_every_epoch, disable_tensorboard, seed, *args, **kwargs)
    616 
    617         try:
--> 618             return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
    619         finally:
    620             if tensorboard_writer is not None:

~\anaconda3\lib\site-packages\poutyne\framework\model.py in fit_generator(self, train_generator, valid_generator, epochs, steps_per_epoch, validation_steps, batches_per_step, initial_epoch, verbose, progress_options, callbacks)
    546             self._fit_generator_n_batches_per_step(epoch_iterator, callback_list, batches_per_step)
    547         else:
--> 548             self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
    549 
    550         return epoch_iterator.epoch_logs

~\anaconda3\lib\site-packages\poutyne\framework\model.py in _fit_generator_one_batch_per_step(self, epoch_iterator, callback_list)
    626             with self._set_training_mode(True):
    627                 for step, (x, y) in train_step_iterator:
--> 628                     step.loss, step.metrics, _ = self._fit_batch(x, y, callback=callback_list, step=step.number)
    629                     step.size = self.get_batch_size(x, y)
    630 

~\anaconda3\lib\site-packages\poutyne\framework\model.py in _fit_batch(self, x, y, callback, step, return_pred)
    649         self.optimizer.zero_grad()
    650 
--> 651         loss_tensor, metrics, pred_y = self._compute_loss_and_metrics(x,
    652                                                                       y,
    653                                                                       return_loss_tensor=True,

~\anaconda3\lib\site-packages\poutyne\framework\model.py in _compute_loss_and_metrics(self, x, y, return_loss_tensor, return_pred)
   1225             loss = float(loss)
   1226         with torch.no_grad():
-> 1227             metrics = self._compute_batch_metrics(pred_y, y)
   1228             for epoch_metric in self.epoch_metrics:
   1229                 epoch_metric(pred_y, y)

~\anaconda3\lib\site-packages\poutyne\framework\model.py in _compute_batch_metrics(self, pred_y, y)
   1233 
   1234     def _compute_batch_metrics(self, pred_y, y):
-> 1235         metrics = [metric(pred_y, y) for metric in self.batch_metrics]
   1236         return self._compute_metric_array(metrics, self.unflatten_batch_metrics_names)
   1237 

~\anaconda3\lib\site-packages\poutyne\framework\model.py in <listcomp>(.0)
   1233 
   1234     def _compute_batch_metrics(self, pred_y, y):
-> 1235         metrics = [metric(pred_y, y) for metric in self.batch_metrics]
   1236         return self._compute_metric_array(metrics, self.unflatten_batch_metrics_names)
   1237 

~\anaconda3\lib\site-packages\deepparse\metrics\accuracy.py in accuracy(pred, ground_truth)
      6     Accuracy per tag.
      7     """
----> 8     return acc(pred.transpose(0, 1).transpose(-1, 1), ground_truth)

~\anaconda3\lib\site-packages\poutyne\framework\metrics\batch_metrics.py in acc(y_pred, y_true, ignore_index, reduction)
     70     weights = (y_true != ignore_index).float()
     71     num_labels = weights.sum()
---> 72     acc_pred = (y_pred == y_true).float() * weights
     73 
     74     if reduction in ['mean', 'sum']:

~\anaconda3\lib\site-packages\torch\tensor.py in wrapped(*args, **kwargs)
     26     def wrapped(*args, **kwargs):
     27         try:
---> 28             return f(*args, **kwargs)
     29         except TypeError:
     30             return NotImplemented

RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1


It seems like, for some data points, the number of tags does not match the number of tokens in the ground-truth address. We split the sequence on the whitespace character, so maybe you can take a look at that.
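
For example, something along these lines on the list of ('address text', [tags]) tuples will surface the offending rows (just a sketch; the path and variable names are illustrative):

import pickle

with open("deepparse_retrain.pickle", "rb") as f:  # your pickled training data
    data = pickle.load(f)

# rows where the number of space-split tokens differs from the number of tags
bad = [(address, tags) for address, tags in data if len(address.split(" ")) != len(tags)]
print(len(bad))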

@davebulaval I have stored my label lists for retraining as the 'target' column in my DataFrame, and I use these lines:

# number of tags per row vs. number of whitespace-split words in the raw address
df['len_target'] = df['target'].apply(lambda x: len(x))
df['len_raw'] = df['raw_address'].apply(lambda x: len(x.split()))
np.sum(df['len_target'] != df['len_raw'])  # count of rows where the two lengths differ

The output is 0. This checks that the number of words in each raw string (raw_address split on whitespace) equals the number of elements in its label list used for retraining.

Can you share your code with me (in private or not)?

I've also tested on my side, and I get the same results. Maybe something buggy happens later on during the vectorizing (we also remove commas, since none appeared in the training data and they lowered the results; we are working on a more robust fix).

Here you go.
https://drive.google.com/file/d/1P7jC-vI335vFTuFzGGJDXzeX4Qv-5rpr/view?usp=sharing
The train data, the data preparation for training, and the training process are all in this zip.

On my side, using the pickled data, I see differences between some addresses and their ground truths.

training_container = PickleDatasetContainer('deepparse_retrain.pickle')
# list every (address, tags) pair whose space-split word count differs from its tag count
[(x, y, len(x.split(" ")), len(y), x.split(" ")) for x, y in training_container.data if len(x.split(" ")) != len(y)]

@davebulaval Ah, I got it. In some cases there are double white spaces, so splitting on the space character gives an extra empty token which I didn't label or handle for training. Or maybe something went wrong when I pickled it. Thanks a lot!
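
To make the difference concrete, here is a tiny standalone example (not my actual data) showing why my .split() check passed while the space-split check above did not, plus the whitespace normalization that fixes it:

addr = "123  Main Street"            # note the double space
print(addr.split())                  # ['123', 'Main', 'Street']      -> 3 tokens, what I labelled
print(addr.split(" "))               # ['123', '', 'Main', 'Street']  -> 4 tokens, the extra empty one
addr = " ".join(addr.split())        # collapse runs of whitespace: "123 Main Street"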

Afterwards, can you send the cleaned addresses in a CSV so we can improve the original dataset (which we want to release soon)?
Also, please include your complete name so you can be added as an author of this part of the dataset.

I'm afraid not. This is the dataset I got from the recent Shopee Code League 2021 competition (an SEA coding competition), and the competition has already concluded. So I'm done with this project and am currently working on another one. However, you can access this page: https://www.kaggle.com/c/scl-2021-ds/code; I saw some people posting their cleaning code there. Hope this helps.