how to build a dataset
Johnson-yue opened this issue · comments
I want to train with my own dataset; it is similar to a FONT-SVG dataset.
But the original data format is .ttf, so my question is how to build a dataset like yours (.csv and *.pth).
-
Maybe I can export some_char from *.ttf to some_char.svg, but if you know how to batch export, please tell me.
-
How do I convert some_char.svg to *.pth?
My guess:
svg = SVG.load_svg("some_char.svg").normalize().zoom(0.9).canonicalize().simplify_heuristic()
tensor_data = svg.to_tensor()
svg_data = SVGTensor.from_data(tensor_data)
torch.save(tensor_data, "filename.pth") ???
- In svgtensor.ipynb, if I want to optimize from Img2 to Img1, not from SVG.unit_circle() to Img1, how should I do that?
I tried replacing svg_pred with another SVG.load_data(), but I get this error:
AttributeError: 'Point' object has no attribute "control1", "control2", "requires_grad_"
Hey, great question!
I'll write a simple notebook this evening or tomorrow explaining the process step by step.
- You're correct, you need to have individual glyphs in SVG format. There is already a method implemented to convert from FontForge's SplineSet to SVG. So if you don't use FontForge, you will need to convert .ttf to .svg yourself.
- It's almost like that, although you also need to add data augmentation. All tensors are then added to a dictionary and saved in .pkl format.
- Please create a separate issue for this! I'll add your use case to the notebook, although it may take a little longer since I didn't have to use it yet, but it sounds very feasible :)
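For anyone looking for a concrete starting point, here is a minimal, hedged sketch of that dictionary-and-pickle format. The placeholder lists stand in for the real per-path command tensors produced by svg.to_tensor(), and the "tensors"/"fillings" keys follow the conversion script shared later in this thread:

```python
import pickle

# Placeholder standing in for svg.to_tensor() output; in the real pipeline
# each inner list holds one command tensor per path group (plus augmentations).
tensor_data = [[0, 1, 2], [3, 4, 5]]  # stand-in, not a real command tensor

dict_data = {
    "tensors": [[tensor_data]],  # list of augmentations, each a list of groups
    "fillings": [0],             # one filling flag per group
}

with open("glyph.pkl", "wb") as f:
    pickle.dump(dict_data, f, pickle.HIGHEST_PROTOCOL)

# Round-trip check that the saved file loads back intact:
with open("glyph.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded["fillings"])  # [0]
```

This is only a sketch of the container format; the actual tensors, augmentation count, and filling flags come from deepsvg's preprocessing.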
OK, I will create a new issue about point 3.
thank you for your reply
Will close when I've written the custom dataset creation notebook 😉
you are so nice !!!
Hi Alex,
Great work. Congrats!
I want to try the network on my own data, which are raster images.
I dynamically convert them to SVG using potrace (please let me know if there is a more efficient way!), and then use the above code to get the tensor:
svg = SVG.load_svg("some_char.svg").normalize().zoom(0.9).canonicalize().simplify_heuristic()
tensor_data = svg.to_tensor()
svg_data = SVGTensor.from_data(tensor_data)
Here are my questions:
1- Do I need these: ".normalize().zoom(0.9).canonicalize().simplify_heuristic()" ?
2- Is there any preprocessing I have to do ?
3- How can I convert svg_data to the format you pass to the model? They are tensors, but this one is an SVGTensor.
Thanks
It would be amazing to learn how to train from scratch, i.e. on a bunch of folders with SVGs.
Super excited by this project. :D
@alexandre01 you mentioned that you have already added the "custom dataset creation notebook" but I am not sure which one it is. Am I missing something?
Hello, Alex.
Great work and thank you for this library 👍
I have been playing around with it and, inspired by preprocess.py, modified it into a (much needed) simple script to batch-convert SVGs to *.pkl tensors.
To anyone interested:
from concurrent import futures
import os
from argparse import ArgumentParser
import logging
from tqdm import tqdm
import glob
import pickle
import sys

sys.path.append('..')
from deepsvg.svglib.svg import SVG


def convert_svg(svg_file, output_folder):
    filename = os.path.splitext(os.path.basename(svg_file))[0]
    svg = SVG.load_svg(svg_file)
    tensor_data = svg.to_tensor()
    with open(os.path.join(output_folder, f"{filename}.pkl"), "wb") as f:
        dict_data = {
            "tensors": [[tensor_data]],
            "fillings": [0]
        }
        pickle.dump(dict_data, f, pickle.HIGHEST_PROTOCOL)


def main(args):
    with futures.ThreadPoolExecutor(max_workers=args.workers) as executor:
        svg_files = glob.glob(os.path.join(args.input_folder, "*.svg"))
        with tqdm(total=len(svg_files)) as pbar:
            preprocess_requests = [executor.submit(convert_svg, svg_file, args.output_folder) for svg_file in svg_files]
            for _ in futures.as_completed(preprocess_requests):
                pbar.update(1)
    logging.info("SVG files' conversion to tensors complete.")


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    parser = ArgumentParser()
    parser.add_argument("--input_folder")
    parser.add_argument("--output_folder")
    parser.add_argument("--workers", default=4, type=int)
    args = parser.parse_args()
    if not os.path.exists(args.output_folder):
        os.makedirs(args.output_folder)
    main(args)
All the best.
Hi @alexandre01
Thank you for sharing this repo! Very interesting work!
I'm also trying to train deepsvg on a custom dataset, but I'm unsure how the data should be structured.
I've tried to train and got into an indexing issue I don't fully understand:
Traceback (most recent call last):
File "c:\users\george.profenza\.pyenv\pyenv-win\versions\3.7.4-amd64\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\george.profenza\.pyenv\pyenv-win\versions\3.7.4-amd64\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\train.py", line 150, in <module>
train(cfg, model_name, experiment_name, log_dir=args.log_dir, debug=args.debug, resume=args.resume)
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\train.py", line 26, in train
dataset = dataset_load_function(cfg)
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\svgtensor_dataset.py", line 242, in load_dataset
cfg.filter_uni, cfg.filter_platform, cfg.filter_category, cfg.train_ratio)
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\svgtensor_dataset.py", line 57, in __init__
loaded_tensor = self._load_tensor(self.idx_to_id(0))
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\svgtensor_dataset.py", line 111, in idx_to_id
return self.df.iloc[idx].id
File "C:\Users\george.profenza\Downloads\gp\deepsvg-env\lib\site-packages\pandas\core\indexing.py", line 931, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\Users\george.profenza\Downloads\gp\deepsvg-env\lib\site-packages\pandas\core\indexing.py", line 1566, in _getitem_axis
self._validate_integer(key, axis)
File "C:\Users\george.profenza\Downloads\gp\deepsvg-env\lib\site-packages\pandas\core\indexing.py", line 1500, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
(self.idx_to_id(0) seems to be the issue)
I've tried using the preprocess script and noticed it augments SVGs, but it wasn't saving the pickle files.
I've attempted to use the comments above and got it to save .pkl files in different ways:
just using to_tensor():
tensor_data = svg.to_tensor()
with open(os.path.join(output_folder, f"{filename}.pkl"), "wb") as f:
    dict_data = {
        "tensors": [[tensor_data]],
        "fillings": [0]
    }
    pickle.dump(dict_data, f, pickle.HIGHEST_PROTOCOL)
a variation of the above (spotted in the svglib notebook): tensor_data = svg.copy().numericalize().to_tensor()
and also using SVGTensor:
tensor_data = svg.copy().numericalize().to_tensor()
tensor_data = SVGTensor.from_data(tensor_data)
I'm not sure what the correct method of converting the processed svg to pickle is so I can train.
Printing the pandas object from the loaded fonts dataset I do see relevant data:
self.df.iloc.obj:
id binary_fp uni total_len nb_groups len_groups max_len_group
0 5658657305760304754_99 5658657305760304754 99 22 1 [22] 22
1 11280665330421698568_108 11280665330421698568 108 19 1 [19] 19
2 6786671966848343352_97 6786671966848343352 97 27 2 [18, 9] 18
3 17302457245611577159_121 17302457245611577159 121 22 1 [22] 22
5 18110689581214114864_66 18110689581214114864 66 44 3 [27, 9, 8] 27
... ... ... ... ... ... ... ...
99994 13209403418406559934_117 13209403418406559934 117 15 1 [15] 15
99996 9524159807492630733_50 9524159807492630733 50 23 1 [23] 23
99997 17351593260041237331_51 17351593260041237331 51 49 5 [26, 5, 6, 6, 6] 26
99998 14735752356892000110_110 14735752356892000110 110 26 1 [26] 26
99999 3067464349541363522_50 3067464349541363522 50 25 1 [25] 25
However, when loading my converted dataset (either using SVGTensor (larger pickle file) or just to_tensor()
(smaller pickle file)), obj is empty:
self.df.iloc.obj Empty DataFrame
For reference, here's a raw svg:
<?xml version="1.0"?>
<!DOCTYPE svg PUBLIC '-//W3C//DTD SVG 1.0//EN'
'http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd'>
<svg xmlns:xlink="http://www.w3.org/1999/xlink" style="fill-opacity:1; color-rendering:auto; color-interpolation:auto; text-rendering:auto; stroke:black; stroke-linecap:square; stroke-miterlimit:10; shape-rendering:auto; stroke-opacity:1; fill:black; stroke-dasharray:none; font-weight:normal; stroke-width:1; font-family:'Dialog'; font-style:normal; stroke-linejoin:miter; font-size:12px; stroke-dashoffset:0; image-rendering:auto;" width="500" height="500" xmlns="http://www.w3.org/2000/svg"
><!--Generated by the Batik Graphics2D SVG Generator--><defs id="genericDefs"
/><g
><g style="stroke-linecap:round;"
><line y2="324.067" style="fill:none;" x1="236.4454" x2="109.2297" y1="204.986"
/></g
><g style="stroke-linecap:round;"
><line y2="422.2296" style="fill:none;" x1="109.2297" x2="263.5546" y1="324.067"
/><line y2="303.1487" style="fill:none;" x1="263.5546" x2="390.7703" y1="422.2296"
/><line y2="204.986" style="fill:none;" x1="390.7703" x2="236.4454" y1="303.1487"
/><line y2="77.7704" style="fill:none;" x1="109.2297" x2="236.4454" y1="196.8513"
/><line y2="175.9331" style="fill:none;" x1="236.4454" x2="390.7703" y1="77.7704"
/><line y2="295.014" style="fill:none;" x1="390.7703" x2="263.5546" y1="175.9331"
/><line y2="196.8513" style="fill:none;" x1="263.5546" x2="109.2297" y1="295.014"
/><line y2="422.2296" style="fill:none;" x1="390.7703" x2="263.5546" y1="303.1487"
/><line y2="295.014" style="fill:none;" x1="263.5546" x2="263.5546" y1="422.2296"
/><line y2="175.9331" style="fill:none;" x1="263.5546" x2="390.7703" y1="295.014"
/><line y2="303.1487" style="fill:none;" x1="390.7703" x2="390.7703" y1="175.9331"
/><line y2="204.986" style="fill:none;" x1="109.2297" x2="236.4454" y1="324.067"
/><line y2="77.7704" style="fill:none;" x1="236.4454" x2="236.4454" y1="204.986"
/><line y2="196.8513" style="fill:none;" x1="236.4454" x2="109.2297" y1="77.7704"
/><line y2="324.067" style="fill:none;" x1="109.2297" x2="109.2297" y1="196.8513"
/><line y2="175.9331" style="fill:none;" x1="390.7703" x2="390.7703" y1="303.1487"
/><line y2="77.7704" style="fill:none;" x1="390.7703" x2="236.4454" y1="175.9331"
/><line y2="204.986" style="fill:none;" x1="236.4454" x2="236.4454" y1="77.7704"
/><line y2="303.1487" style="fill:none;" x1="236.4454" x2="390.7703" y1="204.986"
/><line y2="196.8513" style="fill:none;" x1="109.2297" x2="109.2297" y1="324.067"
/><line y2="295.014" style="fill:none;" x1="109.2297" x2="263.5546" y1="196.8513"
/><line y2="422.2296" style="fill:none;" x1="263.5546" x2="263.5546" y1="295.014"
/><line y2="324.067" style="fill:none;" x1="263.5546" x2="109.2297" y1="422.2296"
/></g
></g
></svg
>
I've uploaded a few converted pkl as well (1, 2, 3)
Can you please advise on how I might get my own deepsvg dataset trained?
(You've mentioned a training notebook (or Google Colab notebooks): would you happen to still have that around and be able to share it?)
Thank you so much for your time,
George
Update
I've managed to get past the empty data frame issue by hackily commenting out this section in svgtensor_dataset.py:
# df = df[(df.nb_groups <= max_num_groups) & (df.max_len_group <= max_seq_len)]
# if max_total_len is not None:
# df = df[df.total_len <= max_total_len]
However, this landed me right at this error:
Traceback (most recent call last):
File "c:\users\george.profenza\.pyenv\pyenv-win\versions\3.7.4-amd64\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\george.profenza\.pyenv\pyenv-win\versions\3.7.4-amd64\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\train.py", line 150, in <module>
train(cfg, model_name, experiment_name, log_dir=args.log_dir, debug=args.debug, resume=args.resume)
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\train.py", line 51, in train
cfg.set_train_vars(train_vars, dataloader)
File "C:\Users\george.profenza\Downloads\gp\deepsvg\configs\deepsvg\default_icons.py", line 77, in set_train_vars
for idx in random.sample(range(len(dataloader.dataset)), k=10)]
File "C:\Users\george.profenza\Downloads\gp\deepsvg\configs\deepsvg\default_icons.py", line 77, in <listcomp>
for idx in random.sample(range(len(dataloader.dataset)), k=10)]
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\svgtensor_dataset.py", line 177, in get
return self.get_data(t_sep, fillings, model_args=model_args, label=label)
File "C:\Users\george.profenza\Downloads\gp\deepsvg\deepsvg\svgtensor_dataset.py", line 208, in get_data
res[arg] = torch.stack([t.cmds() for t in t_list])
RuntimeError: stack expects each tensor to be equal size, but got [66] at entry 0 and [32] at entry 1
I suspect it's related, but I currently don't fully understand how the data should be structured.
Any hints/tips on how I may train on a custom dataset are highly appreciated.
Thank you so much,
George
Many months ago, I retrained DeepSVG from scratch and developed a new library for preprocessing SVGs. Please ping me (here or on Twitter: @wichmaennchen) if the problems persist. I may be able to invest some time and help out.
I had another shot and spotted the default model parameters that act as filters for the metadata frames.
However, I'm still stuck in the same get_data section.
I'm getting slightly different conditions hitting torch.stack errors, but it's pretty much the same area.
@pwichmann If you have a version of DeepSVG that works, I'd like to give it a go.
(Will DM)
Thank you so much for offering to help.
Hi, I had the exact same problem as you. Do you have a solution now?
RuntimeError: stack expects each tensor to be equal size, but ...
This problem is due to the fact that the number of commands in a path in your SVG file is greater than the limit. The limit is max_seq_len + 2 in deepsvg/model/config.py, where the +2 accounts for the EOS and SOS tokens.
So, the following code is used to select SVGs that meet the requirement:
df = df[(df.nb_groups <= max_num_groups) & (df.max_len_group <= max_seq_len)]
if max_total_len is not None:
    df = df[df.total_len <= max_total_len]
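To make the effect of that filter concrete, here is the same selection logic applied to a plain Python list of made-up metadata rows (the values are illustrative only, not from any real dataset):

```python
# Hypothetical metadata rows mirroring the columns used by the filter.
rows = [
    {"id": "glyph_a", "nb_groups": 1, "max_len_group": 22, "total_len": 22},
    {"id": "glyph_b", "nb_groups": 5, "max_len_group": 66, "total_len": 120},  # path too long
    {"id": "glyph_c", "nb_groups": 2, "max_len_group": 18, "total_len": 27},
]

max_num_groups, max_seq_len, max_total_len = 8, 30, 50

# Same conditions as the pandas filter in svgtensor_dataset.py:
kept = [
    r for r in rows
    if r["nb_groups"] <= max_num_groups
    and r["max_len_group"] <= max_seq_len
    and (max_total_len is None or r["total_len"] <= max_total_len)
]

print([r["id"] for r in kept])  # ['glyph_a', 'glyph_c']
```

A row whose longest path exceeds max_seq_len (glyph_b here) is silently dropped, which is exactly why an over-length custom dataset can end up as an empty data frame.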
Besides, if you want to construct your own dataset, you have to run preprocess.py. In this file, an important operation is drop_z() in svg.canonicalize(), which removes the Z command from SVG files. This is because, in my experiments, SVG images remain the same after removing the Z command.
Agree with the previous statement, but I don't understand the operation to drop Z. The Z command moves the pen back to the beginning of the path so that the path is closed. It's important, and Z is one of the 7 encoded command types, so I cannot understand removing it; doing so makes the command types seem nonsensical.
Well, I agree that the Z command is important in SVG files, and in general we should not drop Z.
But drop_z() seems reasonable because, usually, Z is the last command of a path and M is the first command of a path. This means that if we delete Z, we can still draw the correct SVG, because the M command will move the cursor to the right position. But errors would occur when other commands like C or L follow the Z being deleted.
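As an illustration of that argument, here is a hypothetical string-level drop_z helper (not deepsvg's actual implementation, which works on parsed commands): it drops a Z only when the next token is M or the path ends, which is exactly the case described above as safe.

```python
def drop_z(path_data: str) -> str:
    """Remove Z tokens that are immediately followed by M (or end the path).

    Hypothetical helper for illustration only; deepsvg's drop_z() operates
    on its parsed SVG representation, not on raw path strings.
    """
    tokens = path_data.split()
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Safe to drop: the following M re-positions the pen anyway,
        # or the path simply ends here.
        if tok.upper() == "Z" and (nxt is None or nxt.upper() == "M"):
            continue
        out.append(tok)
    return " ".join(out)

print(drop_z("M 0 0 L 10 0 L 10 10 Z M 20 20 L 30 20 Z"))
# -> M 0 0 L 10 0 L 10 10 M 20 20 L 30 20
```

Note this only shows the token-level reasoning; whether the rendered image stays identical also depends on whether the subpath already returns to its start point, since Z otherwise draws a closing line segment.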
(Re: the batch SVG-to-.pkl conversion script above)
This doesn't generate a meta.csv, am I right? It's necessary when using the SVGDataloader included in the library.