kent0304 / text2video

Text to Video Generation Problem

Video Generation Based on Short Text Description (2019)

Over a year later, I have decided to add a README to the repository, since some people find it useful even without a description. I hope this step will make the results of my work more accessible to those who are interested in the problem and stumble upon the repository while browsing the topic on GitHub.

Example of Generated Video

Unfortunately, I have not saved the videos generated by the network, since all the results remained on the work laptop that I handed over at the end of the internship. The only thing left is a recording that I made with my cellphone (sorry if this makes your eyes bleed).

What is in the GIF? There are 5 blocks of images stacked horizontally. Each block contains 4 objects, selected from the 20bn-something-something-v2 dataset and belonging to the same category, "Pushing [something] from left to right" 1 (~1000 samples). They are a book (top-left window), a box (top-right window), a mug (bottom-left window), and a marker (bottom-right window), each pushed along a surface by hand. The numbers of occurrences in the data subset for the corresponding objects are 57, 43, 9, and 55.

The generated videos are diverse (thanks to the zero-gradient penalty) and of about the same quality as the videos from the training data. No tests were conducted on the validation data.


1 Yep, exactly "from left to right" and not the other way around, as you can read it on the GIF (that is a typo). However, for validation purposes it would be useful to make new labels with the reversed direction of movement, or with new (but "similar", e.g., in embedding space) objects from the unchanged category.
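For reference, a similar subset can be rebuilt from the dataset's annotation files. A rough sketch, assuming the standard something-something-v2 JSON layout with "template" and "placeholders" fields (the filenames and constants below are illustrative and not taken from data_prep2.py):

    import json

    # Illustrative filenames and constants; the repo's own preprocessing lives in data_prep2.py.
    ANNOTATIONS = "something-something-v2-train.json"
    TEMPLATE = "Pushing [something] from left to right"
    OBJECTS = {"book", "box", "mug", "marker"}

    def normalize(template):
        # Some annotation releases write the placeholder as "[something]", others as "something";
        # compare with the brackets stripped to be safe.
        return template.replace("[", "").replace("]", "").lower()

    with open(ANNOTATIONS) as f:
        annotations = json.load(f)

    category = [e for e in annotations if normalize(e["template"]) == normalize(TEMPLATE)]
    subset = [e for e in category
              if any(obj in p.lower() for p in e.get("placeholders", []) for obj in OBJECTS)]

    print(len(category), "clips in the category;", len(subset), "feature one of the four objects")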
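The zero-gradient penalty mentioned above is a zero-centered gradient penalty on the discriminator, in the spirit of the R1 regularizer. The exact variant lives in the training notebooks; a minimal PyTorch sketch of the idea, with hypothetical names, is:

    import torch

    def zero_centered_gradient_penalty(discriminator, real_videos, text_emb, gamma=10.0):
        """R1-style penalty: push the discriminator's gradient w.r.t. real inputs towards zero.

        `discriminator`, `real_videos` and `text_emb` are hypothetical names; the actual
        discriminators are assembled from blocks.py / visual_encoders.py.
        """
        real_videos = real_videos.detach().requires_grad_(True)
        scores = discriminator(real_videos, text_emb)
        grad, = torch.autograd.grad(outputs=scores.sum(),
                                    inputs=real_videos,
                                    create_graph=True)
        return 0.5 * gamma * grad.pow(2).flatten(start_dim=1).sum(dim=1).mean()

The penalty is added to the discriminator loss on real batches; since it is centered at zero rather than at one (as in WGAN-GP), it does not force a fixed gradient norm, which is presumably what helps with the diversity mentioned above.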

Navigating Through SRC Files

  • data_prep2.py: video and text processing (based on text_processing.py)
  • blocks.py: building blocks used in models.py
  • visual_encoders.py: advanced building blocks for the image and video discriminators
  • process3-5.ipynb: pipeline for the training process on multiple GPUs (3-5 is a hardcoded range of GPU ids; see the sketch after this list)
  • pipeline.ipynb: previously served the same purpose as pipeline.py, but the hardcoded range was 0-2; now it is an unfinished implementation of mixed batches
  • legacy (= obsolete): early attempts and ideas
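The hardcoded GPU ranges (3-5 and 0-2) presumably amount to pinning device ids when wrapping the models; a rough sketch of what that looks like in PyTorch (the model here is a placeholder, not the actual generator from models.py):

    import torch
    import torch.nn as nn

    # Placeholder model; the actual generator/discriminators are defined in models.py.
    model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

    device_ids = [3, 4, 5]                    # the range hardcoded in process3-5.ipynb
    device = torch.device(f"cuda:{device_ids[0]}")

    # Requires at least 6 visible GPUs (ids 0-5).
    model = nn.DataParallel(model.to(device), device_ids=device_ids)
    batch = torch.randn(8, 128, device=device)
    out = model(batch)                        # the batch is split across GPUs 3, 4 and 5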

There is also a collection of references to articles relevant (as of 2019) to the text-to-video generation problem.

Dependencies

A Dockerfile would be helpful here...
Or I should have written this section earlier.

  • torch 1.2.0
  • nvidia/cuda 10.1
  • at least one GPU available (a quick check is sketched below)
  • some other prerequisites?
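In the absence of a Dockerfile, here is a quick Python check of the prerequisites listed above (the version comments reflect the list, nothing more):

    import torch

    print("torch:", torch.__version__)        # developed against 1.2.0
    print("CUDA build:", torch.version.cuda)  # developed against CUDA 10.1
    assert torch.cuda.is_available(), "at least one GPU is required"
    print("visible GPUs:", torch.cuda.device_count())
    # process3-5.ipynb additionally assumes GPU ids 3-5 exist, i.e. at least 6 visible devices.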
