Repro-VAE Scene Detection in Anime

This is a course project for CMU 11-785, titled Repro-VAE Scene Detection in Anime.

Team Members: Dijing Zhang, Siqiao Fu, Yiping Dong, Zhengyang Zou

Abstract

2D animation has become a popular category of film. With the help of digital technology, artists can compose masterpieces in a fairly short amount of time. Beyond the fine scenery, the meaning inside every frame has been widely explored for various reasons. For example, scene classification is one of the most explored applications in animation, and it can facilitate a wide range of tasks such as anime-to-text, cross-anime scene retrieval, and human-centric storyline construction. Some supervised learning methods already tackle scene change detection in animation. However, they all require painful pre-processing to obtain ground-truth scene change labels, which can only be acquired through long and tedious manual work. Our goal in this article is to build an unsupervised model that does the job. The main contribution is eliminating the need to label the whole anime as a pre-processing step.

We use a VAE as the baseline model. Although a VAE may not seem related to scene change detection, it compresses each image into a lower-dimensional but more informative latent space, and the latent code serves as the representation of that image. This compressed representation significantly helps scene change detection, as illustrated in the rest of this article. Beyond that, we explored a novel training method that regularizes the latent space, which we call "reprojection error". In our experiments it improved accuracy in most cases. Our main contributions are: 1) we achieved reasonable scene change accuracy without a labeled training dataset; 2) we explored a VAE with a "reprojection" regularization term, and the resulting repro-VAE appears to perform better at representation learning.
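To make the idea concrete, here is a minimal sketch of how latent codes could be turned into scene-change predictions. It assumes every frame has already been encoded into a latent vector, and it flags a change when the Euclidean distance between frames `span` apart exceeds a threshold. The function name, the distance metric, and the threshold value are illustrative assumptions, not the project's actual inference code.

```python
import math

def detect_scene_changes(latents, threshold, span=1):
    """Return frame indices t flagged as scene changes (hypothetical rule).

    latents: list of equal-length latent vectors, one per frame.
    A change is flagged at t when ||z_t - z_(t-span)|| > threshold.
    """
    changes = []
    for t in range(span, len(latents)):
        dist = math.dist(latents[t], latents[t - span])
        if dist > threshold:
            changes.append(t)
    return changes

# Toy example: three frames close together, then a jump in latent space.
latents = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [3.0, 3.0], [3.1, 3.0]]
print(detect_scene_changes(latents, threshold=1.0))  # [3]
```

In practice the threshold would have to be tuned on a small labeled validation set, since latent distances depend on the trained model.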

Dataset: Your Name (Kimi no Na wa)

We use a 142p copy of the film as our dataset. The source frames are 189×142 pixels and are rescaled to 64×64.
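The rescaling step can be sketched in plain Python as below. Nearest-neighbor sampling and the function name `resize_nearest` are assumptions made for illustration; the README does not say which interpolation the project uses. Note that mapping 189×142 to 64×64 changes the aspect ratio, so frames are squeezed horizontally.

```python
def resize_nearest(image, out_h=64, out_w=64):
    """Resize a 2-D grid of pixels with nearest-neighbor sampling.

    image: list of rows, each row a list of pixel values.
    """
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

# Dummy 189x142 frame (142 rows, 189 columns); each pixel stores its own coordinates.
frame = [[(r, c) for c in range(189)] for r in range(142)]
small = resize_nearest(frame)
print(len(small), len(small[0]))  # 64 64
```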

Definition of Scene Change

We define a scene change as a change of the whole background, for example from forest to city or from home to classroom, or a large change of point of view (POV). A change in the main character or in certain objects alone does not count as a scene change. Here is an example.

Architecture and Model

Our model is based on the β-VAE architecture and adds our own idea, a reprojection loss. We call the resulting VAE "repro-VAE". The implementation refers to [Anand Krishnamoorthy, PyTorch-VAE, (2020), GitHub repository, https://github.com/AntixK/PyTorch-VAE/tree/master/models].
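For readers unfamiliar with β-VAE, the following is a minimal sketch of the standard objective for a diagonal Gaussian encoder: a reconstruction term plus a β-weighted KL divergence to the unit Gaussian prior. The use of mean squared error for reconstruction and all helper names are assumptions for illustration; `beta=4.0` mirrors the `--beta 4` setting in the training command.

```python
import math

def kl_divergence(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single sample."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def mse(x, x_hat):
    """Mean squared reconstruction error (assumed reconstruction term)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Reconstruction error plus beta-weighted KL term."""
    return mse(x, x_hat) + beta * kl_divergence(mu, log_var)

# Toy example: the posterior equals the prior, so only reconstruction error remains.
x, x_hat = [0.0, 1.0], [0.0, 0.5]
mu, log_var = [0.0, 0.0], [0.0, 0.0]
print(beta_vae_loss(x, x_hat, mu, log_var))  # 0.125
```

A larger β pushes the posterior toward the prior, which trades reconstruction quality for a more regular latent space.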

Here is the architecture of our model (refer to Hung-yi Lee’s lecture https://www.youtube.com/watch?v=0CKeqXl5IY0&t=1650s)

Here is a visual explanation of the reprojection loss.
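This README does not state a formula for the reprojection loss. One plausible reading, given purely as an assumption, is that the reconstruction is passed through the encoder again and the squared distance between the original latent code and the re-encoded one is penalized. The sketch below illustrates that reading with toy stand-in functions; `reprojection_error`, `encode`, `decode`, and `lossy_decode` are hypothetical names, not the project's code.

```python
def reprojection_error(x, encode, decode):
    """Hypothetical reprojection term: ||encode(x) - encode(decode(encode(x)))||^2."""
    z = encode(x)
    z_reproj = encode(decode(z))
    return sum((a - b) ** 2 for a, b in zip(z, z_reproj))

# Toy stand-ins for trained networks (assumptions for illustration only).
def encode(x):
    return [0.5 * (x[0] + x[1])]      # 2-D "image" -> 1-D latent

def decode(z):
    return [z[0], z[0]]               # decoder consistent with the encoder

def lossy_decode(z):
    return [0.8 * z[0], 0.8 * z[0]]   # decoder that shrinks its output

x = [1.0, 3.0]
print(reprojection_error(x, encode, decode))        # 0.0
print(reprojection_error(x, encode, lossy_decode))  # approximately 0.16
```

Under this reading, the term is zero when re-encoding a reconstruction lands on the same latent code, and grows as the encoder and decoder drift apart.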

Here is the detailed architecture

Results

This is one example of reconstructed images. Because our goal is to detect scene changes rather than to reconstruct images, the image quality is not very good, but the basic structure of each frame is still recognizable.

These are the evaluations of scene change detection accuracy for our baseline model and our repro-VAE model.

Baseline model

Repro-VAE model

As the results show, our repro-VAE achieves a promising increase in accuracy over the baseline VAE model. Its further potential needs exploration.

To make this clearer, we also visualize the latent space using t-SNE. Here is the illustration.

As we can see, consecutive images within the same scene are clustered together, and different scenes form separate groups.

Here are some visualizations of scene change detection results.

True Positive

False Negative (dynamic changes)

True Negative

False Positive
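The four outcome categories above can be counted from per-frame labels as in the sketch below. It assumes binary labels where 1 marks a scene change and 0 marks no change; the function names and the toy label lists are illustrative and are not taken from the project's evaluation script.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary scene-change labels (1 = change)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }

def accuracy(counts):
    """Fraction of frames classified correctly."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total

# Toy labels for six frames.
y_true = [0, 0, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)                      # {'TP': 1, 'TN': 3, 'FP': 1, 'FN': 1}
print(round(accuracy(counts), 3))  # 0.667
```

Because scene changes are rare compared with unchanged frames, accuracy alone can look high even when many changes are missed, so the per-category counts are worth reporting alongside it.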

Video presentation

YouTube link: https://www.youtube.com/watch?v=8YoGIvvyqGs&list=PLp-0K3kfddPw7yEP_cICv9Glt237KNpSx&index=17

How to train the model and do the inference

Platform: Colab

Opts.py stores all the hyperparameters for reference. Here are the hyperparameters we used to get the best performance:

Do training:

!python main.py --val_folder [Your validation dataset folder] --train_folder [Your training dataset folder] --bs 256  --hidden-dims [32, 64, 128, 256] --max_iters 100 --loss_type H --lr 0.0001 --latent_dim 10 --tau 200 --beta 4 --output_folder [Your result folder]

Do inference:

!python inference.py --beta 4 --latent_dim 10 --bs 256 --span 1 --image_folder [Your validation dataset folder]  --model_folder [Your model_state folder]

Get Acc:

!python pure_test.py --labels [Your labels file] --dictionary [Your result npy folder]
