Paper: https://keep-up-sharma.github.io/mind_your_own_kernel/ (not professional but i tried)
Introducing Hybrid convolution and dense layers
Same function but 10x-20x faster and lighter
Convolutional Neural Networks (CNNs) have achieved remarkable success in various computer vision tasks such as image classification, object detection, and segmentation. The Conv2d layer is a fundamental building block of CNNs, which applies a set of fixed filters (kernels) to extract features from the input image. However, storing and computing these kernels can be computationally expensive and memory-intensive, especially for large images or complex architectures. To address this issue, various techniques such as depthwise separable convolutions, dilated convolutions, and group convolutions have been proposed to reduce the number of parameters and computations. However, these methods still require storing a large number of pre-defined kernels.
In this paper, we propose a custom Conv2d class, where kernels are predicted dynamically using a hybrid dense layer for each patch of the input image. This approach saves time and storage as the model doesn't have to remember kernels. The proposed method learns to predict the kernels that best extract features from the input patch, based on the patch's content. Our approach is inspired by recent works on dynamic convolution, which has shown promising results in various tasks such as object detection and segmentation.
Install use the layers
import hybrids
A = hybrids.DynamicConv(2,4)
I have added training code to following notebook: (might need some fixes)
Samples of my 10 mb encoder (trained 3-4 hours) Contains 1 layer (4 kernels (8x8)) stride 8, hidden size=512
Samples of my 479 kb encoder (trained less than 1 hour) Contains 1 layer (4 kernels (8x8)) stride 8, hidden size=128
Impressed?
Samples of my 89 kb encoder (trained less than 20 minutes) Contains 1 layer (4 kernels (8x8)) stride 8, hidden size=8
Not convinced?
Samples of my 1 mb autoencoder (trained less than 50 minutes) (Same latent space as stable diffusion)
from hybrids import encoder
enc = torch.load(encoder.pt)
#from diffusers import AutoencoderKL
# sdvae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
# sdvae = sdvae.to('cuda')
# use decoder from this
# sdvae.decode(enc(input))