Urban Sound Classification Using PyTorch Vision Transformer

In this project, I've implemented the Vision Transformer (ViT) architecture to tackle the task of classifying urban sounds.

My goal is to replicate the ViT model from the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" and adapt it from image recognition to urban sound classification. I've trained and evaluated the model on the UrbanSound8K dataset.
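To make the "16x16 words" idea concrete, here is a minimal sketch of how a ViT-style model cuts a 2D input into fixed-size patches before feeding them to the transformer. It assumes each audio clip has already been turned into a single-channel, image-like grid such as a spectrogram. The function name `split_into_patches`, the 64x64 input size, and the placeholder values are all hypothetical and are not taken from the notebook.

```python
def split_into_patches(image, patch_size=16):
    """Split a 2D grid (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    # Walk the grid in row-major order, one patch_size x patch_size tile at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 64x64 single-channel "spectrogram" filled with placeholder values.
spectrogram = [[float(r * 64 + c) for c in range(64)] for r in range(64)]

patches = split_into_patches(spectrogram, patch_size=16)
num_patches = len(patches)   # (64 / 16) ** 2 patches in total
patch_dim = len(patches[0])  # 16 * 16 values per flattened patch
```

Each flattened patch plays the role of one "word" in the input sequence. In a real model these patches would be linearly projected and given position embeddings, which this sketch leaves out. Because the number of patches grows with the input size, changing the image size changes the sequence length the transformer has to process.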

Libraries Used

To accomplish this project, I've utilized several libraries:

Results

The "results" folder contains a series of CSV files for comparing runs with different image sizes. You can download them to your environment and inspect them in Section 9 of the notebook with the plot_summary() function.
