
πŸŽ₯πŸ“ˆ Vision-Language Attention is All You Need πŸ§ πŸ’‘

Video Understanding, Recommendation, and Generation via Multi-Agent Foundational Multimodal Model

πŸŽ“ Capstone Project: Deep Learning and Multimodal AI for TikTok E-Commerce Videos in Indonesia

This project aims to develop an AI-driven system for recommending and generating product videos for e-commerce, leveraging advances in generative AI and large language models. Focused on the Indonesian market, it addresses the challenge of content curation in digital marketing by automating video recommendations and enhancements.

In this paper, we propose VIDIA (Vision-Language Digital Agent). VIDIA operates as a multi-agent system in which each agent specializes in a different aspect of video analysis and generation, including product recognition, scene understanding, and content enhancement. The system leverages a vision-language pre-training approach, learning a joint representation space that enables seamless cross-modal reasoning and generation.
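
To make the multi-agent design concrete, the sketch below wires three hypothetical agents (product recognition, scene understanding, content enhancement) around a shared per-video context. The class and method names are illustrative placeholders, not the actual VIDIA code.

    from dataclasses import dataclass, field

    @dataclass
    class VideoContext:
        """Shared state passed between agents for a single video."""
        video_path: str
        findings: dict = field(default_factory=dict)

    class Agent:
        """Base agent: reads the shared context and records its findings."""
        name = "agent"
        def run(self, ctx: VideoContext) -> None:
            raise NotImplementedError

    class ProductRecognitionAgent(Agent):
        name = "product_recognition"
        def run(self, ctx: VideoContext) -> None:
            # e.g. run an object detector / CLIP-style model on sampled frames
            ctx.findings[self.name] = {"products": ["<detected product labels>"]}

    class SceneUnderstandingAgent(Agent):
        name = "scene_understanding"
        def run(self, ctx: VideoContext) -> None:
            # e.g. caption key frames with a vision-language model
            ctx.findings[self.name] = {"scenes": ["<scene descriptions>"]}

    class ContentEnhancementAgent(Agent):
        name = "content_enhancement"
        def run(self, ctx: VideoContext) -> None:
            # e.g. draft an edit plan from the earlier agents' findings
            ctx.findings[self.name] = {"edit_plan": "<suggested enhancements>"}

    def run_pipeline(video_path: str) -> dict:
        ctx = VideoContext(video_path)
        for agent in (ProductRecognitionAgent(), SceneUnderstandingAgent(), ContentEnhancementAgent()):
            agent.run(ctx)
        return ctx.findings

    if __name__ == "__main__":
        print(run_pipeline("sample_product_video.mp4"))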

Paper Summary

πŸš€ Getting Started

These instructions will get a copy of the project up and running on your local machine or in Google Colab for development and testing purposes.

πŸ“‹ Prerequisites

  • Python 3.8 or higher
  • Access to substantial computational resources (a GPU is recommended for video processing)
  • A YouTube API key for collecting video data
  • An OpenAI API key for models such as GPT-4
  • A Stability AI API key for Stable Diffusion models used in image and video generation
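
A quick sanity check for these prerequisites can be run before installation; the snippet below only verifies the Python version and GPU visibility, and assumes torch is (or will be) installed.

    import sys

    # Python 3.8 or higher is required
    assert sys.version_info >= (3, 8), "Python 3.8 or higher is required"

    try:
        import torch
        print("CUDA GPU available:", torch.cuda.is_available())
    except ImportError:
        print("torch not installed yet; GPU check skipped")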

πŸ› οΈ Installation

  1. Clone the Repository

    git clone https://github.com/sumankwan/Vision-Language-Attention-is-All-You-Need-public.git
    cd Vision-Language-Attention-is-All-You-Need-public
  2. Set Up a Virtual Environment (optional)

    python -m venv venv
    
    # On Windows
    venv\Scripts\activate
    
    # On Unix or MacOS
    source venv/bin/activate
  3. Install Required Libraries

    pip install -r requirements.txt

    This requirements.txt file should include necessary libraries like torch, transformers, ffmpeg-python, pytube, whisper, and others relevant to the project's technology stack.
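
As one possible starting point, a requirements.txt built from the libraries named above might look like the following (pin versions for reproducibility; note that the Whisper model is published on PyPI as openai-whisper, and the remaining entries are assumptions based on the APIs this README mentions).

    # requirements.txt (illustrative starting point; pin versions as needed)
    torch
    transformers
    ffmpeg-python
    pytube
    openai-whisper
    openai
    stability-sdk
    google-api-python-client
    pandas
    openpyxl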

βš™οΈ Configuration

  1. Load API Keys

    • For data collection, add your YouTube API key to the api_keys.txt file in the project directory
    • For AI-system inference, add your OpenAI and Stability AI API keys to the same file (see the setup sketch after this list)
  2. Set Up Google Colab

    • Open the project in Google Colab.
    • Connect the Colab notebook to your Google Drive.
  3. Configure File Structure

    This project is organized into several directories, each serving a specific function within the Multimodal AI Pipeline. Ensure the following file structure is set up in your Google Drive:

    Project/
    β”œβ”€β”€ Agents/
    β”‚   β”œβ”€β”€ final/        # Stores final versions of processed videos
    β”‚   β”œβ”€β”€ adjusted/     # Contains adjusted videos after post-processing
    β”‚   β”œβ”€β”€ image/        # Used for storing images used or generated in the pipeline
    β”‚   └── video/        # Contains raw and intermediate video files used in processing
    β”œβ”€β”€ Audio/            # Audio files used or generated by the pipeline
    β”œβ”€β”€ DataFrame/        # DataFrames and related data files used for processing and analytics
    β”œβ”€β”€ Example/          # Example scripts and templates for using the pipeline
    β”œβ”€β”€ Frames/           # Individual frames extracted from videos for processing
    β”œβ”€β”€ MVP/              # Minimal Viable Product demonstrations and related files
    β”œβ”€β”€ ObjectDetection/  # Scripts and files related to object detection models
    β”œβ”€β”€ Output/           # Final output files from the pipeline, including videos and logs
    β”œβ”€β”€ Shorts/           # Short video clips for testing or demonstration purposes
    β”œβ”€β”€ Text/             # Text data used or generated, including scripts and metadata
    β”œβ”€β”€ Video/            # Directory for storing larger video files
    └── Multimodal_AI_pipeline/ # Core scripts and modules of the Multimodal AI pipeline
    
  4. Data and AI Inference (optional)

  • Execute the multimodal_AI_pipeline.ipynb notebook to collect data and run the multimodal processing used by the later downstream tasks
  • The data_post_processing.xlsx file is used to organize the output of the AI system's data pipeline
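
The snippet below is a minimal sketch of configuration steps 1–3 on Colab: mounting Google Drive, creating the directory tree, and reading keys from api_keys.txt. The file location and the one-key-per-line NAME=value format are assumptions for illustration, not documented requirements of this repo.

    import os

    # 1. Mount Google Drive (Colab only)
    from google.colab import drive
    drive.mount("/content/drive")

    # 2. Create the expected directory tree under Project/ (path is an assumption)
    project_root = "/content/drive/MyDrive/Project"
    for sub in ["Agents/final", "Agents/adjusted", "Agents/image", "Agents/video",
                "Audio", "DataFrame", "Example", "Frames", "MVP", "ObjectDetection",
                "Output", "Shorts", "Text", "Video", "Multimodal_AI_pipeline"]:
        os.makedirs(os.path.join(project_root, sub), exist_ok=True)

    # 3. Load API keys from api_keys.txt (assumed format: NAME=value per line)
    api_keys = {}
    with open(os.path.join(project_root, "api_keys.txt")) as f:
        for line in f:
            if "=" in line:
                name, value = line.strip().split("=", 1)
                api_keys[name] = value

    os.environ["OPENAI_API_KEY"] = api_keys.get("OPENAI_API_KEY", "")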

▢️ Running the Application

To execute the main application script:

python main.py

main.py is the application's entry point; it orchestrates video data gathering, processing, and recommendation generation based on user input.
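
If you are adapting the project, a minimal entry-point skeleton might look like the sketch below; the function names are hypothetical placeholders standing in for the repo's actual modules.

    import argparse

    def gather_videos(query: str) -> list:
        """Placeholder: collect candidate videos (e.g. via the YouTube API)."""
        return []

    def process_videos(videos: list) -> list:
        """Placeholder: run the multimodal analysis agents over each video."""
        return videos

    def recommend(videos: list, user_input: str) -> list:
        """Placeholder: rank processed videos against the user's request."""
        return videos

    def main() -> None:
        parser = argparse.ArgumentParser(description="VIDIA demo entry point")
        parser.add_argument("--query", default="product review", help="search/query text")
        args = parser.parse_args()

        videos = gather_videos(args.query)
        processed = process_videos(videos)
        print(recommend(processed, args.query))

    if __name__ == "__main__":
        main()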

πŸ–₯️ Using the System

You can interact with the system using the provided Telegram bot:

VideoTikTokShopBot
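
For reference, a bot like this is typically wired up with python-telegram-bot (v20+ async API); the handler below is a hedged sketch of that pattern, not the code behind VideoTikTokShopBot.

    # pip install python-telegram-bot
    from telegram import Update
    from telegram.ext import ApplicationBuilder, CommandHandler, ContextTypes

    async def start(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
        # Greet the user; a real handler would trigger the recommendation pipeline
        await update.message.reply_text("Send me a product query to get video recommendations.")

    def main() -> None:
        app = ApplicationBuilder().token("YOUR_TELEGRAM_BOT_TOKEN").build()
        app.add_handler(CommandHandler("start", start))
        app.run_polling()

    if __name__ == "__main__":
        main()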

πŸ”§ Development Notes

  • Data Collection: The YouTube API is used to gather video content relevant to the Indonesian e-commerce market, focusing on a diverse set of product videos across categories. The multimodal_AI_pipeline.ipynb notebook is the data pipeline, which leverages the multi-agent multimodal AI system for video understanding (a search sketch follows this list).
  • Cost: Handling video data is computationally demanding, particularly when analyzing it with large foundation models.
  • Future Direction: Plans include incorporating multiple smaller models, such as Llama-3, to improve multi-agent collaboration.
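
As an example of the data-collection step, the snippet below searches YouTube with google-api-python-client; the query string and result handling are illustrative, not the repo's actual collection logic.

    # pip install google-api-python-client
    from googleapiclient.discovery import build

    def search_videos(api_key: str, query: str, max_results: int = 10) -> list:
        """Return (video_id, title) pairs for a YouTube search query."""
        youtube = build("youtube", "v3", developerKey=api_key)
        response = youtube.search().list(
            q=query,
            part="snippet",
            type="video",
            maxResults=max_results,
        ).execute()
        return [(item["id"]["videoId"], item["snippet"]["title"])
                for item in response.get("items", [])]

    if __name__ == "__main__":
        for video_id, title in search_videos("YOUR_YOUTUBE_API_KEY", "skincare produk review"):
            print(video_id, title)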

πŸ™ Acknowledgments

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020. Retrieved from https://arxiv.org/abs/2103.00020

Jin, Y., Wu, C.-F., Brooks, D., & Wei, G.-Y. (2023). S3: Increasing GPU Utilization during Generative Inference for Higher Throughput. arXiv preprint arXiv:2306.06000. Retrieved from https://arxiv.org/abs/2306.06000

Zhu, Z., Feng, X., Chen, D., Yuan, J., Qiao, C., & Hua, G. (2024). Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation. arXiv. https://arxiv.org/abs/2403.12042

Shen, W., Li, C., Chen, H., Yan, M., Quan, X., Chen, H., Zhang, J., & Huang, F. (2024). Small LLMs Are Weak Tool Learners: A Multi-LLM Agent. arXiv. https://arxiv.org/pdf/2401.07324

Zuo, Q., Gu, X., Qiu, L., Dong, Y., Zhao, Z., Yuan, W., Peng, R., Zhu, S., Dong, Z., Bo, L., & Huang, Q. (2024). VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model. arXiv. https://doi.org/10.48550/arXiv.2403.12010

Zhang, J., Chen, J., Wang, C., Yu, Z., Qi, T., Liu, C., & Wu, D. (2024). Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing (Version 2). arXiv. Retrieved March 22, 2024, from https://arxiv.org/abs/2403.12042v2

Zi, B., Zhao, S., Qi, X., Wang, J., Shi, Y., Chen, Q., Liang, B., Wong, K.-F., & Zhang, L. (2024). CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility. arXiv. https://arxiv.org/pdf/2403.12035

Romero, D., & Solorio, T. (2024). Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering. arXiv. https://doi.org/10.48550/arXiv.2402.10698

Wang, B., Li, Y., Lv, Z., Xia, H., Xu, Y., & Sodhi, R. (2024). LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing. ACM Conference on Intelligent User Interfaces (ACM IUI) 2024. https://doi.org/10.48550/arXiv.2402.10294

TΓ³th, S., Wilson, S., Tsoukara, A., Moreu, E., Masalovich, A., & Roemheld, L. (2024). End-to-end multi-modal product matching in fashion e-commerce. arXiv. https://arxiv.org/abs/2403.11593

Huang, X., Lian, J., Lei, Y., Yao, J., Lian, D., & Xie, X. (2023). Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations. arXiv. https://arxiv.org/abs/2308.16505

Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., He, L., & Sun, L. (2024). Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. arXiv. https://arxiv.org/abs/2402.17177

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. NeurIPS 2023. arXiv. https://arxiv.org/abs/2304.08485

Lu, Y., Li, X., Pei, Y., Yuan, K., Xie, Q., Qu, Y., Sun, M., Zhou, C., & Chen, Z. (2024). KVQ: Kwai Video Quality Assessment for Short-form Videos. arXiv. https://arxiv.org/abs/2402.07220v2

Wang, X., Zhang, Y., Zohar, O., & Yeung-Levy, S. (2024). VideoAgent: Long-form Video Understanding with Large Language Model as Agent. arXiv. https://arxiv.org/abs/2403.10517

Zhang, Z., Dai, Z., Qin, L., & Wang, W. (2024). EffiVED: Efficient Video Editing via Text-instruction Diffusion Models. arXiv. https://arxiv.org/abs/2403.11568

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., DollΓ‘r, P., & Girshick, R. (2023). Segment Anything: A new task, model, and dataset for image segmentation. arXiv. https://arxiv.org/abs/2304.02643

Yuan, Z., Chen, R., Li, Z., Jia, H., He, L., Wang, C., & Sun, L. (2024). Mora: Enabling Generalist Video Generation via A Multi-Agent Framework. arXiv. https://arxiv.org/abs/2403.13248

Yu, J., Zhao, G., Wang, Y., Wei, Z., Zheng, Y., Zhang, Z., Cai, Z., Xie, G., Zhu, J., & Zhu, W. (2024). Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation (Version 2). arXiv. https://arxiv.org/abs/2403.12425v2

Schumann, R., Zhu, W., Feng, W., Fu, T.-J., Riezler, S., & Wang, W. Y. (2024). VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View. arXiv. https://arxiv.org/abs/2307.06082v2

Liu, Z., Sun, Z., Zang, Y., Li, W., Zhang, P., Dong, X., Xiong, Y., Lin, D., & Wang, J. (2024). RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition. arXiv. https://arxiv.org/abs/2403.13805v1

Huang, Z., Zhang, Z., Zha, Z.-J., Lu, Y., & Guo, B. (2024). RelationVLM: Making Large Vision-Language Models Understand Visual Relations. arXiv. https://arxiv.org/abs/2403.13805

Wang, Z., Xie, E., Li, A., Wang, Z., Liu, X., & Li, Z. (2024). Divide and Conquer: Language Models Can Plan and Self-Correct for Compositional Text-to-Image Generation. arXiv. https://arxiv.org/pdf/2401.15688

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R. L., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). Mistral 7B v0.1: A 7-billion-parameter language model for performance and efficiency. arXiv. https://arxiv.org/abs/2310.06825

Yang, D., Hu, L., Tian, Y., Li, Z., Kelly, C., Yang, B., Yang, C., & Zou, Y. (2024). WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs. arXiv. https://arxiv.org/abs/2403.07944

Wang, Y., Li, K., Li, X., Yu, J., He, Y., Chen, G., Pei, B., Zheng, R., Xu, J., Wang, Z., Shi, Y., Jiang, T., Li, S., Zhang, H., Huang, Y., Qiao, Y., Wang, Y., & Wang, L. (2024, March 22). InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding. arXiv:2403.15377 [cs.CV]. https://arxiv.org/abs/2403.15377

Chan, S. H. (2024). Tutorial on diffusion models for imaging and vision. arXiv:2403.18103 [cs.LG]. https://doi.org/10.48550/arXiv.2403.18103

Li, C., Huang, D., Lu, Z., Xiao, Y., Pei, Q., & Bai, L. (2024). A survey on long video generation: Challenges, methods, and prospects. arXiv:2403.16407 [cs.CV]. https://doi.org/10.48550/arXiv.2403.16407

Shahmohammadi, H., Ghosh, A., & Lensch, H. P. A. (2023). ViPE: Visualise Pretty-much Everything. arXiv preprint arXiv:2310.10543. Retrieved from https://arxiv.org/abs/2310.10543

Lian, J., Lei, Y., Huang, X., Yao, J., Xu, W., & Xie, X. (2024). RecAI: Leveraging Large Language Models for Next-Generation Recommender Systems. arXiv preprint arXiv:2403.06465. Retrieved from https://arxiv.org/abs/2403.06465

Zhou, X., Arnab, A., Buch, S., Yan, S., Myers, A., Xiong, X., ... & Schmid, C. (2024). Streaming Dense Video Captioning. arXiv preprint arXiv:2404.01297. Retrieved from https://arxiv.org/abs/2404.01297

Liu, R., Li, C., Tang, H., Ge, Y., Shan, Y., & Li, G. (2024). ST-LLM: Large Language Models Are Effective Temporal Learners. arXiv preprint arXiv:2404.00308. Retrieved from https://arxiv.org/abs/2404.00308

Wu, B., Chuang, C.-Y., Wang, X., Jia, Y., Krishnakumar, K., Xiao, T., Liang, F., Yu, L., & Vajda, P. (2023). Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis. arXiv preprint arXiv:xxxx.xxxxx.

Wu, J., Li, X., Si, C., Zhou, S., Yang, J., Zhang, J., Li, Y., Chen, K., Tong, Y., Liu, Z., & Loy, C. C. (2024). Towards Language-Driven Video Inpainting via Multimodal Large Language Models. arXiv preprint arXiv:2401.10226. Retrieved from https://arxiv.org/abs/2401.10226

Ding, K., Zhang, H., Yu, Q., Wang, Y., Xiang, S., & Pan, C. (2024). Weak distribution detectors lead to stronger generalizability of vision-language prompt tuning. arXiv:2404.00603 [cs.CV]. https://doi.org/10.48550/arXiv.2404.00603

πŸ‘¨β€πŸ’» Author

By Suhardiman Agung, with guidance from Professor Daniel Lin
