Persian-Visual-Question-Answering

Visual Question Answering in Persian Based on deep learning techniques

Abstract: Automatically answering questions about the content of images, known as visual question answering (VQA), is now widely used. A VQA system takes an image and a text question about that image as input and must predict the correct answer. The main goal of such systems is high answer accuracy. Several factors play an important role in reaching it, including the choice of neural networks for processing the inputs and the choice of dataset. Different attention mechanisms can also be used in the model to improve overall performance. Few studies have been conducted on visual question answering in Persian, so in this paper we introduce a Persian visual question answering system. The proposed model processes images with a convolutional neural network based on the ResNeXt architecture, which is used here for the first time in a visual question answering system. The input text is processed with a recurrent neural network of the bidirectional long short-term memory (BiLSTM) type. The model also employs two types of attention mechanisms. The results show that the answer accuracy of the proposed model is the highest reported among existing Persian systems.

Keywords: visual question answering system, convolutional neural network, recurrent neural network, attention mechanism

Authors: Amir Shokri, Alireza Gholamnia

{
  author      = {Amir Shokri and Alireza Gholamnia},
  email       = {amirsh.nll@gmail.com, gholamniareza@gmail.com},
  title       = {Visual Question Answering in Persian Based on deep learning techniques},
  conference  = {18th Computer Science and Engineering Conference and Information Technology (CECCONF18)},
  year        = {2023},
  url         = {https://www.en.symposia.ir/CECCONF18},
}
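
The abstract mentions attention mechanisms but does not say which two types the model uses. As a rough illustration only, the sketch below shows one common form, question-guided soft attention over image regions, in plain Python. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `softmax` and `attend`, the scaled dot-product scoring, and the toy vectors standing in for a BiLSTM question encoding and ResNeXt region features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec, region_feats):
    """Question-guided soft attention over image regions (illustrative).

    question_vec: list of floats, standing in for a pooled BiLSTM encoding.
    region_feats: list of equal-length float lists, standing in for
                  per-region CNN features (e.g. ResNeXt feature-map cells).
    Returns (weights, attended), where weights sum to 1 and attended is
    the weighted average of the region features.
    """
    dim = len(question_vec)
    # Scaled dot-product score between the question and each region
    # (an assumed scoring function, not the paper's).
    scores = [
        sum(q * r for q, r in zip(question_vec, region)) / math.sqrt(dim)
        for region in region_feats
    ]
    weights = softmax(scores)
    # Weighted sum of region features, one output value per dimension.
    attended = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, attended


# Toy inputs: a 2-dimensional question vector and three image regions.
question = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, attended = attend(question, regions)
print(weights)
print(attended)
```

In this toy example the first region points in the same direction as the question vector, so it receives the largest weight. In a full model the attended vector would typically be combined with the question encoding before answer classification.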

پاسخ‌دهی خودکار به پرسش‌های مربوط به محتوای تصاویر به زبان فارسی با استفاده از تکنیک‌های مبتنی بر یادگیری عمیق

چکیده: امروزه پاسخ‌دهی خودکار به پرسش‌های مربوط به محتوای تصاویر (سیستم پرسش و پاسخ تصویری) کاربرد فراوانی دارد. در سیستم‌های پرسش و پاسخ تصویری، یک تصویر و یک سوال متنی در مورد تصویر به عنوان ورودی در نظر گرفته می‌شود و این سیستم باید پاسخ صحیح به پرسش مطرح شده را پیش‌بینی کند. هدف اصلی در این سیستم‌ها بالا بودن دقت صحت پاسخ پیش‌بینی‌شده است. برای این منظور عوامل مختلفی از جمله انتخاب شبکه‌های عصبی مناسب جهت پردازش ورودی‌ها و انتخاب مجموعه‌داده مناسب بسیار مهم است. همچنین استفاده از انواع مختلف سازوکار توجه در مدل می‌تواند باعث بهبود عملکرد کلی سیستم پرسش و پاسخ تصویری شود. تا به امروز پژوهش‌های اندکی در مورد سیستم‌های پرسش و پاسخ تصویری به زبان فارسی انجام شده است. از همین رو در این مقاله به معرفی یک سیستم پرسش و پاسخ تصویری به زبان فارسی پرداختیم. در مدل پیشنهادی، ما از شبکه عصبی کانولوشنی با معماری ResNext جهت پردازش تصویر استفاده کردیم که برای اولین بار در سیستم پرسش و پاسخ تصویری استفاده شده است. برای پردازش متن ورودی نیز از شبکه عصبی بازگشتی از نوع حافظه کوتاه مدت طولانی دوسویه استفاده کردیم. همچنین از دو نوع سازوکار توجه در مدل پیشنهادی استفاده شده است. نتیجه حاصل شده نشان می‌دهد که دقت صحت پاسخ پیش‌بینی شده در مدل پیشنهادی این مقاله، بالاترین مقدار بدست آمده نسبت به نمونه های موجود به زبان فارسی است.

واژگان كليدي: سیستم پرسش و پاسخ تصویری، شبکه عصبی کانولوشنی، شبکه عصبی بازگشتی، سازوکار توجه

نویسندگان: امیر شکری - علیرضا غلام نیا
