STACK Student Response Analysis for Sage Foundation Ethical AI Hackathon

An ML algorithm for in-depth analysis of students' responses to STACK questions, built for the Ethical AI Hackathon promoted by Sage Foundation.

STACK is the world-leading open-source online assessment system for mathematics and STEM. It is available for Moodle, ILIAS and as an integration through LTI.

The algorithms used were Correspondence Analysis (CA), to discover lexical similarity between students' incorrect answers, and K-means, to cluster those answers.

Team Introduction & Understanding the Problem
Data Cleaning
Python Scripts & Visualisation
Machine Learning Analysis Summary
Presentation
Future Improvements
Team & Researchers

1. Team Introduction & Understanding the Problem

Review of the sample data

Link to Sample Data

Hackathon Challenge

Our challenge in this hackathon is to develop a machine learning algorithm to analyze students' responses to STACK questions. The aim is to classify correct vs. incorrect responses, further delve into the types of incorrect responses, group similar incorrect responses, and identify any outlier responses.

Aim

To devise an algorithm that effectively provides an in-depth analysis of students' answers to STACK questions.

Specific Objectives

Classification of Correct vs. Incorrect Responses
Multilevel Classification of Incorrect Responses (Predicted vs. unpredicted responses using PRT paths)
Cluster Analysis - Grouping Similar incorrect responses
Anomaly Detection Based on Question Text

2. Data Cleaning

For the purposes of our analysis, only the finished attempts are considered.
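As a rough illustration of this filtering step (not the project's actual cleaning script), the sketch below keeps only finished attempts with pandas. The column names `attempt_id`, `state` and `response`, the value `finished`, and the sample rows are all assumptions made for the example; the real export may use different names.

```python
import pandas as pd

# Made-up rows; the column names and state values are assumptions for illustration.
attempts = pd.DataFrame(
    {
        "attempt_id": [1, 2, 3, 4],
        "state": ["finished", "inprogress", "Finished ", "abandoned"],
        "response": ["2*x+1", "x^2", "2*x-1", ""],
    }
)


def keep_finished(df, state_col="state", finished_value="finished"):
    """Return only the rows whose state equals `finished_value`.

    Whitespace and letter case are normalised before comparing, so
    "Finished " is treated the same as "finished".
    """
    mask = df[state_col].str.strip().str.lower() == finished_value
    return df.loc[mask].reset_index(drop=True)


cleaned = keep_finished(attempts)
print(cleaned)
```

Normalising the state column before comparing is a defensive choice for hand-edited or merged exports; it can be dropped if the column is known to be clean.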

Link to Data Cleaning Script

3. Python Scripts & Visualisation

Each objective was approached with a dedicated Python script, followed by visualization to represent the analysis results.

Writing Python Scripts

Script for Objective 1-2: Link to the Code
Script for Objective 3-4: Link to the Code

4. Machine Learning Analysis Summary

Contingency tables: for each type of question, a contingency table of students' answers was built, using the characters present in each response as the vocabulary.
Correspondence Analysis (CA): for each type of question, a two-dimensional CA was performed on predicted and unpredicted incorrect answers in order to analyze lexical (dis)similarities between them.
K-means: used to cluster the answers and reveal common errors for each type of question, taking the results of each CA as input.
Data Saving and Retrieval: save analyzed data for future use or further analysis.
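
The first three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the project's code: the answers are made up, the CA is computed directly with a NumPy SVD of the standardized residuals instead of the Prince library listed in the references, and the choice of three clusters is arbitrary. The helper names `char_contingency` and `correspondence_analysis` are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

# Made-up incorrect answers for one question type (illustrative only).
answers = ["2*x+1", "2*x-1", "2x+1", "x^2+1", "x^2-1", "sin(x)", "cos(x)", "sin(2*x)"]


def char_contingency(responses):
    """Rows are responses, columns are characters, cells are character counts."""
    vocab = sorted(set("".join(responses)))
    counts = [Counter(r) for r in responses]
    table = np.array([[c[ch] for ch in vocab] for c in counts], dtype=float)
    return table, vocab


def correspondence_analysis(table, n_components=2):
    """Row principal coordinates from the SVD of the standardized residuals."""
    P = table / table.sum()            # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_components] * sigma[:n_components]) / np.sqrt(r)[:, None]


table, vocab = char_contingency(answers)
coords = correspondence_analysis(table, n_components=2)

# Cluster the answers in CA space; n_clusters=3 is an arbitrary choice here.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

for answer, label in zip(answers, labels):
    print(label, answer)
```

Because the vocabulary is built from the responses themselves, every row and column of the table has a non-zero total, which the division by the masses relies on. Passing `n_components=3` to the hypothetical `correspondence_analysis` helper would give three-dimensional coordinates.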

5. Presentation

The findings, algorithm, and insights were compiled and documented for presentation to the Hackathon judges.

Link to PPT Presentation

6. Future Improvements

After individual testing, all code blocks should be integrated into a single program.
Increase the number of dimensions used for Correspondence Analysis (3D).
Add mathematical functions and symbols to the vocabulary used to build the contingency tables.
Choose an effective number of clusters based on the better view of the data given by 3D CA.
Create an API that serves the clustering results as input to the STACK system.
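
One possible way to approach the cluster-count item above, offered as a sketch and not as a decision the team has made, is to compare silhouette scores across candidate values of k. The three-dimensional points below are synthetic stand-ins for 3D CA coordinates, and `pick_n_clusters` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for 3D CA coordinates: three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [-10.0, 10.0, 0.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 3)) for c in centres])


def pick_n_clusters(coords, candidates=range(2, 7)):
    """Fit K-means for each candidate k and return the k with the best silhouette."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
        scores[k] = silhouette_score(coords, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


best_k, scores = pick_n_clusters(points)
print("best k:", best_k)
```

The silhouette score is only one heuristic; visual inspection of the 3D CA plot, as the list suggests, may still be needed when clusters overlap.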

7. Team & Researchers

Below are the contributors to this project:

Team Members

Umang Murawat (team leader): LinkedIn
Gaurav Khetwal: LinkedIn
Navjot Singh: LinkedIn
Davinder Singh
Jivan Goyal

Lead Researchers

Valeria Insigna: LinkedIn
Zuma Zevick: LinkedIn
George Osang

References

Google Colab: https://colab.research.google.com
Prince for Correspondence Analysis: https://pypi.org/project/prince/
Plotly for Interactive Plots: https://plotly.com/python/
KMeans Clustering: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

valinsogna / EthicalAI-STACKAnalysis