This is the supervisor repo of the "Data Science Laboratory Course" at KIT in 2022. Students worked on two subtasks:
- the Data Mining Cup 2022
- a research problem from the field of SAT solving
The repo provides files for preparing the datasets, basic exploration, course-internal splitting, scoring, and demo submissions.
Additionally, Surveys/ contains PDF exports of the questionnaires used to evaluate the students' satisfaction (one survey at the start of the course, one after the first task, one after the second task).
We use Python version 3.8.
We recommend setting up a virtual environment to install the dependencies, e.g., with virtualenv:
python -m virtualenv -p <path/to/right/python/executable> <path/to/env/destination>
or with conda:
conda create --name ds-lab-2022 python=3.8
Next, activate the environment with either
- conda activate ds-lab-2022 (conda)
- source <path/to/env/destination>/bin/activate (virtualenv, Linux)
- <path\to\env\destination>\Scripts\activate (virtualenv, Windows)
Install the dependencies with
python -m pip install -r requirements.txt
If you make changes to the environment and you want to persist them, run
python -m pip freeze > requirements.txt
To make this environment available for notebooks, run
ipython kernel install --user --name=ds-lab-2022-kernel
To actually launch Jupyter Notebook, run
jupyter notebook
The first task of the course is identical to the task of the Data Mining Cup 2022.
We only add a course-internal splitting and scoring to compare the students' solutions.
Download the DMC task from the website.
Place the four CSVs in a folder called data/ inside the folder Task_1_DMC_2022/.
- The notebook Exploration.ipynb contains basic exploration (mainly statistics) of the four CSVs.
- split.py creates a temporal train-test split. It takes all orders up to a certain day as training data and the following four weeks for testing. It creates corresponding files with (training) orders, a submission template, and the solution.
- predict_ensembles.py creates ensemble solutions from existing solution files. It uses statistical aggregates (like min, mode, etc.) and stacking.
- predict_no_replenishment.py creates a baseline solution for the train-test split, constantly predicting no replenishment.
- predict_interpolated_orders.py creates a more sophisticated prediction (which still ignores item features and uses no prediction model): For items purchased on at least two days by the particular customer in the training period, it computes how long the ordered amount lasted on average, and uses this information to estimate the next replenishment date based on the amount ordered last. If this date falls after the test period, or the user did not order the item on at least two dates (which happens in most cases), it predicts no replenishment.
- score.py scores submissions of students for the course-internal train-test split. It also checks the validity of the submissions.
- check_submission_identity.py checks whether identically-named submission files have the same content (= checks reproducibility).
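The heuristic behind predict_interpolated_orders.py can be illustrated with a minimal sketch. This is not the script itself but one possible reading of the description above: the function name, the tuple layout of the orders, and the rounding to whole days are assumptions, as is treating a predicted date before the test period as "no replenishment".

```python
from collections import defaultdict
from datetime import date, timedelta


def predict_replenishment(orders, test_start, test_end):
    """Estimate the next replenishment date per (user, item) pair.

    orders: iterable of (user, item, order_date, quantity) tuples from the
    training period (hypothetical layout; quantities are assumed positive).
    Returns a dict mapping (user, item) to a date in the test period,
    or to None for "no replenishment".
    """
    # Sum quantities per pair and day, so several orders on one day count once.
    per_day = defaultdict(lambda: defaultdict(int))
    for user, item, day, quantity in orders:
        per_day[(user, item)][day] += quantity

    predictions = {}
    for key, day_to_quantity in per_day.items():
        days = sorted(day_to_quantity)
        if len(days) < 2:
            # Ordered on fewer than two days: nothing to interpolate.
            predictions[key] = None
            continue
        # Days that one ordered unit lasted, per pair of consecutive order days.
        days_per_unit = [
            (later - earlier).days / day_to_quantity[earlier]
            for earlier, later in zip(days, days[1:])
        ]
        average = sum(days_per_unit) / len(days_per_unit)
        last_day = days[-1]
        lasting_days = round(average * day_to_quantity[last_day])
        next_day = last_day + timedelta(days=lasting_days)
        # Assumption: dates outside the test period count as no replenishment.
        in_test = test_start <= next_day <= test_end
        predictions[key] = next_day if in_test else None
    return predictions
```

Calling the function with the training orders and the first and last day of the four-week test period yields one entry per (user, item) pair seen in training.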
The second task of the course works with features of SAT instances from the Global Benchmark Database (GBD).
We have two prediction targets, which are assigned to different teams of students:
- Is the instance satisfiable or not (column result in database meta)?
- To which family does the instance belong (column family in database meta)?
Besides exploring the data, students should use classification as well as clustering approaches.
prepare_data.py pre-processes the dataset:
- download databases with meta data and instance features from GBD
- merge databases
- filter instances:
  - known satisfiability result
  - at least 10 members in family (which is still quite few)
  - no NAs in instance features
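The three filter criteria above can be sketched in plain Python. This is only an illustration, not the code of prepare_data.py: the function name, the representation of instances as dicts, the labels "sat" and "unsat" for known results, the use of None for missing values, and the order in which the filters are applied are all assumptions.

```python
from collections import Counter


def filter_instances(rows, feature_names, min_family_size=10):
    """Apply the three filters to a list of dicts (one dict per instance)."""
    # 1. Keep instances with a known satisfiability result
    #    (assumed labels: "sat" and "unsat").
    known = [row for row in rows if row["result"] in ("sat", "unsat")]
    # 2. Keep families with enough members, counted after the first filter.
    family_sizes = Counter(row["family"] for row in known)
    large = [
        row for row in known
        if family_sizes[row["family"]] >= min_family_size
    ]
    # 3. Drop instances with a missing value (assumed to be None)
    #    in any instance feature.
    return [
        row for row in large
        if all(row[name] is not None for name in feature_names)
    ]
```

Note that the order matters for the family sizes: counting members before or after the other filters can change which families reach the threshold.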
- split.py creates a stratified holdout split for both targets.
- cluster_kmeans.py creates a clustering-based solution for predictions (which, unsurprisingly, is bad).
- predict_majority.py creates a baseline solution that constantly predicts the majority class.
- predict_tree.py creates a simple solution with the help of a decision tree.
- score.py scores submissions of students for the course-internal holdout split. It works for both targets and for classification as well as clustering.
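To illustrate the ideas of a stratified holdout split and a majority-class baseline, here is a minimal sketch using only the standard library. It is not the code of split.py or predict_majority.py: the function names, the test fraction of 20 %, and the seed are assumptions chosen for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=25):
    """Return (train_indices, test_indices), split separately per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Each class contributes (roughly) the same fraction to the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def majority_class(train_labels):
    """Return the most frequent label, used as a constant prediction."""
    return Counter(train_labels).most_common(1)[0][0]
```

The baseline then predicts majority_class(...) of the training labels for every instance in the holdout set, which gives a lower bound that any real model should beat.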