This repository is part of LSE DS105L 2022/23, for a lecture entitled "Merge operations & practical tips for code organisation".
The major focus will be on how to work effectively as a group using GitHub, based on the feedback I received from Shuyu and general interactions with students over Slack/Office Hours.
I could have taken a passive approach and just demonstrated things to you, but I would rather turn this into a workshop where you can learn by practising.
Here is how it's going to work:
Part ONE
Setup
- I will create a repository from the jonjoncardoso/data-science-workflow template and I will edit the README.md to remove the template-related text.
- I will add a Jupyter Notebook with some web scraping code that does not yet make good use of pandas in the way we have been learning in this course.
Create an issue
- I will create a GitHub Issue with a feature request to optimise the code.
- Anyone in the audience will be welcome to comment on this GitHub issue with suggestions for code optimisation.
- Once we have found a solution that we're happy with, we will be ready to close the issue. But I won't close it straightaway!
Branching
Instead of modifying the code directly in my notebook, I will demonstrate how groups can work in parallel on GitHub.
- I will open a separate branch dedicated to that issue, make my changes there and `git push` them.
- Then, I will open a Pull Request and ask some of you to validate my changes.
- Once we have your approval, I will `git merge` the changes to `main`.
- We will look at the `git` tree.
- I will tell you about a common practice of using a `develop` vs a `main` branch.
This whole process is a more professional set of practices for using Git and it is commonly known as the Gitflow workflow.
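The branching steps above can be sketched on the command line. This is only a rough illustration under stated assumptions: a local bare repository stands in for GitHub, the branch name `optimise-scraping` and the file `scraper.txt` are hypothetical, and the Pull Request review itself happens on the GitHub website rather than in the terminal.

```shell
set -e

# Throwaway sandbox: a local bare repository stands in for GitHub (an assumption).
sandbox="$(mktemp -d)"
git init --quiet --bare "$sandbox/remote.git"
git init --quiet "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/main        # name the initial branch main
git config user.name "Demo User"             # placeholder identity
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"

# A first commit on main, standing in for the existing scraping code.
echo "slow scraping code" > scraper.txt
git add scraper.txt
git commit --quiet -m "Add scraping code"
git push --quiet -u origin main

# Open a branch dedicated to the issue, make the changes there and push.
git checkout --quiet -b optimise-scraping
echo "pandas-based scraping code" > scraper.txt
git commit --quiet -am "Optimise scraping code with pandas"
git push --quiet -u origin optimise-scraping

# On GitHub, the Pull Request and its review would happen at this point.

# Once approved, merge the branch into main and publish the result.
git checkout --quiet main
git merge --quiet --no-ff -m "Merge branch 'optimise-scraping'" optimise-scraping
git push --quiet origin main

# Look at the git tree.
git --no-pager log --oneline --graph --all
```

The `--no-ff` flag is a choice made for this sketch: it forces a merge commit so that the branch stays visible in the tree, which is roughly what merging a Pull Request on GitHub produces by default.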
Part TWO
Now I will move the relevant code to a Python script and invoke it from the Jupyter notebook. I will explain why and when it is good to do so. Then, I will open a new issue with an exercise on data pre-processing. Everyone will now try to work out a solution for the exercise using Gitflow!
- Branch from `develop` and give it a meaningful name.
- Push your branch to GitHub.
- Now work on your changes, commit and push them as you like.
- Once ready, open a pull request from your branch to `develop` and tag me (@jonjoncardoso) as a reviewer.
- I will review a few and add feedback notes on the spot.
- Hopefully, some of the solutions will be merged!
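As a rough sketch, the exercise steps above might look like this in the terminal. Several things here are assumptions made for illustration: a local bare repository plays the role of GitHub, `develop` is created from scratch rather than from `main`, and the branch name `feature/clean-missing-values` and the file `preprocessing.py` are hypothetical. Opening the pull request and tagging a reviewer are done on the GitHub website.

```shell
set -e

# Throwaway sandbox: a local bare repository stands in for GitHub (an assumption).
sandbox="$(mktemp -d)"
git init --quiet --bare "$sandbox/remote.git"
git init --quiet "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/develop     # start directly on develop for brevity
git config user.name "Demo User"             # placeholder identity
git config user.email "demo@example.com"
git remote add origin "$sandbox/remote.git"
git commit --quiet --allow-empty -m "Start develop"
git push --quiet -u origin develop

# 1. Branch from develop and give the branch a meaningful name.
git checkout --quiet -b feature/clean-missing-values develop

# 2. Push the branch to the remote straight away.
git push --quiet -u origin feature/clean-missing-values

# 3. Work on the changes, then commit and push as often as you like.
echo "# pre-processing steps go here" > preprocessing.py
git add preprocessing.py
git commit --quiet -m "Add pre-processing script skeleton"
git push --quiet

# 4. The pull request into develop is opened on GitHub.
#    These are the commits it would contain:
git --no-pager log --oneline develop..feature/clean-missing-values
```

Pushing the branch before doing any work is optional, but it lets the rest of the group see early on that someone has picked up the exercise.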
Part THREE (time allowing)
- I will demonstrate the use of GitHub Projects.
- I will show you how I use GitHub milestones and how I set deadlines there.
- I will show you how to create your own Python package and install it with `pip`.
Dev Setup
- Install Python 3.8 or higher on your computer.
- Create a new conda environment:
conda create -y -n=venv-ds105 python=3.10.8
- Activate the environment and make sure you have `pip` installed inside that environment:
# the exact `activate` command will vary depending on your OS
conda activate venv-ds105
Remember to activate this particular conda environment whenever you reopen VSCode/the terminal.
- Install the required libraries:
pip install -r requirements.txt
Now, whenever you open a Jupyter Notebook, you should see the venv-ds105 kernel available.
- Dr. Jon Cardoso-Silva is DS105L's course convenor and the creator of this exercise!