As one of the most popular, versatile, and beginner-friendly programming languages, Python can be used for a variety of tasks, from gathering data to publishing websites.
This 5-part workshop series introduces participants to the Python programming language for working with text-based data. Participants will gain Python skills by gathering, cleaning, and exploring data about the current anti-trans legislation that is sweeping the United States. At the end of the series, participants will use the datasets that they create to train a small AI model to generate text.
The first workshop begins with core concepts in programming to understand digital forms of data and basic manipulations. The second and third workshops move to data gathering and processing with web scraping, APIs, and text cleaning methods. Participants will then spend the fourth and fifth workshops exploring their datasets with text analysis and deep learning tools. See a more detailed description of each workshop below.
The workshop website is built using Jupyter Book and GitHub Pages. To make changes to the workshop, please read the technical specifications section below.
Basic introduction to core concepts in Python programming. Grounds instruction in critical awareness of data and what happens to data at various levels of transformation and abstraction.
Teaches programmatic methods for extracting data from webpages using web scraping and APIs within an ethical approach. Advances core concepts of looping and conditional statements from the introductory session and introduces object-oriented programming and working with code libraries. Participants will apply skills to gather data about current anti-trans legislation in the USA.
- libraries: `requests`, `bs4`, and `pandas`
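As a rough illustration of the looping and conditional logic behind extracting data from a page, here is a minimal sketch. It makes several assumptions: it uses the standard library's `html.parser` as a stand-in for `bs4` so that it runs without extra installs, it makes no network request, and the HTML snippet, the `CellCollector` class, and the `extract_rows` function are invented for illustration and are not taken from the workshop materials.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded page (placeholder values, not real data).
SAMPLE_HTML = """
<table>
  <tr><td class="bill">Bill A</td><td class="state">State X</td></tr>
  <tr><td class="bill">Bill B</td><td class="state">State Y</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a table cell.
        if self._in_cell and data.strip():
            self.rows[-1].append(data.strip())


def extract_rows(html):
    """Return the table cells in `html` as a list of rows."""
    parser = CellCollector()
    parser.feed(html)
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

In the workshop itself, fetching the page and parsing it would be handled by `requests` and `bs4`, and the resulting rows could be loaded into a `pandas` DataFrame.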
Explores preparing text data for analysis, with emphasis on removing unwanted elements that may skew analysis. Participants will continue to build skills in algorithmic thinking while learning to write functions and scripts for customizing and automating text cleaning processes.
- libraries: `pandas`, `spacy`
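To give a sense of what a reusable cleaning function might look like, here is a minimal sketch. It assumes that the unwanted elements are HTML tags, bracketed notes, and extra whitespace, and it uses only the standard library's `re` module rather than `pandas` or `spacy`. The `clean_text` function and the sample string are hypothetical and are not taken from the workshop materials.

```python
import re


def clean_text(text):
    """Strip HTML tags, drop bracketed notes, and normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # replace HTML tags with a space
    text = re.sub(r"\[[^\]]*\]", " ", text)    # drop bracketed notes such as [amended]
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


# Invented sample input, for illustration only.
sample = "<p>SECTION 1.   Short   title [amended]</p>\n"
print(clean_text(sample))
```

Wrapping the steps in a function makes it possible to apply the same cleaning to every document in a dataset, and to adjust or reorder the patterns as the data requires.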
Explores methods for finding and analyzing textual patterns through popular tasks in Natural Language Processing. Participants practice writing code to annotate and extract text according to specific features from current “anti-trans” bills in the USA.
- libraries: `spaCy`
With the anti-trans bills data that they prepared in previous workshops, participants practice fine-tuning a small Text Generation model and learn about how to use Machine Learning for research.
- libraries: `transformers`
See more workshop offerings (including on Python) at the Princeton University Library. We have upcoming workshops on working with data, digital publishing, and more.
Want to talk Python or another digital project or tool? Sign up for a consultation with Digital Scholarship at Princeton.
The front-facing website that hosts the workshops is built using Jupyter Book and displayed on GitHub servers via GitHub Pages.
Making changes to the website is much like making changes to any GitHub repository, with the added step of pushing the built site to a separate GitHub Pages branch. Follow the steps below to make changes and update the website.
Before starting, you'll need to install a few pieces of software:
- Python (I use the Anaconda distribution, but any installation of Python 3 or later works).
- Jupyter-Book for building the website.
- Git versioning software for sending files to a GitHub server, where they will be hosted. You will also need a GitHub account.
- First, clone the repository onto your computer by typing the following into your command line. After that, you'll have your "book" (the Jupyter-Book repository) on your local machine.

      git clone https://github.com/PULdischo/python-for-text.git

- Second, make changes to the files as needed. Maybe you want to add a new page or a new workshop. Jupyter-Book files can be Markdown files or Jupyter notebooks (`.ipynb` files). If you are adding a new file or section to the workshop, make sure to list the new material in the `_toc.yml` (table of contents) file so that it appears in the sidebar. To learn more about how to create and modify files, check the excellent documentation on Jupyter-Book.
- Third, "build" the book by running the command below in your command line, making sure you are one directory above your book. For example, if you cloned the book into your Desktop folder, make sure you are in the Desktop folder (rather than the book's folder) when you run it.

      jupyter-book build [book's name]

  The build process will create a `_build` folder in your book, which contains all of the HTML files necessary to display your content in a web browser.
- After building the book, you can push your changes to GitHub. Here you can add, commit, and push changes as you would for a normal repository.

      cd bookname
      git add .
      git commit -m "updating files"
      git push

- The final step is another GitHub push, but this time to a new branch called `gh-pages`. Pushing to the `gh-pages` branch uploads the HTML files so viewers can see them rendered in the browser.

  To push to `gh-pages`, you will need to install a software package called `ghp-import`. To install that package, run the command below. (Note: you only need to install the package once; every time after that, you can simply push your changes.)

      pip install ghp-import

  Finally, back on your command line, push your changes to `gh-pages` using the `ghp-import` command:

      ghp-import -n -p -f _build/html

- Note: only follow this step if you are setting up a completely new repository, such as on your own account. In this case, you need to tell GitHub explicitly to create a GitHub Pages site from the `gh-pages` branch that you just pushed. Go to your GitHub repository's settings (check the toolbar at the top of your repo), select "Pages" from the tabs on the left, and, in the dropdown under "Build and Deployment," configure your repo to build from the `gh-pages` branch.
In a few minutes, your site should be visible at https://PULdischo.github.io/bookname. For this repo, the link is https://PULdischo.github.io/python-for-text. If you're experiencing problems, read more about pushing to GitHub Pages.
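If you rebuild and republish often, the build and `gh-pages` steps above could be wrapped in a small Python helper. The sketch below is only one possible approach. The function names `publish_steps` and `publish` and the `dry_run` flag are hypothetical, and it assumes that `jupyter-book` and `ghp-import` are installed and that the book folder is a Git repository. It does not add, commit, or push your source files.

```python
import subprocess
from pathlib import Path


def publish_steps(book_dir):
    """Return (command, working directory) pairs for building and publishing."""
    book = Path(book_dir)
    return [
        # Build from one directory above the book, as described above.
        (["jupyter-book", "build", book.name], book.parent),
        # Push the built HTML to the gh-pages branch from inside the book folder.
        (["ghp-import", "-n", "-p", "-f", "_build/html"], book),
    ]


def publish(book_dir, dry_run=True):
    """Print each step, or run it when dry_run is False."""
    for command, cwd in publish_steps(book_dir):
        if dry_run:
            print(f"(in {cwd}) {' '.join(command)}")
        else:
            # check=True stops at the first command that fails.
            subprocess.run(command, cwd=cwd, check=True)


# Dry run: prints the commands without executing anything.
publish("python-for-text")
```

Calling `publish("python-for-text", dry_run=False)` would run both commands in order. Keeping the dry run as the default makes it easy to check the commands and working directories first.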
Created by Filipa Calado, Digital Scholarship Specialist, Princeton University Library.
The first workshop, "Intro to Python," is adapted from the Graduate Center Digital Initiatives Digital Humanities Research Institute Python workshop. The opening challenge from this workshop takes text from the Feminist Data Manifest-No by M. Cifor, P. Garcia, et al.
For more instruction with Python, please see these books:
- WJB Mattingly's Introduction to Python for Humanists
- Melanie Walsh's Introduction to Cultural Analytics & Python
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.