quinnanya / proquest-tdm-word-vectors

Notebook and tutorial for running word vectors on data using ProQuest TDM Studio

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Downloading nltk data

Nltk (Natural language toolkit) is a library for doing some basic text processing. It requires a language-specific data file, which you have to upload separately to the environment. Go to the NLTK Corpora page, download the Punkt Tokenizer Models. Unzip the file that downloads, open the PY3 folder inside the unzipped punkt folder. You should see a file called english.pickle -- that's what you'll need to upload to the ProQuest TDM environment.

Uploading things

  • Hit the file-folder-opening icon at the very top of the screen. Click on "Temporary files", then choose "Upload files".
  • Navigate to the punkt/PY3 folder and upload the english.pickle file.
  • Upload the tdm_wordvectors.ipyb file with the word vectors code, then close the window.
  • Back in the ProQuest TDM browser tab, click "Upload" then go to "My Files" then "Temporary Files", then select english.pickle. Hit "upload" to confirm in the browser tab. Do the same for the .ipynb file with the word vector code.

Setting up the nltk data

  • In the ProQuest TDM Jupyter notebook browser tab, go to New > Folder. This will create something called Untitled Folder. Check the checkbox next to it, then hit the "Rename" button in the upper left. Rename it to nltk_data.
  • Click on the ntlk_data folder and create a new folder inside it. Rename it to tokenizers.
  • Click on the tokenizers folder and create a new folder inside it. Rename it to punkt.
  • Click on the punkt folder and create a new folder inside it. Rename it to PY3 (make sure it's all caps).
  • Click on the PY3 folder. Click on the upload button in the browser tab, choose My Files, then Temporary Files, then choose the english.pickle file.

Go back to the main Jupyter Notebook directory by clicking on the Jupyter logo in the upper left.

Creating an environment

  • In the browser, go to New > Terminal
  • Type conda create -n vectors ipykernel nltk gensim pandas lxml bs4
  • Type y when it asks if you want to proceed
  • Type source activate vectors
  • Type python -m ipykernel install --user --name=vectors
  • You can close or leave the tab with the terminal now

If you shut down the notebook server, you'll have to do this starting with source activate vectors when you restart.

Running the notebook

Click on the tdm_wordvectors.ipynb file. You may see a "Kernel not found" error. From the dropdown, choose "vectors" then "set kernel". Follow the instructions in the notebook (e.g. which values to change to the name of your corpus directory, etc.)

About

Notebook and tutorial for running word vectors on data using ProQuest TDM Studio

License:BSD 2-Clause "Simplified" License


Languages

Language:Jupyter Notebook 100.0%