goweiting / project-infnet

UG4 Project


Informatics Collaboration Network & Topic Network

Focusing on the School of Informatics, University of Edinburgh, a collaboration network was created using information from the University's collection of research publications, the Edinburgh Research Explorer. More details in infnet-scrapper.

Using the publications scraped from the research explorer, topic models were inferred and topic-similarity networks[1] were generated. A collaboration network was also created, visualised and analysed.
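To make the topic-similarity network idea concrete, here is a minimal stdlib-only sketch (the author names, distributions, and threshold are invented for illustration, not taken from the project's code): each author is represented by a topic distribution, and an edge is added whenever the cosine similarity of two distributions exceeds a threshold.

```python
import math

def cosine(p, q):
    # Cosine similarity between two topic distributions
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def similarity_edges(topic_dists, threshold=0.8):
    """Return (author_i, author_j, similarity) edges above the threshold."""
    authors = sorted(topic_dists)
    edges = []
    for i, a in enumerate(authors):
        for b in authors[i + 1:]:
            s = cosine(topic_dists[a], topic_dists[b])
            if s >= threshold:
                edges.append((a, b, s))
    return edges

# Toy topic distributions over 3 topics (made up for this example)
dists = {
    "alice": [0.7, 0.2, 0.1],
    "bob":   [0.6, 0.3, 0.1],
    "carol": [0.1, 0.1, 0.8],
}
print(similarity_edges(dists))  # only the alice-bob edge passes the threshold
```

In the project the distributions come from the inferred LDA models rather than being hand-written as here.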


Directory

  1. Data

    • bin
      • scrapy : scripts for scraping using scrapy
      • pdfminer : binaries from pdfminer.six, plus scripts used to process PDFs with pdfminer
    • data_dblp : DBLP dataset; publication metadata is not stored due to its size, only the tokenised pickled files and the dictionary.
    • data_schoolofinf : Informatics dataset retrieved in Jan 2018
    • notebooks : steps taken to process the data and generate lookup tables for the remaining steps.
  2. infnet-analysis

    • notebooks : contains the Jupyter notebooks used to generate each informatics network.
      • community detection and the homophily test are carried out in analysis.ipynb
  3. embedding

    • notebooks : creation of topic-similarity networks
  4. topicModel

    • notebooks : generate topic models using Gensim's implementation of LDA; also explore the performance of each model
    • src : contains scripts to generate each topic model
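The homophily test mentioned under infnet-analysis can be illustrated with a toy, stdlib-only sketch (the groups, edges, and method here are invented for illustration and are not the notebook's exact procedure): compare the observed fraction of edges joining same-group authors against the fraction expected if group labels were independent of the network.

```python
from collections import Counter

def same_group_fraction(edges, group):
    # Observed fraction of edges whose endpoints share a group label
    same = sum(1 for u, v in edges if group[u] == group[v])
    return same / len(edges)

def expected_fraction(group):
    # Probability two distinct random nodes share a group, under independence
    n = len(group)
    counts = Counter(group.values())
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

# Invented research groups and co-authorship edges
group = {"a": "ML", "b": "ML", "c": "ML", "d": "Theory", "e": "Theory"}
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("c", "d")]

obs = same_group_fraction(edges, group)  # 4/5 = 0.8
exp = expected_fraction(group)           # 8/20 = 0.4
print(obs > exp)  # observed exceeds chance: suggests homophily in this toy network
```

A real test, as in the notebook, would use the actual network and a proper null model rather than this back-of-the-envelope comparison.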

Setting up

The project is still in development. To use the datasets and run the notebooks on your system, follow these instructions:

  1. The project is developed in Python 3.6. Using Anaconda to set up the virtual environment is easiest. You can get a copy of Miniconda by issuing the following commands:
$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh # For MacOSX
$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86.sh # For linux/ubuntu
$ bash Miniconda3-latest-MacOSX-x86_64.sh # Install miniconda onto your system
$ echo "export PATH=\""\$PATH":$HOME/miniconda3/bin\"" >> ~/.benv
$ source ~/.benv

*** NOTE: the project uses Python 3, not Python 2 ***

# Create conda environment (name infnet3) for project:
# Also install essential packages across all modules:
$ conda create -n infnet3 python=3 pandas matplotlib jupyter ipython ipykernel
$ source activate infnet3 # Activates the environment
(infnet3) $ <--- this prompt shows the successful activation of the environment.

Now, we have to install the required Python packages. This list is updated as the project progresses:

  1. For data pre-processing, additional packages are installed:
(infnet3) $ conda install scrapy # for scraping the research explorer
(infnet3) $ conda install nltk # this is used for creating tokens for topic modelling

1a. To configure NLTK, execute the following in a new terminal with infnet3 activated:

(infnet3) $ python # launch a Python 3 shell
> import nltk
> nltk.download('stopwords') # select `yes` when prompted.
> nltk.download('wordnet')
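The stopwords and WordNet resources above support the tokenisation step of the pipeline. A minimal stdlib sketch of what stopword filtering does (the tiny hand-written stopword list here merely stands in for NLTK's full `stopwords` corpus):

```python
import re

# A tiny stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "using"}

def tokenise(text):
    # Lowercase, extract alphabetic words, and drop stopwords
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

print(tokenise("A Collaboration Network for the School of Informatics"))
# ['collaboration', 'network', 'school', 'informatics']
```

The project's notebooks use NLTK's stopword list (and WordNet for lemmatisation) instead of this hand-rolled set.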
  2. For infnet-analysis:
(infnet3) $ conda install networkx numpy
(infnet3) $ pip install python-louvain # community detection package
  3. For topic modelling:

For topic modelling using latent Dirichlet allocation (LDA):

$ conda install gensim # for training LDA models
$ pip install pyldavis # for visualising the LDA models

For data exploration, visualisation of data and clustering:

$ conda install scikit-learn # for k-means, manifold, dbscan...
$ conda install -c conda-forge hdbscan
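The project uses scikit-learn's KMeans (and hdbscan) for clustering; the loop those libraries implement can be sketched in plain Python on 1-D points (this toy version is purely illustrative and not the project's code):

```python
def kmeans_1d(points, centres, iters=10):
    """Toy Lloyd's algorithm on 1-D points with given initial centres."""
    for _ in range(iters):
        # Assign each point to its nearest centre
        clusters = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its assigned points
        centres = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centres)

# Two obvious clusters around 1 and 9.5 (invented data)
print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], centres=[0.0, 5.0]))
```

scikit-learn's KMeans additionally handles multi-dimensional data, smarter initialisation (k-means++), and convergence checks, which is why the notebooks use it rather than anything like this.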
