tade0726 / careerhub_scrape

This repository houses a Scrapy project that specializes in scraping job postings from the University of Adelaide Careerhub.


Careerhub Scrape

This repository houses a Scrapy project that specializes in scraping job postings from the University of Adelaide Careerhub and storing the results in a MongoDB database. The project is optimized for swiftly scraping hundreds of pages in a matter of minutes and is supplemented with Docker support to facilitate setup and execution.
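Storing each item in MongoDB is typically done with a Scrapy item pipeline. The sketch below is a hypothetical example, not the project's actual code: the class name, database name, and collection name are assumptions, and the `pymongo` import is deferred so the snippet can be read without the driver installed.

```python
# Hypothetical Scrapy item pipeline that upserts scraped jobs into MongoDB.
# Names (MongoJobPipeline, "careerhub", "jobs") are illustrative assumptions.

class MongoJobPipeline:
    """Insert each scraped job item into a MongoDB collection."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="careerhub"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection = None

    def open_spider(self, spider):
        # Deferred import: only needed when the spider actually runs.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name]["jobs"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert on the job URL so re-running the spider does not create duplicates.
        self.collection.update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item
```

Upserting on the job URL (rather than inserting blindly) keeps repeated crawls idempotent, which matters when the spider is re-run over the same few hundred pages.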

The spider navigates two primary job channels, graduate-employment and internship, collecting job detail links. From each job's detail page it extracts key information such as the open and close dates.
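Extracting open/close dates from a detail page usually comes down to a small parsing helper. The following is a minimal sketch, assuming the dates appear in the page text with labels like "Opens:" and "Closes:" followed by a day-month-year date; the actual CareerHub markup and labels may differ.

```python
# Hypothetical date-extraction helper; the "Opens:"/"Closes:" labels and the
# "1 March 2023" date format are assumptions about the page text.
import re
from datetime import datetime

DATE_RE = re.compile(r"(Opens|Closes):\s*(\d{1,2} \w+ \d{4})")

def extract_dates(text):
    """Return a dict mapping 'Opens'/'Closes' to datetime objects."""
    dates = {}
    for label, raw in DATE_RE.findall(text):
        dates[label] = datetime.strptime(raw, "%d %B %Y")
    return dates
```

Parsing into `datetime` objects (rather than storing raw strings) lets MongoDB sort and filter jobs by closing date later.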

Be aware: due to UI changes on the site, the code related to XML parsing may be broken.

Web examples

Navigation page: [screenshot of the CareerHub navigation page]

Scraped data: [example document in MongoDB]


Getting Started

To get started, clone the repository:

git clone https://github.com/tade0726/careerhub_scrape.git

Docker Setup

To use Docker, ensure that you have Docker and Docker Compose installed. Build the Docker environment and start the containers with the following commands:

docker-compose build
docker-compose up

Alternatively, you can use the provided Makefile commands:

make build
make up

Create a mongodb-data/ folder in the repository root; it will be used as MongoDB's data directory.

Local Python Environment

To set up a local Python environment with conda, use the provided environment.yml file with the following commands:

conda env create -f environment.yml

Alternatively, you can use the provided Makefile command:

make build_python_env

Running the Spider

Because CareerHub sits behind the university network, you must be a student of the University of Adelaide. Put your student ID and password in a .secrets file in the root directory; they will be read by settings.py:

Example of .secrets

# Export the credentials as environment variables
export USERNAME="your-student-id"
export PASSWORD="your-password"
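On the Python side, settings.py can pick these variables up from the environment. The function below is a minimal sketch, assuming the variable names match the .secrets example above; the project's actual settings.py may read them differently.

```python
# Hypothetical credential loader; variable names USERNAME/PASSWORD follow
# the .secrets example, not verified against the project's settings.py.
import os

def load_credentials():
    """Read CareerHub credentials from the environment, failing loudly if absent."""
    username = os.environ.get("USERNAME")
    password = os.environ.get("PASSWORD")
    if not username or not password:
        raise RuntimeError(
            "Set USERNAME and PASSWORD (see .secrets) before running the spider."
        )
    return username, password
```

Failing loudly at startup is preferable to letting the spider run and receive login redirects on every request.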

To start the spider, use the following command:

source .secrets && cd career_hub && python main.py

Alternatively, you can use the provided Makefile command:

make run

Contributing

If you would like to contribute to the project, please fork the repository, make your changes, and submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
