antgobar / job_tracker

Job Tracker ETL

An ETL pipeline app which fetches job data from USAJOBS

Brief

  • ETL script querying USAJOBS API
    • Search using the data engineering keyword
    • Parse data for results of interest to a job-seeker based in Chicago, IL with 5 years of experience
    • At least the fields: PositionTitle, PositionURI, PositionLocation, PositionRemuneration
  • Load parsed results into the mongo database
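
As a rough illustration of the query step in the brief, the sketch below builds the headers and query parameters for a USAJOBS search and fetches the result items. The function names, the default location string and the ResultsPerPage value are assumptions made for this example, not taken from the repository. The header and parameter names reflect the public USAJOBS search API as I understand it and should be checked against the official documentation.

```python
import os

# Public USAJOBS search endpoint (verify against the official API docs).
SEARCH_URL = "https://data.usajobs.gov/api/search"


def build_search_request(keyword, location, api_user, api_key):
    """Return (headers, params) for a USAJOBS search call.

    Hypothetical helper: the real project may structure this differently.
    """
    headers = {
        "Host": "data.usajobs.gov",
        "User-Agent": api_user,        # the address registered with the API
        "Authorization-Key": api_key,
    }
    params = {
        "Keyword": keyword,
        "LocationName": location,
        "ResultsPerPage": 50,          # assumed page size for the example
    }
    return headers, params


def fetch_jobs(keyword="data engineering", location="Chicago, Illinois"):
    """Query the API and return the raw list of search result items."""
    import requests  # third-party; imported lazily so the helper above stays stdlib-only

    headers, params = build_search_request(
        keyword, location, os.environ["API_USER"], os.environ["API_KEY"]
    )
    response = requests.get(SEARCH_URL, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["SearchResult"]["SearchResultItems"]
```

Keeping the request construction separate from the network call makes the parameters easy to unit test without hitting the API.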

Running this application locally

This application was created using Docker version 20 and Docker Compose version 2. Make sure you create a .env file in the root directory with the keys API_USER=<your USAJOBS API username> and API_KEY=<your API key>.

  • Run the application with ./run.sh (you may need to update permissions with chmod +x run.sh)
  • Open your browser on localhost:8000/docs to view the Swagger interface and trigger the ETL pipeline with the /etl endpoint
    • Default parameters match project brief requirements
  • View current stored jobs with the /jobs endpoint
  • Alternatively, go to port 8081 to view the mongo UI (mongo express), using mexpress as both the username and password
  • Wipe the collection with /wipe

App design

  • FastAPI used as the main interface to trigger the ETL pipeline and view current database contents
  • Using the requests library to query the Jobs API provider
  • Passing in various query parameters e.g. location, job keyword, remuneration
  • Parse response and use dataclass to enforce record schema for ingestion by MongoDB
  • MongoDB is used here because this simple ETL does not create multiple tables with relationships, so a NoSQL approach is appropriate
  • Use an upsert operation with a bulk write to mongo to update existing records or insert new ones
  • Return results of upsert operation to view records added or modified
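
The parse and load steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the JobRecord field names, the choice of the position URI as the upsert key, and the helper names are all assumptions. The response keys (MatchedObjectDescriptor, LocationName) follow the USAJOBS search response format as I understand it.

```python
from dataclasses import asdict, dataclass


@dataclass
class JobRecord:
    """Hypothetical record schema covering the fields named in the brief."""
    position_title: str
    position_uri: str
    position_location: list
    position_remuneration: list


def parse_item(item):
    """Map one search result item onto the record schema."""
    descriptor = item["MatchedObjectDescriptor"]
    return JobRecord(
        position_title=descriptor["PositionTitle"],
        position_uri=descriptor["PositionURI"],
        position_location=[
            loc["LocationName"] for loc in descriptor.get("PositionLocation", [])
        ],
        position_remuneration=descriptor.get("PositionRemuneration", []),
    )


def build_upserts(records):
    """Return (filter, update) pairs, keyed on the position URI (an assumed key)."""
    return [
        ({"position_uri": record.position_uri}, {"$set": asdict(record)})
        for record in records
    ]


def upsert_jobs(collection, records):
    """Bulk upsert records into a pymongo collection and summarise the result."""
    from pymongo import UpdateOne  # third-party; imported lazily

    operations = [
        UpdateOne(filter_doc, update_doc, upsert=True)
        for filter_doc, update_doc in build_upserts(records)
    ]
    if not operations:
        # bulk_write rejects an empty operation list, so return early.
        return {"inserted": 0, "modified": 0}
    result = collection.bulk_write(operations, ordered=False)
    return {"inserted": result.upserted_count, "modified": result.modified_count}
```

Using `$set` with `upsert=True` means a rerun of the pipeline updates records it has already stored instead of duplicating them, and the returned counts correspond to the "records added or modified" summary described above.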

CI/CD

  • Pipeline using GitHub Actions to test the application, build a Docker image, push it to AWS ECR, and create an AWS Lambda from the image
  • The docker container has access to a Mongo Atlas cluster
  • Contact me to view the live Lambda

Secrets

Include the following secrets if you want to deploy this yourself

  • API_KEY: Jobs API key
  • API_USER: Jobs API user
  • AWS_ACCESS_KEY_ID
  • AWS_DEFAULT_REGION
  • AWS_ECR_IMAGE_URI
  • AWS_SECRET_ACCESS_KEY
  • MONGO_URI: e.g. Mongo Atlas connection URI

Further ideas for implementing cloud deployment

Since this application would be triggered on a regular schedule (e.g. daily), a serverless approach is better suited than a continuous runtime, at least while the processing and ETL loads are relatively small.

  • Use AWS EventBridge to trigger the Lambda on a regular schedule
  • Email user updated jobs report
  • Use AWS DocumentDB as an analogue to MongoDB
  • An IaC approach could use AWS CloudFormation to provision the Lambda and EventBridge rules during CI/CD
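
To make the scheduling idea concrete, here is a minimal sketch that uses boto3 rather than CloudFormation to create a daily EventBridge rule and point it at a Lambda. The rule name, target Id and function arguments are hypothetical, and the sketch assumes AWS credentials with permission to manage EventBridge rules and Lambda permissions.

```python
def build_schedule_rule(rule_name, expression="rate(1 day)"):
    """Return the keyword arguments for an EventBridge put_rule call."""
    return {
        "Name": rule_name,
        "ScheduleExpression": expression,
        "State": "ENABLED",
    }


def schedule_lambda(rule_name, function_name, function_arn):
    """Create the rule, allow EventBridge to invoke the Lambda, attach the target."""
    import boto3  # third-party; imported lazily

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")

    rule_arn = events.put_rule(**build_schedule_rule(rule_name))["RuleArn"]

    # Grant EventBridge permission to invoke the function. A second run with
    # the same StatementId raises a conflict error, so this is one-time setup.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "job-tracker-etl", "Arn": function_arn}],  # hypothetical Id
    )
    return rule_arn
```

The same three resources (rule, invoke permission, target) would be declared in a CloudFormation template if the IaC route is preferred.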

Other considerations:

  • For an application which requires higher throughput, this approach may not be optimal. For example, I had to increase the Lambda memory allocation
  • In this case, a more scalable solution could be to scale the AWS Lambdas
  • Or implement a continuous runtime approach, such as Kubernetes for horizontal scaling with replica sets, or an AWS ECS architecture
  • Eventually, IO bottlenecks with MongoDB will need handling, for example with sharding

Future state ...

  • Analytics: Extend user API interface to display results on dashboards
  • Further testing to include db.py unit tests

Languages

  • Python 94.4%
  • Dockerfile 4.9%
  • Shell 0.7%