An ETL pipeline app which fetches job data from USAJOBS
- ETL script querying the USAJOBS API
  - Search using the "data engineering" keyword
  - Parse data for results of interest to a job-seeker based in Chicago, IL with 5 years of experience
    - At least the fields: PositionTitle, PositionURI, PositionLocation, PositionRemuneration
  - Load parsed results into the Mongo database
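The search and parse steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes the public USAJOBS Search endpoint with its `Host`, `User-Agent`, and `Authorization-Key` headers and a response nested under `SearchResult.SearchResultItems[].MatchedObjectDescriptor`. The function names and the `"Chicago, Illinois"` location string are hypothetical.

```python
from typing import Any

# Assumed endpoint of the public USAJOBS Search API.
SEARCH_URL = "https://data.usajobs.gov/api/search"

# The minimum set of fields named in the brief.
FIELDS = ("PositionTitle", "PositionURI", "PositionLocation", "PositionRemuneration")


def build_request(api_user: str, api_key: str,
                  keyword: str = "data engineering",
                  location: str = "Chicago, Illinois") -> dict[str, Any]:
    """Assemble the URL, headers, and query parameters for one search call."""
    return {
        "url": SEARCH_URL,
        "headers": {
            "Host": "data.usajobs.gov",
            "User-Agent": api_user,        # the account e-mail / username
            "Authorization-Key": api_key,
        },
        "params": {"Keyword": keyword, "LocationName": location},
    }


def extract_jobs(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only the fields of interest from each search result item."""
    items = payload.get("SearchResult", {}).get("SearchResultItems", [])
    jobs = []
    for item in items:
        descriptor = item.get("MatchedObjectDescriptor", {})
        jobs.append({name: descriptor.get(name) for name in FIELDS})
    return jobs


def fetch_jobs(api_user: str, api_key: str) -> list[dict[str, Any]]:
    """Query the API once and return the parsed results."""
    import requests  # third-party; imported here so the helpers above need no install

    req = build_request(api_user, api_key)
    response = requests.get(req["url"], headers=req["headers"],
                            params=req["params"], timeout=30)
    response.raise_for_status()
    return extract_jobs(response.json())
```

Keeping `extract_jobs` free of network calls means the parsing logic can be unit-tested against a saved response.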
This application was created using Docker version 20 and Docker Compose version 2.
Make sure you create a .env file in the root directory with the keys API_USER=<your USAJOBS API username> and API_KEY=<your api key>
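As a small illustration of how these two keys might be consumed, the sketch below reads them from the process environment and fails early if either is missing. It assumes the values from .env reach the container as environment variables (for example through Docker Compose); `load_credentials` is a hypothetical helper, not a function from this repository.

```python
import os
from typing import Mapping


def load_credentials(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (API_USER, API_KEY), raising if either is missing or empty."""
    missing = [key for key in ("API_USER", "API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return env["API_USER"], env["API_KEY"]
```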
- Run the application with ./run.sh (you may need to update permissions with chmod +x run.sh)
- Open your browser on localhost:8000/docs to view the Swagger interface and trigger the ETL pipeline with the /etl endpoint
  - Default parameters match project brief requirements
- View current stored jobs with the /jobs endpoint
- Alternatively, go to port 8081 to view the Mongo UI (mongo express) using mexpress for username and password
- Wipe the collection with /wipe
- FastAPI used as the main interface to trigger the ETL pipeline and view current database contents
- Using the requests library to query the Jobs API provider
  - Passing in various query parameters e.g. location, job keyword, remuneration
- Parse the response and use a dataclass to enforce the record schema for ingestion by MongoDB
- MongoDB is used here because the pipeline does not create multiple tables with relationships; for a simple ETL job a NoSQL approach is appropriate
- Use an upsert operation with bulk write to Mongo to update existing records or insert new ones
  - Return the results of the upsert operation to view records added or modified
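The dataclass and bulk upsert steps above might look something like the sketch below. It is an illustration under stated assumptions: `JobRecord`, `to_upsert_specs`, and `bulk_upsert` are hypothetical names, and using `PositionURI` as the unique key for matching existing documents is an assumption, not something the project confirms. The pymongo calls used (`UpdateOne(..., upsert=True)` and `Collection.bulk_write`) are the library's standard bulk API.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical record schema; attribute names mirror the API field names."""
    PositionTitle: str
    PositionURI: str
    PositionLocation: list[Any] = field(default_factory=list)
    PositionRemuneration: list[Any] = field(default_factory=list)


def to_upsert_specs(records: list[JobRecord]) -> list[tuple[dict, dict]]:
    """Pair each record with a (filter, update) spec keyed on PositionURI (assumed unique)."""
    return [({"PositionURI": r.PositionURI}, {"$set": asdict(r)}) for r in records]


def bulk_upsert(collection, records: list[JobRecord]) -> dict[str, int]:
    """Upsert all records in one round trip and report what changed."""
    from pymongo import UpdateOne  # third-party; imported here so the schema code needs no install

    operations = [UpdateOne(flt, update, upsert=True)
                  for flt, update in to_upsert_specs(records)]
    if not operations:
        # bulk_write raises on an empty operation list, so return early.
        return {"inserted": 0, "modified": 0}
    result = collection.bulk_write(operations, ordered=False)
    return {"inserted": result.upserted_count, "modified": result.modified_count}
```

A dataclass does not check types at runtime, so the schema enforcement here is limited to required fields: constructing a `JobRecord` without `PositionTitle` or `PositionURI` raises `TypeError` before anything reaches the database.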
- Pipeline using GitHub Actions to test the application, build a Docker image, push it to AWS ECR, and create an AWS Lambda from the image
- The docker container has access to a Mongo Atlas cluster
- Contact me to view the live Lambda
Include the following secrets if you want to deploy this yourself:
- API_KEY: Jobs API key
- API_USER: Jobs API user
- AWS_ACCESS_KEY_ID
- AWS_DEFAULT_REGION
- AWS_ECR_IMAGE_URI
- AWS_SECRET_ACCESS_KEY
- MONGO_URI: e.g. Mongo Atlas connection URI
Since this application would be triggered on a regular schedule (e.g. daily), a serverless approach is a better fit than a continuous runtime, at least while the processing and ETL loads are relatively small
- Use AWS EventBridge to trigger the Lambda on a regular schedule
- Email the user a report of updated jobs
- Use AWS DocumentDB as an analogue to MongoDB
- An IaC approach could use CloudFormation to provision the Lambda and EventBridge rules during CI/CD
- For an application which requires higher throughput, this approach may not be optimal. E.g. I had to increase the Lambda memory allocation
- In this case, a more scalable solution could be to scale out the AWS Lambdas
- Or implement a continuous runtime approach, such as Kubernetes for horizontal scaling with replica sets, or an AWS ECS architecture
- Eventually, I/O bottlenecks with MongoDB will need handling, for example with sharding
- Analytics: Extend user API interface to display results on dashboards
- Further testing to include db.py unit tests
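As a rough illustration of the EventBridge scheduling idea listed above, the sketch below creates a daily rule and points it at the Lambda. The list suggests CloudFormation for this; the example swaps in direct boto3 calls (`put_rule`, `add_permission`, `put_targets`) purely to show the moving parts. The function name, rule name, and target id are hypothetical.

```python
def schedule_daily(events_client, lambda_client, function_arn: str,
                   rule_name: str = "usajobs-etl-daily") -> str:
    """Create a daily EventBridge rule that invokes the ETL Lambda; return the rule ARN."""
    # A rate expression fires on a fixed interval; a cron(...) expression would also work.
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression="rate(1 day)",
        State="ENABLED",
    )
    # Allow EventBridge to invoke the function. Re-running with the same
    # StatementId fails, so a repeatable deployment would need to handle that case.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    # Attach the Lambda as the rule's target.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "etl-lambda", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

With boto3 installed and credentials configured, this could be called as `schedule_daily(boto3.client("events"), boto3.client("lambda"), function_arn)`. Passing the clients in keeps the function testable with stand-ins.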