This repository contains data pipelines built with the Yaetos data framework (github.com/arthurprevot/yaetos). The code for these pipelines is in the "jobs" folder. Most pipelines are set up with small sample inputs, so they should work out of the box.
- Data pipeline to pull information out of ChatGPT programmatically, to feed into datasets.
- Data pipeline to fine-tune a "small" open source LLM called ALBERT for classification, and to run inference. The model is small enough to run on a laptop in minutes for the test case (no GPU needed).
- Data pipeline to feed documents (pdf, text) to the privateGPT vector database to add knowledge to a local LLM.
- Data pipeline to process carbon emissions data from Climate TRACE (https://climatetrace.org/), with a sample dashboard available here.
- Data pipeline to process images (could be satellite, medical, etc.) to find contours (at scale, using Spark).
- Data pipeline to pull employee contact information out of Apollo.io for a set of companies.
- Data pipeline to pull information about GitHub contributors using the GitHub API.
- Data pipeline to showcase Yaetos core functionalities, using public Wikipedia data.
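
To give a feel for what one of these pipelines does, here is a minimal sketch of the GitHub contributors idea. It is not the job code from the "jobs" folder and does not use Yaetos: it only calls GitHub's public "list repository contributors" REST endpoint with the Python standard library. The function names (`contributors_url`, `to_rows`, `fetch_contributors`) and the output columns are assumptions made for illustration.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def contributors_url(owner, repo, per_page=100):
    """Build the URL of GitHub's 'list repository contributors' endpoint."""
    return f"{API_ROOT}/repos/{owner}/{repo}/contributors?per_page={per_page}"


def to_rows(payload, owner, repo):
    """Flatten the JSON list returned by the API into one dict per contributor.

    The column names are hypothetical; the real job may keep other fields.
    """
    return [
        {
            "repo": f"{owner}/{repo}",
            "login": item["login"],
            "contributions": item["contributions"],
        }
        for item in payload
    ]


def fetch_contributors(owner, repo):
    """Fetch the first page of contributors (unauthenticated, so rate-limited)."""
    request = urllib.request.Request(
        contributors_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return to_rows(json.load(response), owner, repo)
```

Calling `fetch_contributors("arthurprevot", "yaetos")` would return one row per contributor on the first page of results. Larger repositories need pagination and an access token, which this sketch leaves out.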
There is plenty of room for improvement, and contributions are welcome. Feel free to reach out at arthur@yaetos.com.