yaetos_jobs

This repository contains data pipelines built with the Yaetos data framework (github.com/arthurprevot/yaetos). The code for these pipelines is in the "jobs" folder. Most pipelines are set up with small sample inputs, so they should work out of the box.

Generative AI:

Data pipeline to pull information out of ChatGPT programmatically, to feed into datasets.
Data pipeline to fine-tune a "small" open source LLM called ALBERT for classification, and to run inference. The model is small enough to run on a laptop in minutes for the test case (no GPU needed).
Data pipeline to feed documents (pdf, text) to the privateGPT vector database, to add knowledge to a local LLM.

Scientific (Climate data, image processing):

Data pipeline to process carbon emissions data from Climate TRACE (https://climatetrace.org/), with a sample dashboard available here
Data pipeline to process images (could be satellite, medical, etc.) to find contours (at scale, using Spark).
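
To give a feel for what "finding contours" means in the image pipeline above, here is a minimal pure-Python sketch. It is not the repository's implementation: the function name `boundary_pixels` and the nested-list mask format are hypothetical, chosen for illustration only. It marks a foreground pixel as a contour pixel when at least one of its four neighbours is background or lies outside the image.

```python
def boundary_pixels(mask):
    """Return the (row, col) foreground pixels that touch background or the image edge.

    `mask` is a hypothetical input format: a list of rows, each a list of 0/1 values.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    boundary = set()
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue  # background pixels are never part of a contour
            # Check the four direct neighbours (up, down, left, right).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < height and 0 <= nc < width)
                if outside or not mask[nr][nc]:
                    boundary.add((r, c))
                    break
    return boundary


# A 3x3 filled square inside a 5x5 image.
square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
contour = boundary_pixels(square)
```

Because the function only looks at one image at a time, each image can be processed independently of the others. That is the kind of workload that can be spread across Spark workers.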

Sales/Marketing:

Data pipeline to pull employee contact information out of Apollo.io for a set of companies.
Data pipeline to pull information from GitHub contributors using the GitHub API.
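
As a rough illustration of the GitHub pipeline above, the sketch below uses only the Python standard library to query the GitHub REST endpoint that lists repository contributors and to keep a few fields per contributor. It is not taken from the repository's job code. The helper names (`contributors_url`, `parse_contributors`, `fetch_contributors`) and the choice of output fields are assumptions made for this example.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def contributors_url(owner, repo, per_page=100, page=1):
    """Build the URL of the 'list repository contributors' endpoint."""
    return f"{API_ROOT}/repos/{owner}/{repo}/contributors?per_page={per_page}&page={page}"


def parse_contributors(payload):
    """Turn the JSON response body into a list of small, flat records."""
    records = json.loads(payload)
    return [
        {
            "login": record["login"],
            "contributions": record["contributions"],
            "profile_url": record["html_url"],
        }
        for record in records
    ]


def fetch_contributors(owner, repo, token=None):
    """Fetch the first page of contributors (requires network access).

    `token` is an optional personal access token; unauthenticated calls
    are subject to a lower rate limit.
    """
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = Request(contributors_url(owner, repo), headers=headers)
    with urlopen(request, timeout=30) as response:
        return parse_contributors(response.read().decode("utf-8"))
```

Calling `fetch_contributors("arthurprevot", "yaetos")` would return one record per contributor on the first page. Parsing is kept separate from fetching so that it can be checked against a saved response without network access.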

Other:

Data pipeline to showcase Yaetos core functionality, using public Wikipedia data.

There is plenty of room for improvement. Contributions are welcome. Feel free to reach out at arthur@yaetos.com.

About

Examples of data pipelines using yaetos

Apache License 2.0

Languages

Jupyter Notebook 97.3%, Python 2.4%, Scala 0.1%, Shell 0.1%, Dockerfile 0.0%