DataEngineeringPilipinas - a PyData group

Data Engineering Pilipinas is a community for data engineers, data analysts, data scientists, developers, AI / ML engineers, and users of closed and open source data tools and methods / techniques in the Philippines. Data Engineering Pilipinas is a PyData group.

This page serves as a repository of notes, thoughts, ideas, plans, dreams, datasets, analyses, and whatever else we think of.

As the name suggests, our community focuses on all data career paths with emphasis on data engineering.

Awesome Data Engineering Repository from the Philippines

Join our growing community!

Study Roadmap
Free Study Resources
Data Storage & Databases
Data Ingestion
Data Formats
Stream Processing
Batch Processing
Workflow Orchestration
Data Transformation
Data Governance
Data Platforms
Community Contents
PH-based Datasets (can be used for projects)

Study Roadmap

Data Engineering - by Sandy

DataEngineerRoadmap_Notion - Data Engineering roadmap with a variety of course options from free to paid.

By Nicksy via DataCamp

Data Engineering

Data Analyst

Best Practices

Query Optimizations

Table Indexes
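
To make the idea behind table indexes concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the index name are hypothetical and only serve the illustration; other databases expose query plans differently (for example through `EXPLAIN` in PostgreSQL and MySQL).

```python
import sqlite3


def query_plan(conn, sql):
    """Return the plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the plan detail text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT amount FROM orders WHERE customer_id = 42"

# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql)

# With an index on the filtered column, SQLite can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql)

print(plan_before)
print(plan_after)
```

Comparing the two printed plans shows the planner switching from a full scan to an index search, which is usually the first thing to check when a filtered query is slow.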

Free Resources

FREE_RESOURCES.md - Compilation of Free Resources that you may find helpful in your journey.
Cloud Free Tier - Contains articles comparing the free tier offers of the major cloud providers such as AWS, Azure, GCP, and Oracle Cloud.

Data Storage & Databases

Relational Databases

PostgreSQL - is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
MySQL - the most popular Open Source SQL database management system, is developed, distributed, and supported by Oracle Corporation.
Amazon Relational Database Service (RDS) - is a collection of managed services that makes it simple to set up, operate, and scale databases in the cloud. Choose from seven popular engines: Amazon Aurora with MySQL compatibility, Amazon Aurora with PostgreSQL compatibility, MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server.

Columnar Databases

Amazon Redshift - is a cloud data warehouse for storing, analyzing, and processing large amounts of data. Based on PostgreSQL, with a massively parallel processing (MPP) engine and architecture. Available as Provisioned or Serverless.
Google BigQuery - is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data.

Key-value

Redis - is an in-memory key-value store used as a cache, message broker, and streaming engine. Versions up to 7.2 are BSD licensed; later versions use different licenses.
Amazon DynamoDB - is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale.

Object Storage

Amazon S3 - is an object storage service offering industry-leading scalability, data availability, security, and performance.
Azure Blob Storage - massively scalable and secure object storage for cloud-native workloads, archives, data lakes, high-performance computing, and machine learning.
Google Cloud Storage - is a managed service for storing unstructured data. Store any amount of data and retrieve it as often as you like.

Data Ingestion

Apache Kafka - a distributed event streaming platform.
- Apache Kafka (open-source)
- Apache Kafka (Confluent) - a fully managed Apache Kafka service that offers support from Kafka committer-led experts, a 99.99% uptime SLA, and more. Confluent's Kafka offering is cloud-native.
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) - is a fully managed Kafka service that operates, maintains, and scales Apache Kafka clusters, provides enterprise-grade security features out of the box, and has built-in AWS integrations that accelerate development of streaming data applications. Amazon MSK is cloud-hosted.
AWS SDK for pandas (AWS Wrangler) - an open-source Python initiative that extends the power of the pandas library to AWS, connecting DataFrames and AWS data & analytics services.
Amazon Kinesis - a fully managed, cloud-based service for real-time data processing over large, distributed data streams.
Airbyte - A data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes.
- Open-source
- Airbyte Cloud
Pentaho Data Integration (Kettle) - consists of a core data integration (ETL) engine, and GUI applications that allow the user to define data integration jobs and transformations.
- Community Edition

Data Formats

Apache Avro - is the leading serialization format for record data, and first choice for streaming data pipelines.
Apache Parquet - is an open source, column-oriented data file format designed for efficient data storage and retrieval.
Apache ORC - the smallest, fastest columnar storage for Hadoop workloads.
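
As a rough illustration of why column-oriented formats such as Parquet and ORC suit analytics, the sketch below pivots a few row records into column lists in plain Python. The sample records and the `to_columns` helper are hypothetical; real columnar formats add encoding, compression, and metadata on top of this idea.

```python
# Hypothetical sample data in a row-oriented layout (one dict per record).
rows = [
    {"city": "Manila", "year": 2023, "sales": 120.0},
    {"city": "Cebu", "year": 2023, "sales": 80.0},
    {"city": "Davao", "year": 2023, "sales": 95.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full record is visited just to total one field.
total_from_rows = sum(record["sales"] for record in rows)

# Column layout: only the "sales" values are read; other columns are untouched.
total_from_columns = sum(columns["sales"])

print(columns)
print(total_from_rows, total_from_columns)
```

Both totals are the same; the difference is how much data has to be read to compute them, which is what makes columnar files efficient for aggregations over a few columns.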

Data Storage Framework

Delta Lake - is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python. Led by Databricks.
Apache Iceberg - is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive and Impala using a high-performance table format that works just like a SQL table. Originally developed at Netflix.
Apache Hudi - (pronounced "Hoodie") stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage). Originally developed at Uber.

Batch Processing

Frameworks and Libraries

Apache Spark - is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
- PySpark
- Scala
- Pandas API on Spark - Runs pandas-style functions on top of Spark
- Spark SQL - List of SQL Functions
- SparkR
- Java
Polars (Python) - is a lightning fast DataFrame library/in-memory query engine.
Dask (Python) - is a flexible library for parallel computing in Python.

SQL

Presto - is a distributed SQL query engine for big data that allows you to run SQL queries against various data sources.
Apache Hive - is built on top of Apache Hadoop. A distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.
Apache Drill - is an Apache open-source SQL query engine for Big Data exploration.
Trino - is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.

Managed Services (Cloud)

Amazon EMR (Elastic MapReduce) - is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto.
AWS Glue - is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development.

Stream Processing

Spark Streaming (DStreams) - an extension of the core Spark API for processing live data streams. A legacy API, superseded by Structured Streaming (introduced in Spark 2.0) and deprecated as of Spark 3.4.
- Spark Structured Streaming (DataFrames) - is a stream processing engine built on the Spark SQL engine.
Apache Flink - is a framework and distributed processing engine for stateful computations over data streams.
Apache Storm - is a free and open source distributed realtime computation system. It aims to do for realtime processing what Hadoop did for batch processing.
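
To give a feel for what "stateful computation over a stream" means, here is a minimal pure-Python sketch of a tumbling-window count. The event tuples and the `tumbling_window_counts` function are hypothetical and are not the API of Spark, Flink, or Storm; real engines also handle out-of-order events, checkpointing, and distribution across workers.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    """
    state = defaultdict(int)  # (window_start, key) -> running count
    for timestamp, key in events:
        # Assign the event to the window that contains its timestamp.
        window_start = (timestamp // window_seconds) * window_seconds
        state[(window_start, key)] += 1
    return dict(state)


# Hypothetical event stream: (timestamp in seconds, event type).
events = [(1, "click"), (4, "view"), (7, "click"), (12, "click"), (14, "view")]
counts = tumbling_window_counts(events, window_seconds=10)
print(counts)
```

The `state` dictionary is the part a streaming engine would persist and recover after a failure; here it simply lives in memory for the length of the loop.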

Data Stores

Apache Druid - is a high performance, real-time analytics database that delivers sub-second queries on streaming and batch data at scale and under load.
Apache Pinot - is a realtime distributed OLAP datastore, designed to answer OLAP queries with low latency.

Workflow Orchestration

Apache Airflow - is an open-source workflow management platform for data engineering pipelines. Built by Airbnb.
- Astronomer
- Amazon Managed Workflows for Apache Airflow (MWAA)
Mage - Open-source data pipeline tool for transforming and integrating data. Positions itself as a modern replacement for Airflow.
Dagster - An orchestration platform for the development, production, and observation of data assets.
- Open-source
- Dagster Cloud
Prefect - is a workflow orchestration tool empowering developers to build, observe, and react to data pipelines.
- Open-source
- Prefect Cloud
Kestra - is a universal open-source orchestrator that makes both scheduled and event-driven workflows easy.
- Open-source
- Kestra Cloud
AWS Step Functions - is a fully managed service that makes it easier to coordinate the components of distributed applications and microservices using visual workflows.
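
The core idea shared by the orchestrators above is running tasks in an order that respects their dependencies. The sketch below shows that idea with Python's standard-library graphlib (Python 3.9+). The task names and the `run_pipeline` helper are hypothetical and are not the API of Airflow, Dagster, Prefect, or any other tool listed here; real orchestrators add scheduling, retries, logging, and distributed execution.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform", "quality_check"},
}


def run_pipeline(dag, tasks):
    """Run task callables in an order that respects every dependency."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Placeholder task bodies; a real task would move or transform data.
tasks = {name: (lambda n=name: print(f"running {n}")) for name in dag}
order = run_pipeline(dag, tasks)
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a cycle, which mirrors the requirement in most orchestrators that a workflow be a directed acyclic graph.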

Data Transformation

Frameworks

Data Build Tool (dbt) - is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.
- dbt-core (open-source)
- dbt Cloud
SQLMesh - is an open source data transformation framework that brings the best practices of DevOps to data teams. It enables data scientists, analysts, and engineers to efficiently run and deploy data transformations written in SQL or Python.

Data Governance

Enterprise Data Catalog

DataHub Project - is an extensible metadata platform that enables data discovery, data observability and federated governance to help tame the complexity of your data ecosystem. Available as open source and as a managed service. Originally built at LinkedIn.
OpenMetadata - A Single Place to Discover, Collaborate and get your Data Right. Open-source. Inspired by Uber's metadata platform.
Apache Atlas - is an open-source metadata and big data governance framework which helps data users collaborate on their data assets. Incubated by Hortonworks.
Amundsen - Open source data discovery and metadata engine. Created by Lyft.

Data Quality/Observability

Great Expectations - a platform for Data Quality.
- Open-source - is a Python library that provides a framework for describing the acceptable state of data and then validating that the data meets those criteria.
- Cloud (SaaS)
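
The idea of describing the acceptable state of data and then validating against it can be sketched in a few lines of plain Python. The function names, result shape, and sample rows below are hypothetical and are not the Great Expectations API; they only illustrate the concept of declarative data checks.

```python
def expect_not_null(rows, column):
    """Check that no row has a missing value in `column`."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "success": not failed, "failed_rows": failed}


def expect_between(rows, column, low, high):
    """Check that non-null values in `column` fall within [low, high]."""
    failed = [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"check": f"{column} between {low} and {high}", "success": not failed, "failed_rows": failed}


# Hypothetical sample data with one missing age and one implausible age.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 212},
]

results = [
    expect_not_null(rows, "id"),
    expect_not_null(rows, "age"),
    expect_between(rows, "age", 0, 120),
]

for result in results:
    print(result)
```

Keeping each check as a small function that returns a structured result makes it easy to run the same suite in a pipeline step and fail the run when any `success` value is false.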

Data Platforms

Databricks - Founded by the creators of Apache Spark. Combines the data warehouse and the data lake into a single lakehouse platform. Unified. Open. Scalable. Try it free for 14 days. We suggest choosing AWS as the cloud platform.
Snowflake - The Data Cloud. Cloud-native Data Warehouse Platform. Consists of Cloud Services Layer, Compute Layer, and Data Storage Layer. Try it free for 30 days.

Data dumps

Work in progress

Community Contents

Getting Started with Data Engineering

Description:

This video provides an introduction to Data Engineering. In partnership with StudevPH, with guest speaker Josh Dev.

Link: https://www.facebook.com/studevph/videos/165090273259790

EP1 - Unlocking Your Future: Building a Career in PH Tech Startups

Description:

This video discusses building a career in PH Tech Startups.

In partnership with Filipino Web Development Peers. Hosted by FWDP Founder David Genesis Pedeglorio.
Guest Speaker: Andoy Montiel, Chief Data Officer of Packworks

Link: https://youtu.be/pzxFTFB8f6s

Unlocking Career Opportunities: Philippine Skills Framework for AI and Analytics

Description:

This video discusses careers in Analytics in the Philippines.

Guest: Sherwin Pelayo. Hosted by Doc Ligot.

Link: https://www.youtube.com/watch?v=_CjsYi9ivlc

TechSync 2023: Synchronizing Filipino Tech Communities

Description:

A FREE online event featuring content creators and thought leaders in the tech field:

JP "Sir JP" Lazaro & Rhea Alum, StudevPH
Seiji Villafranca, Angular PH
David Genesis Pedeglorio & Renzo Marl Peralta, Filipino Web Development Peers
Josh "Josh Dev" Valdeleon, Data Engineering Pilipinas
Hosted by Kuya Dev and Doc Ligot

Link: https://www.facebook.com/watch/live/?ref=watch_permalink&v=1043806640102969

filWebDev Talks Ep. 10 ft. Myk Ogbinar of AI Network PH

Description:

This video provides an introduction to a career in data. In partnership with Filipino Web Development Peers and AI Network PH, with guest speaker Myk Ogbinar.

Link: https://www.facebook.com/fwdpeers/videos/220432324246191

Kamustahan: A Panel Discussion on Building a Career in Tech

Description:

A podcast-style session with Myk, Allan Aquino, JP Acuna, JP Lazaro, and Rod Basa (plus a special post from Sir Nino of Learn with Jon) about navigating a tech career.

Link: https://www.youtube.com/watch?v=jcFALIHBSuQ

Exclusive Interview: Sandy Lauguico's Data Engineering Transition

Description:

In this exclusive interview, join me (Doc Ligot) as we explore the inspiring journey of Sandy Lauguico, who made a remarkable shift into the world of Data Engineering. Gain valuable insights, tips, and first-hand experiences as Sandy shares her transformation story. Discover the exciting world of data and engineering with us!

Link: https://www.youtube.com/watch?v=8pJMFi3kIfQ

From Dropout to Tech Star: Aemy Obinguar's Inspirational Tale

Description:

In this exclusive interview, join me (Doc Ligot) as I delve into the remarkable journey of Aemy Obinguar, who defied the odds by dropping out of school and ultimately rising to the position of Chief Technology Officer (CTO). Discover her insights, challenges, and successes in the tech world. Don't miss this inspiring story of determination and achievement.

Link: https://www.youtube.com/watch?v=GZcYyILg3kc

Posts from the community

Kyle Escosia - A Data Engineer who is passionately curious in anything about data.
- Featured Contents
