armlynobinguar / DataEngineeringPilipinas

Data Engineering Pilipinas is a community for data engineers, data analysts, data scientists, developers, AI / ML engineers, and users of closed and open source data tools and methods / techniques in the Philippines. Data Engineering Pilipinas is a PyData group.

Home Page: https://www.facebook.com/groups/dataengineeringpilipinas

DataEngineeringPilipinas - a PyData group

This page serves as a repository of notes, thoughts, ideas, plans, dreams, datasets, analyses, and whatever else we think of.

Data Engineering Domain

As the name suggests, our community focuses on all data career paths, with an emphasis on data engineering.

Awesome Data Engineering Repository from the Philippines

Join our growing community!

Contents

Study Roadmap

Data Engineering - by Sandy

By Nicksy via Data Camp

Data Engineering

Data Analyst

Best Practices

Query Optimizations

Free Resources

  • FREE_RESOURCES.md - A compilation of free resources that you may find helpful in your journey.
  • Cloud Free Tier - Contains articles comparing the free tier offers of the major cloud providers, such as AWS, Azure, GCP, and Oracle Cloud.

Data Storage & Databases

Relational Databases

  • PostgreSQL - is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
  • MySQL - the most popular Open Source SQL database management system, is developed, distributed, and supported by Oracle Corporation.
  • Amazon Relational Database Service (RDS) - is a collection of managed services that makes it simple to set up, operate, and scale databases in the cloud. Choose from seven popular engines: Amazon Aurora with MySQL compatibility, Amazon Aurora with PostgreSQL compatibility, MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server.

Columnar Databases

  • Amazon Redshift - Store, analyze, and process large amounts of data. A cloud data warehouse based on PostgreSQL, with an MPP (massively parallel processing) engine and architecture. Available in Provisioned or Serverless.
  • Google BigQuery - is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data.

Key-value

  • Redis - is an in-memory key-value store used as a cache, message broker, and streaming engine. Versions up to 7.2 are BSD licensed; later releases use different license terms.
  • Amazon DynamoDB - is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale.

Object Storage

  • Amazon S3 - is an object storage service offering industry-leading scalability, data availability, security, and performance.
  • Azure Blob Storage - massively scalable and secure object storage for cloud-native workloads, archives, data lakes, high-performance computing, and machine learning.
  • Google Cloud Storage - is a managed service for storing unstructured data. Store any amount of data and retrieve it as often as you like.

Data Ingestion

  • Apache Kafka - a distributed event streaming platform.
    • Apache Kafka (open-source)
    • Apache Kafka (Confluent) - A fully managed service for Apache Kafka that offers support from Kafka committer-led experts, a 99.99% uptime SLA, and more. Apache Kafka in Confluent is cloud-native.
    • Amazon Managed Streaming for Apache Kafka (Amazon MSK) - is a fully managed Kafka service that operates, maintains, and scales Apache Kafka clusters, provides enterprise-grade security features out of the box, and has built-in AWS integrations that accelerate development of streaming data applications. Apache Kafka in AWS is cloud-hosted.
  • AWS SDK for pandas (formerly AWS Data Wrangler) - an open source Python initiative that extends the power of the pandas library to AWS, connecting DataFrames and AWS data & analytics services.
  • Amazon Kinesis - A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
  • Airbyte - A data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes.
  • Pentaho Data Integration (Kettle) - consists of a core data integration (ETL) engine, and GUI applications that allow the user to define data integration jobs and transformations.
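
To make the idea of event streaming in the list above more concrete, here is a minimal in-memory sketch of an append-only log with per-consumer offsets, which is roughly the abstraction behind a single Kafka topic partition. It is plain Python and does not use any Kafka client API; `EventLog`, `append`, and `poll` are hypothetical names chosen for illustration only.

```python
from collections import defaultdict


class EventLog:
    """Toy append-only log, loosely modeled on one topic partition."""

    def __init__(self):
        self._records = []                # events are only ever appended
        self._offsets = defaultdict(int)  # next offset to read, per consumer

    def append(self, event):
        """Store an event and return the offset it was written at."""
        self._records.append(event)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return unread events for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = EventLog()
for event in ("signup", "login", "purchase"):
    log.append(event)

# Each consumer tracks its own position, so readers do not affect each other.
first_batch = log.poll("analytics", max_records=2)
second_batch = log.poll("analytics")
billing_batch = log.poll("billing")
```

Because the log is never mutated in place, a new consumer such as `billing` can still read every event from the beginning, even after `analytics` has already consumed them.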

Data Formats

  • Apache Avro - is the leading serialization format for record data, and first choice for streaming data pipelines.
  • Apache Parquet - is an open source, column-oriented data file format designed for efficient data storage and retrieval.
  • Apache ORC - the smallest, fastest columnar storage for Hadoop workloads.
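
The difference between record-oriented and column-oriented layouts can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Avro, Parquet, or ORC encode data on disk; the sample records and the `to_columns` helper are hypothetical.

```python
# Three records in row-oriented form: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Manila", "amount": 120.0},
    {"id": 2, "city": "Cebu", "amount": 80.5},
    {"id": 3, "city": "Davao", "amount": 99.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field only needs to touch that column's values.
total_amount = sum(columns["amount"])
```

Row layouts suit workloads that write or read whole records at a time, while column layouts suit analytical queries that scan a few fields across many records.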

Data Storage Framework

  • Delta Lake - is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python. Led by Databricks.
  • Apache Iceberg - is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive, and Impala using a high-performance table format that works just like a SQL table. Originally developed at Netflix.
  • Apache Hudi - (pronounced Hoodie), stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage). Developed by Uber.
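
The three operations in Hudi's name (upserts, deletes, and incrementals) can be illustrated with a small in-memory table. This is a conceptual sketch in plain Python under simplifying assumptions (a single writer and one change per commit), not the Hudi API; `ToyTable` and its methods are hypothetical names.

```python
class ToyTable:
    """In-memory table keyed by record key, tracking the commit of each change."""

    def __init__(self):
        self._rows = {}   # key -> (commit, value); a value of None marks a delete
        self._commit = 0

    def upsert(self, key, value):
        """Insert the record if the key is new, otherwise overwrite it."""
        self._commit += 1
        self._rows[key] = (self._commit, value)
        return self._commit

    def delete(self, key):
        """Record a delete as a tombstone so incremental readers can see it."""
        self._commit += 1
        self._rows[key] = (self._commit, None)
        return self._commit

    def snapshot(self):
        """Current state of the table, without deleted records."""
        return {k: v for k, (_, v) in self._rows.items() if v is not None}

    def incremental(self, since_commit):
        """Changes made after the given commit; None means the key was deleted."""
        return {k: v for k, (c, v) in self._rows.items() if c > since_commit}


table = ToyTable()
table.upsert("u1", {"name": "Ana"})
checkpoint = table.upsert("u2", {"name": "Ben"})
table.upsert("u1", {"name": "Ana Cruz"})  # update of an existing key
table.delete("u2")

current = table.snapshot()
changes = table.incremental(checkpoint)
```

A downstream job that remembers `checkpoint` only has to process `changes` instead of rescanning the whole table, which is the motivation for incremental reads.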

Batch Processing

Frameworks and Libraries

SQL

  • Presto - is a distributed SQL query engine for big data that allows you to run SQL queries against various data sources.
  • Apache Hive - is built on top of Apache Hadoop. A distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.
  • Apache Drill - is an Apache open-source SQL query engine for Big Data exploration.
  • Trino - is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.

Managed Services (Cloud)

  • Amazon EMR (Elastic MapReduce) - is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto.
  • AWS Glue - is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development.

Stream Processing

  • Spark Streaming (DStreams) - an extension of the core Spark API for processing live data streams. Now a legacy project, superseded by Structured Streaming (introduced in Spark 2.0).
  • Apache Flink - is a framework and distributed processing engine for stateful computations over data streams.
  • Apache Storm - is a free and open source distributed realtime computation system. Doing for realtime processing what Hadoop did for batch processing.

Data Stores

  • Apache Druid - is a high performance, real-time analytics database that delivers sub-second queries on streaming and batch data at scale and under load.
  • Apache Pinot - is a realtime distributed OLAP datastore, designed to answer OLAP queries with low latency.

Workflow Orchestration

Data Transformation

Frameworks

  • Data Build Tool (dbt) - is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.
  • SQLMesh - is an open source data transformation framework that brings the best practices of DevOps to data teams. It enables data scientists, analysts, and engineers to efficiently run and deploy data transformations written in SQL or Python.
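
The modularity these frameworks encourage comes from writing each transformation as a SELECT statement and building models in dependency order. The sketch below imitates that idea with the standard library only (`sqlite3` and `graphlib`, so Python 3.9 or later is assumed). It is not how dbt or SQLMesh are invoked; the `models` dictionary, the model names, and the `raw_orders` table are hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical models: name -> (upstream models, SELECT statement).
models = {
    "order_totals": (("stg_orders",), "SELECT SUM(amount) AS total FROM stg_orders"),
    "stg_orders": ((), "SELECT id, amount FROM raw_orders WHERE amount > 0"),
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, 10.0), (2, -5.0), (3, 20.0)],
)

# Resolve the build order so that upstream models are created first.
graph = {name: set(deps) for name, (deps, _) in models.items()}
build_order = list(TopologicalSorter(graph).static_order())

# Materialize each model as a view on top of its upstream models.
for name in build_order:
    conn.execute(f"CREATE VIEW {name} AS {models[name][1]}")

total = conn.execute("SELECT total FROM order_totals").fetchone()[0]
```

Keeping each step as its own named model means a change to the cleaning rule in `stg_orders` flows to every downstream model without editing them.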

Data Governance

Enterprise Data Catalog

  • DataHub Project - is an extensible metadata platform that enables data discovery, data observability, and federated governance to help tame the complexity of your data ecosystem. Available as open source and as a managed offering. Built by LinkedIn.
  • OpenMetadata - A Single Place to Discover, Collaborate and get your Data Right. Open-source. Inspired by Uber's metadata platform.
  • Apache Atlas - is an open-source metadata and big data governance framework which helps data users collaborate on their data assets. Open-source. Incubated by Hortonworks.
  • Amundsen - Open source data discovery and metadata engine. Created by Lyft.

Data Quality/Observability

  • Great Expectations - a platform for Data Quality.
    • Open-source - is a Python library that provides a framework for describing the acceptable state of data and then validating that the data meets those criteria.
    • Cloud (SaaS) -
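
The idea of describing the acceptable state of data and then validating against it can be sketched without any library. The example below is a plain-Python illustration of that pattern, not the Great Expectations API; the function names, result fields, and sample rows are hypothetical.

```python
def expect_not_null(rows, column):
    """Pass when no row has a missing value in the column."""
    failures = [r for r in rows if r.get(column) is None]
    return {
        "expectation": f"{column} is not null",
        "success": not failures,
        "unexpected_count": len(failures),
    }


def expect_between(rows, column, low, high):
    """Pass when every non-null value lies in the inclusive range [low, high]."""
    failures = [
        r for r in rows
        if r.get(column) is not None and not low <= r[column] <= high
    ]
    return {
        "expectation": f"{column} is between {low} and {high}",
        "success": not failures,
        "unexpected_count": len(failures),
    }


rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 130},
]

# Declare the expectations, then validate the data against all of them.
results = [
    expect_not_null(rows, "id"),
    expect_not_null(rows, "age"),
    expect_between(rows, "age", 0, 120),
]
all_passed = all(result["success"] for result in results)
```

Returning a structured result per expectation, rather than raising on the first failure, makes it possible to report every data quality problem in one run.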

Data Platforms

  • Databricks - Founded by the original creators of Apache Spark. Combines the data warehouse and the data lake (the "data lakehouse") into one platform. Unified. Open. Scalable. Try it free for 14 days. We suggest choosing AWS as the cloud platform.
  • Snowflake - The Data Cloud. Cloud-native Data Warehouse Platform. Consists of Cloud Services Layer, Compute Layer, and Data Storage Layer. Try it free for 30 days.

Data dumps

Work in progress

Community Contents

Getting Started with Data Engineering

Description:

This video provides an introduction to Data Engineering. In partnership with StudevPH, with guest speaker Josh Dev.

Link: https://www.facebook.com/studevph/videos/165090273259790


EP1 - Unlocking Your Future: Building a Career in PH Tech Startups

Description:

This video discusses building a career in PH Tech Startups.

  • In Partnership with Filipino Web Development Peers, Hosted by FWDP Founder, David Genesis Pedeglorio
  • Guest Speaker: Andoy Montiel, Chief Data Officer of Packworks

Link: https://youtu.be/pzxFTFB8f6s


Unlocking Career Opportunities: Philippine Skills Framework for AI and Analytics

Description:

This video discusses careers in Analytics in the Philippines.

  • Guest Speaker: Sherwin Pelayo; hosted by Doc Ligot

Link: https://www.youtube.com/watch?v=_CjsYi9ivlc


TechSync 2023: Synchronizing Filipino Tech Communities

Description:

A FREE online event featuring content creators and thought leaders in the tech field:

  • JP "Sir JP" Lazaro & Rhea Alum, StudevPH
  • Seiji Villafranca, Angular PH
  • David Genesis Pedeglorio & Renzo Marl Peralta, Filipino Web Development Peers
  • Josh "Josh Dev" Valdeleon, Data Engineering Pilipinas
  • Hosted by Kuya Dev and Doc Ligot

Link: https://www.facebook.com/watch/live/?ref=watch_permalink&v=1043806640102969

filWebDev Talks Ep. 10 ft. Myk Ogbinar of AI Network PH

Description:

This video provides an introduction to a career in data. In partnership with Filipino Web Development Peers and AI Network PH, with guest speaker Myk Ogbinar.

Link: https://www.facebook.com/fwdpeers/videos/220432324246191

Kamustahan: A Panel Discussion on Building a Career in Tech

Description:

A podcast-style session with Myk, Allan Aquino, JP Acuna, JP Lazaro, and Rod Basa (and a special post from Sir Nino from Learn with Jon) about navigating a tech career.

Link: https://www.youtube.com/watch?v=jcFALIHBSuQ

Exclusive Interview: Sandy Lauguico's Data Engineering Transition

Description:

In this exclusive interview, join me (Doc Ligot) as we explore the inspiring journey of Sandy Lauguico, who made a remarkable shift into the world of Data Engineering. Gain valuable insights, tips, and first-hand experiences as Sandy shares her transformation story. Discover the exciting world of data and engineering with us!

Link: https://www.youtube.com/watch?v=8pJMFi3kIfQ

From Dropout to Tech Star: Aemy Obinguar's Inspirational Tale

Description:

In this exclusive interview, join me (Doc Ligot) as I delve into the remarkable journey of Aemy Obinguar, who defied the odds by dropping out of school and ultimately rising to the position of Chief Technology Officer (CTO). Discover her insights, challenges, and successes in the tech world. Don't miss this inspiring story of determination and achievement.

Link: https://www.youtube.com/watch?v=GZcYyILg3kc

Posts from the community
