webuniverseio / Devhub-Meetup-How-to-build-your-own-data-acquisition-framework-in-Typescript

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Devhub Meetup Notes: How to build your own data acquisition framework in Typescript

Data pipeline:

Events layer (event production, between user and system, click on a button, etc...) ->
Data streaming layer (kafka, pub-sub system, there is a producer, there are consumers, uses multiple topics) ->
Data extraction layer (consumer of kafka, which later puts data in specified places) ->
Storage layer (any storage system like hadoop, google cloud storage) ->
Data delivery (presto db for distributed query system) ->
UI layer (e.g. Mode, Jupyter, data scientist interactions start here)

Philosophy of data collection

collect everything

  • tons of data
  • never miss out on anything

vs

collect intentionally

  • only capture what you intend

Data acquisition layer

shopify built a Monorail which ensures a contract between producer of events and consumer of data
Monorail is doing that by using schemas
Schema might define click on a button event which has information about a click as well as user_id
Format might look as following (except in real life it is in yaml format, not json)

{
  name, doc: 'Click schema for shopify_admin', verson, format: 'json', 
  fields: [{
    name, type, optional: bool, doc, privacy_handler /*for user privacy policy, used to scrub personal data - for example for GDPR when user requests to delete all data*/
  }]
} 

Advantages

  • data s attaches contexts, which makes it easy to understand data
  • data consistency
  • registry of all metrics
  • out of the box privacy handling
  • versioning of events
  • bad data is caught early

Shopify built Monorail for multiple languages - ruby, php, js, python

Pipeline looks like this:
Client -> api call to Monorail -> kafka layer -> other layers mentioned in data pipeline above

In order to make sure that data being sent has correct types Monorail is using Typescript

Data scientist specifies schemas, commits them to a repo, where they are automatically transformed into typescript interfaces.

Function that uses interfaces is looking like following Monorail.produce('shopify_admin_click/1.0', user_id, ...)

Monorail takes advantage of discriminated unions (type Shape = Circle | Rectangle), so for Monorail.produce interface looks like this type MonorailType = AdminClickTypeGeneratedFromSchema | SomeOtherTypeGeneratedFromSchema | ... all interfaces from schemas

So interfaces are 'united' into MonorailType custom type which later is used to verify that event data is in correct format at the time when code is written.

About