Improve Punchcard user onboarding

Birowsky opened this issue 4 years ago

I'm starting to migrate a little Serverless app that I have to Punchcard, and report my experience along the way. It's a small app, and based on how it goes, I'll be doing the same for a different one that I have in production with a shitload of resources.

Some of the things I notice might just require better documentation, some might end up being feature requests. Whatever it ends up being, I'd just love to find a way to make this tool be my base infrastructure builder.

I'm starting here with a couple of things I noticed about dynamo with Punchcard's API.

How do I declare secondary global indexes?
How is the pattern of multiple record types per table handled?
If there is something missing in the Punchcard abstractions, can we fall down to the basic CDK constructs, like it's the case with the Serverless Framework?

I'm trying to migrate this table:

    RestaurantTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: restaurant
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
          - AttributeName: byUserId
            AttributeType: S
          - AttributeName: status
            AttributeType: S
          - AttributeName: avgRating
            AttributeType: N
          - AttributeName: starCount
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
        GlobalSecondaryIndexes:
          - IndexName: groupByStarCount
            KeySchema:
              - AttributeName: starCount
                KeyType: HASH
              - AttributeName: avgRating
                KeyType: RANGE
            Projection:
              ProjectionType: ALL
          - IndexName: groupByOwner
            KeySchema:
              - AttributeName: byUserId
                KeyType: HASH
              - AttributeName: avgRating
                KeyType: RANGE
            Projection:
              ProjectionType: ALL
          - IndexName: groupByStatus
            KeySchema:
              - AttributeName: status
                KeyType: HASH
              - AttributeName: avgRating
                KeyType: RANGE
            Projection:
              ProjectionType: ALL
        BillingMode: PAY_PER_REQUEST

sam · Answer 1 · Tue Feb 04 2020 07:19:28 GMT+0800 (China Standard Time)

Hey! Thanks so much for helping to improve Punchcard. You're totally right about the on-boarding documentation - it has not received the attention it requires. I appreciate having a use-case to fulfill as a way of ensuring a great developer experience. I'll do my best to answer your questions and provide features where things are missing - I have already derived a bunch of work items from this issue that I'll treat as a high priority.

How do I declare secondary global indexes?

You'd currently have to "drop down" to the CDK layer to define indexes. It was possible once upon a time to do this in Punchcard but the rapid change eroded that feature. Tracking a Punchcard feature for this here: #104

For now, you could map into the Build context and define it with the CDK:

const table = new DynamoDB.Table(..); // punchcard DDB table
table.resource.map(table => {
  // now in the Build context, table is an @aws-cdk/aws-dynamodb.Table resource
  table.addGlobalSecondaryIndex(..);
});

The worst part of this experience will be when you want to depend on it, as you'll have to manually assign environment variables and IAM permissions by implementing Dependency:

export interface Dependency<D> {
  install: Build<Install>;
  bootstrap: Run<Bootstrap<D>>;
}
const myGsiDependency: Dependency<any> = {
  install: table.resource.map(table => (namespace, grantable) => {
    // at build-time, set up the runtime environment with information needed to use the index
    namespace.set('tableName', table.tableName);
    namespace.set('indexName', ...);
    table.grantReadData(grantable); // does this grant access to the index?
  }),
  bootstrap: Run.of(async (namespace, cache) => {
    // at runtime, use the namespace and cache to initialize the client
    const indexName = namespace.get('indexName');
    return new MyIndexClient(indexName);
  })
}

I wonder if we could simplify this mapping process? Maybe some combinators or builders that construct the above implementation? Tracking: #106
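
To make the install/bootstrap split above a bit more concrete, here is a minimal, self-contained sketch. It does not use Punchcard's real types: `Namespace`, `ToyDependency`, `IndexClient` and `gsiDependency` are hypothetical stand-ins, and the namespace is just a `Map` that is serialized between the two phases the way environment variables would be.

```typescript
// Hypothetical stand-ins for illustration only, not Punchcard's real types.
type Namespace = Map<string, string>;

interface ToyDependency<Client> {
  // Build time: record what the runtime will need.
  install(namespace: Namespace): void;
  // Run time: rebuild a client from what was recorded.
  bootstrap(namespace: Namespace): Promise<Client>;
}

interface IndexClient {
  tableName: string;
  indexName: string;
}

function gsiDependency(tableName: string, indexName: string): ToyDependency<IndexClient> {
  return {
    install(namespace) {
      namespace.set('tableName', tableName);
      namespace.set('indexName', indexName);
    },
    async bootstrap(namespace) {
      const table = namespace.get('tableName');
      const index = namespace.get('indexName');
      if (table === undefined || index === undefined) {
        throw new Error('install() did not run for this dependency');
      }
      return { tableName: table, indexName: index };
    },
  };
}

// Build time: fill the namespace, then serialize it (think: environment variables).
const dep = gsiDependency('restaurant', 'groupByStatus');
const buildNamespace: Namespace = new Map();
dep.install(buildNamespace);
const serialized = JSON.stringify(Array.from(buildNamespace.entries()));

// Run time: a fresh process only sees the serialized values.
const runtimeNamespace: Namespace = new Map(JSON.parse(serialized) as [string, string][]);
const clientPromise = dep.bootstrap(runtimeNamespace);
```

The point of the sketch is only that the two halves never share memory: whatever `bootstrap` needs has to be written down by `install`.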

How is the pattern of multiple record types per table handled?

Do you mean polymorphic tables where you store different record types in the same table that use the same key structure?

I think this would require a union type in Punchcard (tracking here: #105):

class A extends Record({
  key: string,
  type: string,
  // specific to A
  count: number
}) {}

class B extends Record({
  key: string,
  type: string,
  // specific to B
  label: string,
}) {}

const table = new DynamoDB.Table(stack, 'id', union(A, B), 'key')
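
For reference, the same idea in plain TypeScript, independent of Punchcard: a rough sketch of a discriminated union where the `type` attribute tells the record shapes apart. The names `RecordA`, `RecordB`, `TableItem` and `describeItem` are made up for this example.

```typescript
// Two record shapes sharing the same key structure; `type` is the discriminator.
interface RecordA { key: string; type: 'A'; count: number; }
interface RecordB { key: string; type: 'B'; label: string; }
type TableItem = RecordA | RecordB;

function describeItem(item: TableItem): string {
  switch (item.type) {
    case 'A':
      // Narrowed to RecordA: `count` is available here.
      return `${item.key}: count=${item.count}`;
    case 'B':
      // Narrowed to RecordB: `label` is available here.
      return `${item.key}: label=${item.label}`;
    default: {
      // Compile-time check that every member of the union is handled.
      const unreachable: never = item;
      throw new Error(`unknown item: ${JSON.stringify(unreachable)}`);
    }
  }
}

const items: TableItem[] = [
  { key: 'k1', type: 'A', count: 3 },
  { key: 'k2', type: 'B', label: 'hello' },
];
const descriptions = items.map(describeItem);
```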

A workaround would be to use any or unknown for a column and then store your variable data in there:

class A extends Record({
  key: string,
  variableData: any
}) {}

For expressions, you can cast the any field to a type and still gain the benefit of the type-safe DSLs:

await table.update('key', _ => [
  _.variableData.as(string).set('type-safe string value'),
]);

If there is something missing in the Punchcard abstractions, can we fall down to the basic CDK constructs, like it's the case with the Serverless Framework?

I briefly showed how to do this with the DynamoDB example above. Punchcard wraps all the CDK stuff in a lazily evaluated Build monad object. To "drop-down", you only need to "map into" that object. All "Punchcard resources" implement the Resource interface that exposes the underlying CDK layer:

export interface Resource<R> {
  resource: Build<R>;
}

I need to add substantial documentation to improve the understandability of this concept. Sorry about the on-boarding pain caused by this ...

Daniel Popeski · Answer 2 · Wed Feb 05 2020 10:25:22 GMT+0800 (China Standard Time)

First off, the delight is all mine! Especially knowing how famous and untouchable you are about to become : )

Aight, let’s get into it.

Thanx about the ‘dropping down to CDK’ guide. It deserves to be in the docs. But since I don’t know what exactly Build context means there, I have these extra questions:

Why do we (api consumers) need to provide a callback in order to mutate the resource instance? Isn’t the CDK build process completely synchronous? If this ends up being a dumb question, please just let me know what does ‘Build context’ mean here, and how it relates to the whole build mechanism that Punchcard does. This is something I’d love to see in the docs.
Some more specific build questions:
1. I saw a bunch of extra code inside the output app.js, what exactly is it?
2. Extra code directly affects lambda start-ups, are there steps we take to minimize it, or exclude whatever is not used from it? (ideally for each lambda separately)  Again, I know that the answers to these questions might be in the CDK docs. But I’d ideally like to have the same experience with Punchcard, as I had with the Serverless Framework: “use serverless-webpack to build your lambdas; enable ‘individually’ to produce the smallest possible assets.” I just loved how straightforward that was, not needing to understand anything extra about CloudFormation.   If, however, you prefer I better familiarize myself with the workings of CDK, so we may have more constructive conversation, do say so.

On to manually implementing dependencies.
True, the api here does seem a bit more intense. But since I still don’t understand the build/deploy process, I’m not really at liberty to comment on it just yet. Once I do acquire the core concepts, I’ll be happy to provide my input around it.

 Do you mean polymorphic tables where you store different record types in the same table that use the same key structure?

Yessir! Union types should work perfectly here.

sam · Answer 3 · Wed Feb 05 2020 19:42:05 GMT+0800 (China Standard Time)

Why do we (api consumers) need to provide a callback in order to mutate the resource instance? Isn’t the CDK build process completely synchronous? If this ends up being a dumb question, please just let me know what does ‘Build context’ mean here, and how it relates to the whole build mechanism that Punchcard does.

Yes, the CDK is synchronous, but Build has nothing to do with synchronous vs asynchronous. Build is one of two "contexts" in which a Punchcard application can be executed:

Build - executed during cdk synth. Includes expensive CDK code like zipping files and building docker images.
Run - executed at runtime, e.g. in a Lambda Function or Docker Container. Includes things like AWS SDKs and data serialization.

When you call context.map(table => ..) your callback will ONLY be evaluated in the respective context. The reason for this lazy callback is because running the CDK code at runtime would be hugely expensive and could potentially break things (e.g. creating things like Assets or Docker images would be impossible). For Punchcard to be useful it needs to support the entire CDK ecosystem, and the solution I came up with was lazily evaluated contexts (Monads?).
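
A toy version of such a lazily evaluated context might look like the sketch below. This is not Punchcard's `Build` class; `Lazy` and its methods are hypothetical and only illustrate that `map` records a step without running it, so a runtime that never resolves the chain never pays for it.

```typescript
// A toy lazily evaluated context; hypothetical, not Punchcard's Build class.
class Lazy<T> {
  private cached: { value: T } | undefined;

  constructor(private readonly thunk: () => T) {}

  static of<T>(value: T): Lazy<T> {
    return new Lazy(() => value);
  }

  // Describe a follow-up step without running anything yet.
  map<U>(f: (value: T) => U): Lazy<U> {
    return new Lazy(() => f(this.resolve()));
  }

  // Run the chain once and remember the result.
  resolve(): T {
    if (this.cached === undefined) {
      this.cached = { value: this.thunk() };
    }
    return this.cached.value;
  }
}

const log: string[] = [];

const table = new Lazy(() => {
  log.push('create table');
  return { indexes: [] as string[] };
});

const withIndex = table.map(t => {
  log.push('add index');
  t.indexes.push('groupByStatus');
  return t;
});

const callsBeforeResolve = log.length; // nothing has run yet
const built = withIndex.resolve();     // roughly what a synth step would trigger
```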

If you're interested, it was inspired by the IO Monad - see: A gentle introduction to Haskell: IO and Scala's Cats Effect IO

Previous discussions go into deeper detail: #54 and #53

I saw a bunch of extra code inside the output app.js, what exactly is it?

Can you provide an example? AFAIK, I'm not doing anything to influence that.

Extra code directly affects lambda start-ups, are there steps we take to minimize it, or exclude whatever is not used from it? (ideally for each lambda separately)  Again, I know that the answers to these questions might be in the CDK docs. But I’d ideally like to have the same experience with Punchcard, as I had with the Serverless Framework: “use serverless-webpack to build your lambdas; enable ‘individually’ to produce the smallest possible assets.” I just loved how straightforward that was, not needing to understand anything extra about CloudFormation.   If, however, you prefer I better familiarize myself with the workings of CDK, so we may have more constructive conversation, do say so.

This isn't specific to the CDK. Punchcard is currently tightly coupled to webpack and it might be a problem - when you synth your app, Punchcard runs webpack for you to create a small bundle and S3 asset which the CDK then deploys to AWS Lambda. #100 is tracking an idea to de-couple Punchcard from webpack and leave it up to developers. Developers would be free to use tools like serverless-webpack.

Seems like this is what you're advocating for?

Some questions:

Can the experience be as seamless as it is now, where developers simply compile their code and run cdk deploy, and webpack is run automatically?
How to support different bundle configurations for different runtime environments - e.g. you may not want to use webpack if you're deploying to Docker? Scaling this could be a problem since Punchcard's high-level abstraction can create a lot of resources quickly. Perhaps a bundling configuration per environment: Lambda, ECS and EC2?
Can we eliminate CDK code from the runtime bundle entirely? I've been dreaming of achieving this but I'm not sure how to yet. It's a problem of having a dependency relationship to the CDK instead of a devDependency. Build makes it so we at least don't run CDK code at runtime, but it's still imported ...

sam · Answer 4 · Fri Feb 07 2020 20:33:06 GMT+0800 (China Standard Time)

In the upcoming version (v0.13.0), you'll be able to do the following (see #108 for details):

class RestaurantData extends Record({
  id: string,
  byUserId: string,
  status: string,
  avgRating: number,
  starCount: string
}) {}

const RestaurantTable = new DynamoDB.Table(stack, 'RestaurantTable', {
  data: RestaurantData,
  key: {
    partition: 'id'
  }
}, Build.of({
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST
}));

// keys below follow the GlobalSecondaryIndexes of the original table definition
const groupByStarCount = RestaurantTable.globalIndex({
  indexName: 'groupByStarCount',
  key: {
    partition: 'starCount',
    sort: 'avgRating'
  }
});

const groupByOwner = RestaurantTable.globalIndex({
  indexName: 'groupByOwner',
  key: {
    partition: 'byUserId',
    sort: 'avgRating'
  }
});

const groupByStatus = RestaurantTable.globalIndex({
  indexName: 'groupByStatus',
  key: {
    partition: 'status',
    sort: 'avgRating'
  }
});

Daniel Popeski · Answer 5 · Sat Feb 08 2020 05:51:19 GMT+0800 (China Standard Time)

Doing some traveling these days. I'll get back to you asap.

sam · Answer 6 · Sat Feb 08 2020 18:26:25 GMT+0800 (China Standard Time)

Enjoy your travels! :)

Daniel Popeski · Answer 7 · Tue Feb 11 2020 22:24:02 GMT+0800 (China Standard Time)

Hello hello!
Finally settled in Lisbon. Seems like the food is gonna kill me here. At least I'll be happy.

Build vs run contexts: what I understood is that the whole code is bundled together and a part of it is run during build, and the other part is run as the lambdas are being called. But there's also a part running for both contexts. I guess I was a bit naive to think that the infrastructure code would be separated from the lambdas execution code itself 😊. So what's the penalty for this approach? How much execution overhead is there when running the lambda container for the first time? (I suppose there's only first time execution overhead?) Also, is there overhead in the lambda bundle size? I can see that the compiled app.js in the example repo is 1.6MB which is not too bad, but does it grow in any significant way?
Building mechanism: You suggest running tsc before running cdk. But I’m quite uncomfortable having all the build artifacts within my main codebase. I can mitigate this by introducing Webpack to my project, which would bundle everything inside a build output directory from which I would run cdk. Do you see any issue with this approach?  
Bundling lambdas independently: I tried building two lambdas with hopes of them being bundled and tree-shaken independently:

Lambda.schedule(stack, 'MyFunction1', {
  schedule: Schedule.rate(cdk.Duration.minutes(1)),
}, () => Promise.resolve('Hello world 1'));

Lambda.schedule(stack, 'MyFunction2', {
  schedule: Schedule.rate(cdk.Duration.minutes(1)),
}, () => Promise.resolve('Hello world 2'));

But the output looks the same as if there was just one lambda:

Which makes me think that the same bundle is pushed for every lambda, correct? I’m quite wary of how lambda size influences cold starts. Some lambdas might have big-ass dependencies like image-processing or browser rendering packages, which we expect to be running slowly. But then there are the lean API-layer lambdas, which do not depend on those big packages, and should run and respond as fast as possible.

Thanx for adding the Dynamo features!

As you might see, I'm trying to get comfortable with the build process before I focus on the Punchcard API.

Thanx!

sam · Answer 8 · Fri Feb 21 2020 17:46:57 GMT+0800 (China Standard Time)

Sorry, been really busy with a hard problem. I'm trying to build a DSL for Step Functions and API Gateway and it's been really challenging.

a part running for both contexts

Yes, this is the static part of the application. It is what is created in memory by requiring/importing the application's index file. It should instantiate the whole tree with Build and Run contexts "hanging" off it. It's a skeleton of the application. I refer to it as the Static scope.

Executing the application then becomes either:

a traversal of the tree and execution of all Build contexts within it - we do this when we want to instantiate the CDK construct tree and synthesize a Cloud Assembly.
lookup a runtime entrypoint by id, jump to it, and evaluate its Run context. Run contexts are different to Build. With Build, we want to execute all possible branches, but with Run, we only want to execute code paths required by an individual entrypoint.
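
The two ways of executing the skeleton can be sketched roughly as follows. Everything here is hypothetical (`Entrypoint`, `entrypoint`, `synth`, `start`); it only illustrates "Build visits every branch, Run evaluates one branch found by id", not how Punchcard actually stores its tree.

```typescript
// Hypothetical skeleton node: every entrypoint carries a build step and a run step.
interface Entrypoint {
  id: string;
  build: () => string;        // e.g. declare a function in the construct tree
  run: () => Promise<string>; // e.g. initialize clients and produce a handler
}

const executed: string[] = [];

function entrypoint(id: string): Entrypoint {
  return {
    id,
    build: () => {
      executed.push(`build:${id}`);
      return `resource for ${id}`;
    },
    run: async () => {
      executed.push(`run:${id}`);
      return `handler for ${id}`;
    },
  };
}

// The "static scope": created just by importing the app, nothing executed yet.
const app: Entrypoint[] = [entrypoint('MyFunction1'), entrypoint('MyFunction2')];

// Build: visit every node.
function synth(nodes: Entrypoint[]): string[] {
  return nodes.map(node => node.build());
}

// Run: find one entrypoint by id and evaluate only that branch.
function start(nodes: Entrypoint[], id: string): Promise<string> {
  const node = nodes.find(n => n.id === id);
  if (node === undefined) {
    return Promise.reject(new Error(`no entrypoint named ${id}`));
  }
  return node.run();
}
```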

Check out how Lambda.Function stores the entrypoint:
https://github.com/punchcard/punchcard/blob/c420ac50dc05946a9accc3077c7c985c958afc43/packages/punchcard/lib/lambda/function.ts#L77-L80

This is the root of a Run tree that will be evaluated by the Lambda Function and return a promise to a function handler. Basically, a Run context is an asynchronous bootstrap procedure that is run once per execution container. It gives dependencies the option to perform asynchronous operations on startup.
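
A rough sketch of "bootstrap once per container, then reuse the handler" is below. It is a generic pattern and not the code behind the link above; `bootstrap`, `handlerPromise` and `handler` are names invented for the example.

```typescript
let bootstrapCount = 0;

type Handle = (event: string) => Promise<string>;

// Hypothetical async startup work (e.g. creating SDK clients).
async function bootstrap(): Promise<Handle> {
  bootstrapCount += 1;
  const prefix = 'Hello';
  return async event => `${prefix} ${event}`;
}

// Cached at module scope, so every invocation in the same container shares it.
let handlerPromise: Promise<Handle> | undefined;

async function handler(event: string): Promise<string> {
  if (handlerPromise === undefined) {
    // First invocation in this container: start the bootstrap.
    handlerPromise = bootstrap();
  }
  const handle = await handlerPromise;
  return handle(event);
}
```

Caching the promise rather than the resolved value means two invocations that overlap during startup still share a single bootstrap.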

Do you see any issue with this approach?

Yeah I have a problem with it. Webpack is an OK workaround but it's far from ideal. By merging infrastructure and runtime code, it forces your runtime archive to include some build-time archives. I want a better solution but I'm not sure what to do at this time. 1-2 MB is OK for an archive size, but the worst problem I've encountered is where the memory usage is double (unable to support 128MB functions) when deploying with webpack --development. Webpack destroyssss your stack traces without it, so you really want the map files in production. So we need to do better!

Some ideas:

Can we use dev dependencies in a clever way to strip them out? The problem is that if a module requires a cdk module, then the app will crash.
Use webpack define? https://webpack.js.org/plugins/define-plugin/.
Is rollup useful? https://rollupjs.org/guide/en/
What about babel? https://babeljs.io/.
It's a long shot, but perhaps a TSC compiler plugin could be used to write a runtime and build-time version of the app?
Pulumi's Closure Serializer might be really useful (https://github.com/pulumi/pulumi/blob/master/sdk/nodejs/runtime/closure/createClosure.ts#L217). They came up with a nice way using the TSC compiler to walk the stack at runtime and serialize a tiny closure containing a function and its entire lexical scope. I've tried integrating it but ran into lots of limitations, especially around this scope.

I tried with Build to formally define the relationship between Punchcard and the CDK code with hopes that it will help us define some heuristics with one of the above methods to produce a much smaller archive for runtime. Build and Run are our greatest allies for performing tree-shaking as they unambiguously separate the two domains. I once tried a webpack plugin to remove import statements for packages that match a regex @aws-cdk/* but had problems when running it, but I think there's potential there - archive was down to 60-300KB.
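
As a starting point for anyone experimenting, the matching rule such a plugin would need could look like the sketch below. It assumes everything under the `@aws-cdk` scope is build-time only; `buildOnly` and `isBuildOnlyImport` are hypothetical names and this is not the plugin mentioned above.

```typescript
// Assumed rule: anything under the @aws-cdk scope is build-time only.
const buildOnly = /^@aws-cdk\//;

function isBuildOnlyImport(request: string): boolean {
  return buildOnly.test(request);
}

// Example module requests a bundler might see while walking the app.
const requests = ['@aws-cdk/aws-dynamodb', '@aws-cdk/core', 'aws-sdk', 'punchcard'];
const kept = requests.filter(r => !isBuildOnlyImport(r));

// A pattern like this could be handed to webpack's IgnorePlugin via its
// `resourceRegExp` option. Ignored modules fail when actually required,
// so it only helps if the runtime never evaluates code that imports them.
```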

I'd love it if someone who knew more about bundling could experiment and see how small the archive can get while maintaining good stack traces for logging.

Which makes me think that the same bundle is pushed for every lambda, correct? I’m quite wary of how lambda size influences cold starts. Some lambdas might have big-ass dependencies like image-processing or browser rendering packages, which we expect to be running slowly. But then there are the lean API-layer lambdas, which do not depend on those big packages, and should run and respond as fast as possible.

I totally agree :)

sam · Answer 9 · Fri Feb 21 2020 17:48:15 GMT+0800 (China Standard Time)

Oh, and congrats on the move to Lisbon! :) Thanks again for providing such useful feedback! Sorry that I sometimes take a while to respond, I need to do better at that.

Daniel Popeski · Answer 10 · Sun Feb 23 2020 22:13:00 GMT+0800 (China Standard Time)

“Sorry that I sometimes take a while to respond”

Absolutely nothing to worry about. I’m delighted to have my little part in your process here.

Thanx a ton for your explication! I really hope you grab the attention of somebody with the relevant expertise.

FWIW, I started getting familiar with CDK, so at least I’ll be able to follow the upcoming discussions :}

Thanx!