Amazon Managed Service for Apache Flink (formerly Amazon Kinesis Data Analytics) – Benchmarking Utility
🚨 August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink.
The Amazon Kinesis Data Analytics Flink Benchmarking Utility helps with capacity planning, integration testing, and benchmarking of Kinesis Data Analytics for Apache Flink applications. With this utility, you can generate sample data and write it to one or more Kinesis data streams based on the requirements of your Flink applications. The utility is used in conjunction with a Flink application that has a Kinesis data stream as its source and one of the supported sinks, for example Amazon S3.
Capacity planning, integration testing, and benchmarking of Flink applications generally involve a lot of work. This utility lets you define a data format, then generate sample data and write it to a Kinesis data stream. Using this Kinesis data stream as a source, you create a Kinesis Data Analytics Flink application and perform the necessary testing. The format used and the data generated are compatible with the Flink application's business logic. You define the benchmarking specifications based on your capacity or load testing requirements.
Together with the Amazon Kinesis Data Analytics Flink Starter Kit, this utility provides a complete example.
Contents:
- Architecture
- Detailed Architecture
- Application Overview
- Build Instructions
- Deployment Instructions
- Appendix
The diagram below represents the architecture of this utility.
The diagram below represents the detailed architecture.
In the diagram, each time the Kinesis data producer job is invoked, it runs one or more child jobs based on the specifications provided in a JSON file. Each child job has the following characteristics:
- It generates sample records and writes them to a Kinesis data stream. These records are consumed by a Flink application.
- Records are generated based on a pre-defined record format.
- Records are randomized based on a number of unique identifiers.
- Records are generated in batches of a configurable size.
- Batches are written to the Kinesis stream at a configurable cadence in seconds.
- Each child job terminates gracefully once it completes writing data to the Kinesis stream.
The Benchmarking Specifications section explains this process in detail.
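The child-job behavior described above can be sketched in a few lines of Python. This is a minimal illustration only, not the utility's Java implementation: the record fields (`session_id`, `event_time`, `payload`), the function names, and the `write_batch` callback are all hypothetical, and no Kinesis call is made.

```python
import random
import time
from typing import Callable, Dict, List


def make_batch(session_ids: List[str], batch_size: int, rng: random.Random) -> List[Dict]:
    """Build one batch of sample records, each tagged with a randomly chosen session id."""
    return [
        {
            "session_id": rng.choice(session_ids),  # hypothetical field name
            "event_time": time.time(),              # hypothetical field name
            "payload": f"event-{i}",                # hypothetical field name
        }
        for i in range(batch_size)
    ]


def run_child_job(
    number_of_interactions: int,
    batch_size: int,
    batch_cadence_seconds: float,
    number_of_batches: int,
    write_batch: Callable[[List[Dict]], None],
    seed: int = 0,
) -> int:
    """Write number_of_batches batches, pausing batch_cadence_seconds between them.

    Returns the total number of records handed to write_batch.
    """
    rng = random.Random(seed)
    # Fixed pool of unique identifiers that the records are randomized over.
    session_ids = [f"session-{n}" for n in range(number_of_interactions)]
    total = 0
    for batch_number in range(number_of_batches):
        batch = make_batch(session_ids, batch_size, rng)
        write_batch(batch)  # stand-in for a write to a Kinesis data stream
        total += len(batch)
        if batch_number < number_of_batches - 1:
            time.sleep(batch_cadence_seconds)
    return total  # the job ends once every batch has been written
```

Passing `write_batch` as a callback keeps the sketch testable without AWS credentials; a real producer would replace it with a call that puts the records on the stream.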
- JDK 8
- An IDE, for example Eclipse, Spring Tools, or IntelliJ IDEA
- Apache Maven
- AWS CLI
The following AWS services are required for this utility:
- 1 Amazon EC2 Instance
- DynamoDB Local
- 1 Amazon Kinesis Data Stream
- 1 IAM role for EC2 instance
- 1 EC2 key pair to log on to the EC2 instance
This utility requires you to pass the benchmarking specifications (in other words, your load testing requirements) in a JSON file that follows the format defined in the sample benchmarking_specs.json.
The table below will help you define the specifications:
Property | Type | Purpose |
---|---|---|
jobName | String | The name of the Benchmarking Job |
jobDurationInMinutes | String | The duration of the job e.g. 65 minutes |
region | String | The AWS region where the target Kinesis Stream(s) exist |
targetKinesisStreams | Array | Names of target Kinesis Streams. This utility writes sample data to one or more configured streams. |
isUsingDynamoDBLocal | boolean | When DynamoDB Local is used for status tracking this attribute is set to true. When it is set to false, it will use Amazon DynamoDB web service. |
dynamoDBLocalURI | String | The URI for DynamoDB Local |
parentJobSummaryDDBTableName | String | The name of the DynamoDB Table for Parent Job Summary |
childJobSummaryDDBTableName | String | The name of the DynamoDB Table for Child Job Summary |
childJobs | Array | The list of Child Jobs to run part of the utility |
Property | Type | Purpose |
---|---|---|
jobName | String | The name of child job |
numberofInteractions | Integer | Number of unique session ids |
batchSize | Integer | The size of the batch |
batchCadence | Integer | The batch frequency in seconds |
numberofBatches | Integer | Number of batches |
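To get a feel for how the child-job properties translate into load, the following Python sketch builds a specification using the property names from the two tables above and derives a few sizing figures. Every value is a made-up placeholder, the layout of the actual sample benchmarking_specs.json may differ, and the arithmetic assumes that batchSize is counted in records and that one batch is written per cadence interval.

```python
import json

# Property names come from the tables above; all values are hypothetical placeholders.
spec = {
    "jobName": "example_benchmark",
    "jobDurationInMinutes": "65",
    "region": "us-east-1",
    "targetKinesisStreams": ["example-stream"],
    "isUsingDynamoDBLocal": True,
    "dynamoDBLocalURI": "http://localhost:8000",
    "parentJobSummaryDDBTableName": "kda_flink_benchmarking_parent_job_summary",
    "childJobSummaryDDBTableName": "kda_flink_benchmarking_child_job_summary",
    "childJobs": [
        {
            "jobName": "child_1",
            "numberofInteractions": 50,
            "batchSize": 100,
            "batchCadence": 5,
            "numberofBatches": 12,
        },
    ],
}


def summarize(spec: dict) -> list:
    """Derive rough load figures for each child job (assumptions stated above)."""
    rows = []
    for child in spec["childJobs"]:
        rows.append(
            {
                "jobName": child["jobName"],
                # Records written over the whole child job.
                "totalRecords": child["batchSize"] * child["numberofBatches"],
                # Approximate wall-clock time: one batch per cadence interval.
                "approxSeconds": child["batchCadence"] * child["numberofBatches"],
                # Average write rate the target stream has to absorb.
                "recordsPerSecond": child["batchSize"] / child["batchCadence"],
            }
        )
    return rows


if __name__ == "__main__":
    print(json.dumps(summarize(spec), indent=2))
```

A summary like this can help when checking that the requested rate fits within the write capacity of the target stream before a run is scheduled.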
Class | Purpose |
---|---|
BenchmarkScheduler | Entry point and heart of the benchmarking utility. Its main algorithm is built on the open-source Quartz Job Scheduling Library. It schedules one or more Kinesis producer jobs based on the benchmarking specifications. |
KinesisProducerForFlinkSessionWindow | Contains the business logic to write sample records to a Kinesis stream. The sample records are compatible with a Flink application that implements a session window. This class also tracks its own progress in DynamoDB tables. It implements the Job interface from the Quartz Scheduler. |
KDSProducerUtil | Utility class with methods used by KinesisProducerForFlinkSessionWindow class. |
KinesisStreamUtil | Utility class with business logic to work with Kinesis Data Stream. |
DDBUtil | Utility class with business logic to write / update items (records) to DynamoDB tables. |
- Clone this repository to your laptop.
- The project has the Maven nature, so you can import it into your IDE.
- Build the jar file using one of the steps below:
  - Using standalone Maven, go to the project home directory and run the command
mvn -X clean install
  - From Eclipse or STS, run the command
-X clean install
Navigation: right-click the project --> Run As --> Maven Build (Option 4)
- The build process generates a jar file
amazon-kinesis-data-analytics-flink-benchmarking-utility-0.1.jar
Note: The size of the jar file is around 20 MB.
- Create an IAM role for the EC2 instance. It needs the two policies below:
  - Policy with write permissions for the one or more Kinesis streams configured as targets for this utility
  - Policy with write permissions for the DynamoDB tables used by this utility
  - Note: For more details on this topic, refer to the Amazon EC2 documentation.
- Launch an EC2 instance with the IAM role
- Note the private IP address of your EC2 instance
- Log on to the EC2 instance using the command
ssh -i my_ec2_keypair.pem ec2-user@IP_Address
- Run
sudo yum update -y
- Install OpenJDK 8 using the command
sudo yum -y install java-1.8.0-openjdk.x86_64
- Check the Java version using the command
java -version
Sample output:
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
- Create a folder for the data generator application binary. Use the command
mkdir kda-flink-benchmarking-utility
- Create a folder for DynamoDB Local. Use the command
mkdir dynamodb_local
- Go to the folder kda-flink-benchmarking-utility and create a folder for logging. Use the command
mkdir logs
- Go to the DynamoDB Local folder
cd dynamodb_local/
- Download the DynamoDB Local binary
curl https://s3.us-west-2.amazonaws.com/dynamodb-local/dynamodb_local_latest.zip --output dynamodb_local_latest.zip
- Unzip the file
unzip dynamodb_local_latest.zip
- Start DynamoDB Local
nohup java -jar DynamoDBLocal.jar -sharedDb &
- Check the status of DynamoDB Local in nohup.out as follows:
[ec2-user@ip-X-X-X-X ~]$ cat nohup.out
Initializing DynamoDB Local with the following configuration:
Port: 8000
InMemory: false
DbPath: null
SharedDb: true
shouldDelayTransientStatuses: false
CorsParams: *
- At any time, check the status of DynamoDB Local using the command
ps -ef
Sample output:
ec2-user 13995 1 0 Sep09 ? 00:12:54 java -jar DynamoDBLocal.jar -sharedDb
- Return to your MacBook or laptop.
- In src/main/resources/benchmarking_specs.json, update the targetKinesisStreams array to the Kinesis stream(s) that you want data written to, and region to the region the stream(s) exist in.
- In src/main/resources/create_table_kinesis_stream.json, update TableName to match the Kinesis stream you want data written to.
- If you have more than one stream to write to, duplicate src/main/resources/create_table_kinesis_stream.json for each stream, changing TableName in each file accordingly.
- Copy the Kinesis Data Generator binaries to the EC2 instance. Note: The steps below are relevant for the SCP (secure copy) tool on a MacBook.
  - Copy the jar file to the EC2 instance
scp -i my_ec2_keypair.pem <path_to_your_ide_workspace>/Amazon-kda-flink-benchmarking-utility/target/amazon-kinesis-data-analytics-flink-benchmarking-utility-0.1.jar ec2-user@IP_Address:/home/ec2-user/kda-flink-benchmarking-utility/
  - Copy the benchmarking specifications JSON to the EC2 instance
scp -i my_ec2_keypair.pem <path_to_your_ide_workspace>/Amazon-kda-flink-benchmarking-utility/src/main/resources/benchmarking_specs.json ec2-user@IP_Address:/home/ec2-user/kda-flink-benchmarking-utility/
  - Copy the DynamoDB table JSON files to the EC2 instance
scp -i my_ec2_keypair.pem <path_to_your_ide_workspace>/Amazon-kda-flink-benchmarking-utility/src/main/resources/create_table_child_job_summary.json ec2-user@IP_Address:/home/ec2-user/kda-flink-benchmarking-utility/
scp -i my_ec2_keypair.pem <path_to_your_ide_workspace>/Amazon-kda-flink-benchmarking-utility/src/main/resources/create_table_parent_job_summary.json ec2-user@IP_Address:/home/ec2-user/kda-flink-benchmarking-utility/
  - Copy the DynamoDB table JSON files for the Kinesis streams to the EC2 instance
scp -i my_ec2_keypair.pem <path_to_your_ide_workspace>/Amazon-kda-flink-benchmarking-utility/src/main/resources/create_table_kinesis_stream.json ec2-user@IP_Address:/home/ec2-user/kda-flink-benchmarking-utility/
(Repeat for all duplicates created if you're writing to multiple streams)
  - Copy the Bash script to the EC2 instance
scp -i my_ec2_keypair.pem <path_to_your_ide_workspace>/Amazon-kda-flink-benchmarking-utility/src/main/resources/amazon-kda-flink-benchmarking-utility.sh ec2-user@IP_Address:/home/ec2-user/kda-flink-benchmarking-utility/
  - Note: To use PuTTY, refer to Connecting to Your Linux Instance from Windows Using PuTTY
- From the EC2 instance, while in the folder kda-flink-benchmarking-utility, run the commands below to create the DynamoDB tables
aws dynamodb create-table --cli-input-json file://create_table_parent_job_summary.json --region us-east-1 --endpoint-url http://localhost:8000
aws dynamodb create-table --cli-input-json file://create_table_child_job_summary.json --region us-east-1 --endpoint-url http://localhost:8000
aws dynamodb create-table --cli-input-json file://create_table_kinesis_stream.json --region us-east-1 --endpoint-url http://localhost:8000
(Repeat for all files if you're writing to multiple streams)
- Check the tables by running the command below
aws dynamodb list-tables --region us-east-1 --endpoint-url http://localhost:8000
- Expected output
{ "TableNames": [ "kda_flink_benchmarking_child_job_summary", "kda_flink_benchmarking_parent_job_summary", "<kinesis-stream(s)>" ] }
- Check the status of cron on the EC2 instance using the command
service crond status
You will get output similar to the following:
Redirecting to /bin/systemctl status crond.service
● crond.service - Command Scheduler
Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-09-07 09:50:49 UTC; 3 days ago
- Open the crontab using the command
crontab -e
- Enter the following line
30 * * * * /bin/bash /home/ec2-user/kda-flink-benchmarking-utility/amazon-kda-flink-benchmarking-utility.sh
This will run the data generator as a cron job every hour, at 30 minutes past the hour. Once the job starts for the first time, you should see incoming data on your Kinesis stream(s).
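To see when the first run will start after the crontab entry is saved, the schedule `30 * * * *` can be worked out by hand or with a small helper. The Python sketch below is illustrative only; the function name `next_run` is hypothetical and it handles just this one pattern (a fixed minute, every hour), not cron syntax in general.

```python
from datetime import datetime, timedelta


def next_run(now: datetime, minute: int = 30) -> datetime:
    """Return the next time strictly after `now` whose minute field equals `minute`."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's slot has already passed, so move to the next hour.
        candidate += timedelta(hours=1)
    return candidate


if __name__ == "__main__":
    # Show the next three scheduled starts from the current time.
    t = datetime.now()
    for _ in range(3):
        t = next_run(t)
        print(t.isoformat(timespec="minutes"))
```

For example, an entry saved at 10:45 first fires at 11:30, so data may not appear on the stream for up to an hour after setup.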
To use the Amazon DynamoDB web service instead of DynamoDB Local, follow the instructions below:
- In benchmarking_specs.json, set "isUsingDynamoDBLocal" to false. You can leave the default value of the attribute "dynamoDBLocalURI" as is, or set it to "None".
- Run the commands below to create the tables in the Amazon DynamoDB web service:
aws dynamodb create-table --cli-input-json file://create_table_kinesis_stream.json --region us-east-1
aws dynamodb create-table --cli-input-json file://create_table_parent_job_summary.json --region us-east-1
aws dynamodb create-table --cli-input-json file://create_table_child_job_summary.json --region us-east-1
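The only difference between the create-table commands for DynamoDB Local and for the web service is the `--endpoint-url` argument. The Python sketch below makes that explicit by assembling the argument list from the specification. The helper name `create_table_command` is hypothetical, and the sketch assumes that `dynamoDBLocalURI` holds a URL such as http://localhost:8000; it only builds the command and does not run it.

```python
from typing import Dict, List


def create_table_command(spec: Dict, table_json: str, region: str) -> List[str]:
    """Build the AWS CLI argument list for creating one table.

    The local endpoint is added only when the specification selects DynamoDB Local.
    """
    cmd = [
        "aws", "dynamodb", "create-table",
        "--cli-input-json", f"file://{table_json}",
        "--region", region,
    ]
    if spec["isUsingDynamoDBLocal"]:
        # Assumption: dynamoDBLocalURI is a URL the CLI can use as an endpoint.
        cmd += ["--endpoint-url", spec["dynamoDBLocalURI"]]
    return cmd


if __name__ == "__main__":
    local_spec = {"isUsingDynamoDBLocal": True, "dynamoDBLocalURI": "http://localhost:8000"}
    web_spec = {"isUsingDynamoDBLocal": False, "dynamoDBLocalURI": "None"}
    for spec in (local_spec, web_spec):
        print(" ".join(create_table_command(spec, "create_table_parent_job_summary.json", "us-east-1")))
```

Building the command as a list rather than a single string means it could be passed to `subprocess.run` without shell quoting concerns, if you choose to execute it from a script.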
- Support Amazon ECS to host the solution
- Support additional scheduling methods
- Generate test data suitable for Tumbling Windows and Sliding Windows
- Ability to generate test data with timestamp information. This will be useful for Flink applications configured to use Event Time or Processing Time. For more details, refer to the Apache Flink documentation.
This sample code is made available under the MIT-0 license. See the LICENSE file.