nelhage / llama

EC2 + smurfs for cost optimization within a runtime budget?

chadbrewbaker opened this issue · comments

To hit the cost sweet spot within a runtime budget, it might make sense to launch an EC2 Spot/Batch/Fargate/CodeBuild host, then smurf the more costly Lambda workers onto the highly parallel sections. For this to work you need per-architecture execution times for each artifact, plus localhost/network latency and localhost/network bandwidth.
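A rough sketch of that trade-off, under loud assumptions: the work splits into a serial part that stays on the EC2 host and a perfectly parallel part that can be fanned out, each Lambda worker pays a fixed startup overhead, and the rates are placeholders rather than real AWS prices. The name `cheapest_plan` and all of its parameters are hypothetical. Latency and bandwidth are ignored here.

```python
def cheapest_plan(serial_s, parallel_s, budget_s,
                  ec2_per_s, lambda_per_s, overhead_s=0.0, max_workers=1000):
    """Pick the Lambda fan-out with the lowest cost that fits the budget.

    serial_s   : seconds of work that must run on the EC2 host
    parallel_s : total seconds of perfectly parallel work
    budget_s   : wall-clock limit for the whole build
    overhead_s : per-worker startup seconds (billed, and added to wall time)
    Returns (workers, wall_s, cost), or None if nothing fits the budget.
    """
    best = None
    for workers in range(max_workers + 1):
        if workers == 0:
            # No fan-out: the host runs the parallel part by itself.
            par_wall = parallel_s
            lambda_cost = 0.0
        else:
            par_wall = parallel_s / workers + overhead_s
            lambda_cost = (parallel_s + workers * overhead_s) * lambda_per_s
        wall = serial_s + par_wall
        if wall > budget_s:
            continue
        # The host is billed for the whole wall-clock duration.
        cost = wall * ec2_per_s + lambda_cost
        if best is None or cost < best[2]:
            best = (workers, wall, cost)
    return best
```

With a loose budget this picks zero Lambda workers, and as the budget tightens the fan-out grows until the per-worker overhead outweighs the saved host time.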

Mind dump of questions:

  • How do you get network/localhost profiles for latency/bandwidth?
    sudo ping -U -q -i 0 -s 18 -w 10 IP_ADDRESS
  • Would Monte Carlo simulation help show where you get value by speculatively executing more than one worker at once on the same task?

  • Is there a good way to take the Ninja file or Makefile and emit the task graph for processing?

  • How much work is it to get task graph dumps of Rust Cargo builds?

  • How much can you get out of kernel tuning EC2? Again, see . He left on the table a PGO/BOLT build of the kernel itself.

  • Can the EC2 instance use sqlite, /dev/zram, or /dev/shm to get faster than localhost SSD IO? Does it make sense to blast all the code into a RAM filesystem before compiling on EC2 instances? Do the EC2 boxen even need attached storage?

  • Can you hack Clang/LLVM to memory-dump files for later use so they don't have to be serialized/deserialized? Could you also hack Clang/LLVM to work in batch mode, like SMT solvers that push and pop state, to avoid paying process startup overhead on every invocation?

  • What is the boot cost of EC2 Spot/Batch/Fargate/CodeBuild? That is, the seconds you waste plus the IO overhead cost of transferring the AMI/container.

  • When is there an S3 cost/latency win from using zstd dictionary compression on a file at rest on S3? Would training several dictionaries help, or is one dictionary for the whole codebase best?

  • For container based EC2 runs, how much do you save by stripping the AMI/container image down to a minimal size? Are AMIs or containers more cost effective?

  • Can it help to write larger files and then read them sparsely using HTTP range queries?

  • How much do you save doing PGO/BOLT/LTO on llamacc binaries? Can you lazy-load from S3 with HTTP range queries so the instance only fetches the clang/llvm/linker portions it needs for a task?

  • Does it pay to be evil by using the AMI/container/lambda_layer to store the code being compiled so it no-ops the S3 read? Even as evil as storing binary artifacts from the previous run and no-opping the parts of the task graph that are untainted by changes?

  • For spot pricing, is there a good tool to get costs across AZs/Regions?

  • How much of a win is there from using single-AZ S3 storage?

  • When does EFS beat S3 on cost/performance?
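On the latency profile question: one way to turn the ping run into numbers is to parse the summary line. This is a minimal sketch that assumes the Linux iputils summary format (`rtt min/avg/max/mdev = ... ms`); the function name `parse_ping_summary` is made up for illustration, and other ping implementations print a different line.

```python
import re

# Matches the iputils summary, e.g. "rtt min/avg/max/mdev = 0.045/0.052/0.071/0.008 ms"
RTT_RE = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)

def parse_ping_summary(output):
    """Return RTT stats in milliseconds, or None if no summary line is found."""
    match = RTT_RE.search(output)
    if match is None:
        return None
    keys = ("min_ms", "avg_ms", "max_ms", "mdev_ms")
    return dict(zip(keys, map(float, match.groups())))
```

Feeding it the captured stdout of the ping command gives a small dict per host that a scheduler could store next to the bandwidth figures.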
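On the Monte Carlo question: a toy simulation can at least show the shape of the answer. The sketch below assumes task durations are lognormal (a guess, not a measurement), that the first copy to finish wins, and that the losing copies are cancelled at that moment, so every copy is billed for the winner's duration. `simulate` and its parameters are hypothetical names.

```python
import random

def simulate(copies, trials=20000, mu=0.0, sigma=1.0, seed=0):
    """Estimate mean wall time and mean billed time for `copies` speculative runs.

    Durations are drawn from a lognormal(mu, sigma) distribution, which is an
    assumption standing in for measured artifact execution times.
    """
    rng = random.Random(seed)
    total_wall = 0.0
    total_billed = 0.0
    for _ in range(trials):
        runs = [rng.lognormvariate(mu, sigma) for _ in range(copies)]
        first = min(runs)               # the fastest copy decides the wall time
        total_wall += first
        total_billed += first * copies  # losers are cancelled when the winner finishes
    return total_wall / trials, total_billed / trials
```

Comparing `simulate(1)` against `simulate(2)` or `simulate(3)` for a given `sigma` shows how much wall time a heavy tail lets you buy back, and what it costs in billed seconds. Swapping in real per-task timing samples would be the obvious next step.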
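On emitting the task graph: Ninja can already dump a Graphviz graph with `ninja -t graph`, and `make -pn` prints Make's rule database, but both need real parsing. As a sketch of the processing side only, the code below assumes the graph has been reduced to simplified `target: dep dep` lines (an assumed format, not what either tool prints) and groups targets into waves that could run in parallel. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

def parse_rules(text):
    """Parse simplified 'target: dep dep' lines into {target: set(deps)}."""
    graph = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line or ":" not in line:
            continue
        target, deps = line.split(":", 1)
        graph.setdefault(target.strip(), set()).update(deps.split())
    return graph

def waves(graph):
    """Group nodes into batches whose dependencies are all already done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Each batch is a candidate for fanning out to Lambda workers, while narrow batches are the sections that would stay on the EC2 host.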
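On the dictionary compression question: the effect is easy to see on small, similar files. The sketch below swaps in zlib's preset dictionary (`zdict`) from the Python standard library as a stand-in for zstd, so it illustrates the principle only; a real zstd dictionary is trained from sample files, which this does not do. The helper names are hypothetical.

```python
import zlib

def compress(data, zdict=None):
    """Deflate `data`, optionally priming the compressor with a shared dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def decompress(blob, zdict=None):
    """Inverse of compress(); the same dictionary must be supplied."""
    if zdict:
        decomp = zlib.decompressobj(zdict=zdict)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

# Made-up example data: boilerplate that many source files might share.
SHARED = b'#include <stdio.h>\n#include <stdlib.h>\nint main(int argc, char **argv) {\n    return 0;\n}\n'
SAMPLE = b'#include <stdio.h>\nint main(int argc, char **argv) {\n    printf("hi\\n");\n    return 0;\n}\n'
```

Comparing `len(compress(SAMPLE))` with `len(compress(SAMPLE, SHARED))` shows the saving for one small file. Whether one dictionary or several wins would come down to measuring the same thing per directory or per file type across the codebase.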
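On the larger-files-plus-range-queries question: the idea amounts to a pack file with an offset index. This sketch builds the pack and the `Range` header value locally; `read_range` is a local stand-in for an HTTP GET carrying that header, not a real S3 call, and all names are hypothetical. It assumes non-empty files, since a zero-length byte range cannot be expressed this way.

```python
def pack(files):
    """Concatenate files into one blob and record (offset, length) per name."""
    blob = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(blob), len(data))
        blob += data
    return bytes(blob), index

def range_header(offset, length):
    """Build an HTTP Range header value; byte ranges are inclusive on both ends."""
    return f"bytes={offset}-{offset + length - 1}"

def read_range(blob, header):
    """Local stand-in for a ranged GET: slice the blob as a server would."""
    spec = header[len("bytes="):]
    start, end = map(int, spec.split("-"))
    return blob[start:end + 1]
```

One large object plus a small index means one PUT on the write side, and readers fetch only the slices a task needs.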
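On no-opping the untainted parts of the task graph: given a dependency map and the set of changed inputs, the tainted targets are everything reachable through reverse dependencies. A minimal sketch, assuming the graph is available as `{target: set(inputs)}` (a hypothetical shape) and that change detection, for example by content hash, happens elsewhere:

```python
from collections import deque

def dirty_targets(graph, changed):
    """Return every target that transitively depends on a changed input."""
    # Invert the graph: input -> targets that consume it.
    rdeps = {}
    for target, deps in graph.items():
        for dep in deps:
            rdeps.setdefault(dep, set()).add(target)

    dirty = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for target in rdeps.get(node, ()):
            if target not in dirty:
                dirty.add(target)
                queue.append(target)
    return dirty
```

Everything outside the returned set could reuse the artifact stored from the previous run, which is the part of the graph that gets no-opped.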