nchammas / flintrock

A command-line tool for launching Apache Spark clusters.

Option for Minimum EBS Root Volume Size

PiercingDan opened this issue

  • Flintrock version: 0.7.0

There should be an option in the Flintrock configuration file to change min_root_device_size_gb = 30 (line 626 of ec2.py) to any desired value. 30 GB may be excessive and costly in some cases, provided the AMI is smaller than 30 GB (10 GB in my case).

Edit: I address this also in my guide.
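For illustration, here is a minimal sketch of what such an option could look like; the min-root-ebs-size-gb config key and the helper function are hypothetical, not part of Flintrock:

    # Hypothetical sketch of the requested option. Flintrock currently
    # hard-codes min_root_device_size_gb = 30 in ec2.py; here the minimum
    # would come from the user's config file instead.
    DEFAULT_MIN_ROOT_GB = 30

    def resolve_root_volume_size(ami_root_size_gb: int, user_config: dict) -> int:
        """Pick the size of the root EBS volume for a new instance."""
        min_root_gb = user_config.get('min-root-ebs-size-gb', DEFAULT_MIN_ROOT_GB)
        # EC2 rejects a root volume smaller than the AMI itself, so never go below it.
        return max(ami_root_size_gb, min_root_gb)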

If I'm remembering my Flintrock history correctly, I believe I set the default size to 30 GB because 10 GB is not enough to build Spark from source, which is one of the features that Flintrock supports. The initial 10 GB default was also reported as too small by several early users of Flintrock. I set this new default in #50.

What's the additional cost when going from 10 GB to 30 GB for the root volume if, say, you have a 100-node cluster? I remember it being minuscule, but I don't have a hard calculation documenting it.

I'm inclined to leave this default as-is without an option to change it, since every new option complicates things a bit. But if the added cost is significant I would be open to reconsidering, since I know one of the reasons people use Flintrock over, say, EMR is to cut costs.

EDIT: Below has been modified

From my guide (based on https://aws.amazon.com/ebs/pricing/):

The price for Amazon EBS gp2 volumes is $0.10 per GB-month in US East. Since Flintrock sets its default minimum EBS root volume size to 30 GB, the EBS volume costs about $0.10 per day per instance, or $0.004 per hour per instance, regardless of the instance type or AMI, whereas spot-requested m3.medium instances cost about $0.01 per hour per instance.

The price is comparable to the instance cost.
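To spell out the arithmetic (a quick sketch using the $0.10 per GB-month gp2 price quoted above; the 100-node figure is the example from the earlier question):

    # Back-of-the-envelope cost check for the 30 GB default root volume
    # (gp2 at $0.10 per GB-month, US East).
    gb_month_price = 0.10
    hours_per_month = 30 * 24

    per_instance_monthly = 30 * gb_month_price                    # $3.00 per month
    per_instance_hourly = per_instance_monthly / hours_per_month  # ~$0.0042 per hour

    # Extra cost of a 30 GB vs. a 10 GB root volume across a 100-node cluster:
    extra_cluster_hourly = (30 - 10) * gb_month_price / hours_per_month * 100
    print(f"~${per_instance_hourly:.4f}/hour per instance, "
          f"~${extra_cluster_hourly:.2f}/hour extra for 100 nodes")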

I find a 30 GB EBS volume small for my HDFS cluster use. Is there any other way to increase the HDFS cluster's disk size?

You could do one of the following:

  • Increase the size of the snapshot/AMI you're launching from
  • Change min_root_device_size_gb = 30 to the desired size on line 626 of ec2.py (see the sketch after this list)
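For reference, this is roughly how a larger root volume is requested through the EC2 API; a generic boto3 sketch, not Flintrock's actual code, with placeholder AMI ID, device name, and size:

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')

    # Placeholders: substitute your own AMI, instance type, and desired size.
    ec2.run_instances(
        ImageId='ami-xxxxxxxx',
        InstanceType='m4.large',
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=[
            {
                'DeviceName': '/dev/sda1',  # root device name declared by the AMI
                'Ebs': {
                    'VolumeSize': 100,      # GB; must be >= the size baked into the AMI
                    'VolumeType': 'gp2',
                },
            },
        ],
    )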

@PiercingDan EBS gp2 volume pricing is $0.10 per GB-month, so it only costs $3 per month for 30 GB, and the hourly cost of 3 / (24 * 30) ≈ $0.004 is less than the instance cost of $0.01/hour.

Good catch, @pragnesh.

@pragnesh:

I find a 30 GB EBS volume small for my HDFS cluster use. Is there any other way to increase the HDFS cluster's disk size?

Flintrock deploys HDFS to the EBS root volume only if the AMI has no ephemeral volumes attached. If you select a larger instance type that has ephemeral volumes (also called instance store volumes) Flintrock will use those instead for HDFS. That's because they are super fast (faster than EBS), and Flintrock users (from my understanding) typically use HDFS in conjunction with Spark to share things like shuffle files or temporarily stage data before starting their job. The permanent store for these users is typically something like S3. I strongly recommend against using Flintrock-managed HDFS as anything other than a temporary store for your data.
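A rough sketch of that selection behavior (the function name and directory paths are invented for illustration and are not Flintrock's actual implementation):

    def pick_hdfs_data_dirs(ephemeral_mount_points):
        """Prefer instance-store (ephemeral) volumes for HDFS data directories,
        falling back to the EBS root volume when none are attached."""
        if ephemeral_mount_points:
            # Instance store is faster than EBS and works well for shuffle files
            # or temporarily staged data.
            return [mount + '/hdfs/data' for mount in ephemeral_mount_points]
        # No ephemeral volumes on this instance type: use the root EBS volume.
        return ['/hdfs/data']

    # Example: one instance-store volume mounted at /media/ephemeral0
    print(pick_hdfs_data_dirs(['/media/ephemeral0']))  # ['/media/ephemeral0/hdfs/data']
    print(pick_hdfs_data_dirs([]))                     # ['/hdfs/data']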

This should probably be documented explicitly somewhere. I don't believe it currently is.

@nchammas We use HDFS only as a temporary store. I know we can use instances with ephemeral volumes if we need more HDFS storage, but spot prices for instance types with instance store volumes are usually higher and change frequently, so to avoid losing instances we tend to use instance types like m4.large. For us, EBS performance for HDFS is not a big issue compared to losing an instance during a running job, and we can work around it by adding more instances. It would just be nice to have a setting for this in the launch config.