LeanerCloud / AutoSpotting

Saves up to 90% of AWS EC2 costs by automating the use of spot instances on existing AutoScaling groups. Installs in minutes using CloudFormation or Terraform. Convenient to deploy at scale using StackSets. Uses tagging to avoid launch configuration changes. Automated spot termination handling. Reliable fallback to on-demand instances.

Home Page: https://autospotting.io


Instance types "Unavailable in this Availability Zone" with erroneous price of 0

gabegorelick opened this issue

Issue type

Bug Report

Build number

9b438dc

Configuration

allowed_instance_types: current, but not sure that matters.

Environment

  • AWS region: us-east-1
  • Type of environment: VPC

Summary

AutoSpotting is reporting that certain spot instance types have a price of 0.

Here's what I see in the logs:

instance.go:376: Comparing current type t2.small with price 0.023 with candidate t2.small with price 0

Since AutoSpotting interprets a spot price of $0 as meaning the instance type is unavailable, you get a message saying "Unavailable in this Availability Zone" and AutoSpotting refuses to use that instance type.
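For illustration, the zero-price interpretation described above can be sketched as follows. This is hypothetical code, not AutoSpotting's actual implementation; the type and function names are invented:

```go
package main

import "fmt"

// spotPrice pairs an instance type with its hourly spot price.
// A price of exactly 0 means the pricing data had no entry for this
// type/AZ combination, so the candidate is treated as unavailable.
type spotPrice struct {
	instanceType string
	price        float64
}

// isAvailable reports whether a candidate has usable price data.
func isAvailable(p spotPrice) bool {
	return p.price > 0
}

func main() {
	current := spotPrice{"t2.small", 0.023}
	candidate := spotPrice{"t2.small", 0}
	fmt.Printf("Comparing current type %s with price %g with candidate %s with price %g\n",
		current.instanceType, current.price, candidate.instanceType, candidate.price)
	if !isAvailable(candidate) {
		// This is the situation reported in the logs above.
		fmt.Println("Unavailable in this Availability Zone")
	}
}
```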

The weird thing is that calling describe-spot-price-history indicates that the instance type does exist and has a non-zero spot price:

aws ec2 describe-spot-price-history --instance-types t2.small --availability-zone us-east-1b --product-description Linux/UNIX
{
    "SpotPriceHistory": [
        {
            "AvailabilityZone": "us-east-1b",
            "InstanceType": "t2.small",
            "ProductDescription": "Linux/UNIX",
            "SpotPrice": "0.006900",
            "Timestamp": "2020-01-24T09:30:58.000Z"
        },

This is happening intermittently for a number of instance types. I have no issues manually creating spot requests with the same instance type and AZ (they are fulfilled without issue).

Steps to reproduce

Setting allowed_instance_types: current makes the issue more visible, since when an instance type is reported as unavailable, AutoSpotting won't substitute a replacement instance type.

Thanks for reporting this.

This price data is coming from the spot instance price history, and intermittent issues might indicate API throttling, especially considering that you seem to be trying to run multiple instances of AutoSpotting in parallel from the same account.

> This price data is coming from the spot instance price history, and intermittent issues might indicate API throttling

It seems to be happening more consistently for some instance types than others. E.g. I saw it for a few hours straight with t2.medium while other instance types returned data fine. So I'm not sure API throttling is the issue (or if it is, AWS is doing very strange throttling).

> you seem to be trying to run multiple instances of AutoSpotting in parallel from the same account.

I actually still only have one instance of AutoSpotting. Gotta figure out these issues before rolling it out more 😄 But that lends more evidence that something beyond throttling is going on.

I might be wrong, but I think there is a possibility that we cache the pricing information within the Lambda function, so you might see the same corrupt data for subsequent executions until the Lambda function is scheduled onto another host. Can you see any throttling errors in the CloudTrail logs?
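The caching concern raised here can be sketched with a toy example (hypothetical code, not AutoSpotting's actual implementation): a package-level variable survives across Lambda invocations when the execution environment is reused, so a bad value fetched once can keep being served.

```go
package main

import "fmt"

// priceCache is package-level state. In a Lambda, such state persists
// across invocations that land on the same warm container.
var priceCache = map[string]float64{}

// getPrice returns the cached price if present; otherwise it calls
// fetch, caches the result (even a bogus 0), and returns it.
func getPrice(instanceType string, fetch func(string) float64) float64 {
	if p, ok := priceCache[instanceType]; ok {
		return p // cached value reused, even if it was a bad 0
	}
	p := fetch(instanceType)
	priceCache[instanceType] = p
	return p
}

func main() {
	// First invocation hits a bad API response and caches 0.
	getPrice("t2.medium", func(string) float64 { return 0 })
	// A later invocation in the same container never re-fetches,
	// so the corrupt 0 sticks until the container is recycled.
	fmt.Println(getPrice("t2.medium", func(string) float64 { return 0.0139 }))
}
```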

@cristim I see no evidence of throttling in CloudTrail.


Could this be because DescribeSpotPriceHistory is a paginated endpoint, but AutoSpotting is not using the paginated version?

As far as I remember (without looking at the code), we just query the current price one instance type at a time, so we shouldn't need additional pages. I guess this could cause problems if we queried the prices for multiple instance types that don't fit in a single page.

@cristim Pretty sure it queries for all instance types in the region at once.
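The pagination pitfall under discussion can be illustrated with a self-contained sketch (a toy paginated API standing in for the real EC2 endpoint; all names here are invented): reading only the first page silently drops instance types that land on later pages, which then look like they have no price at all.

```go
package main

import "fmt"

// page mimics one response from a paginated API: some results plus a
// NextToken that is empty on the final page.
type page struct {
	prices    map[string]float64
	nextToken string
}

// fakeAPI simulates the paginated endpoint: t2.medium only appears on
// the second page.
func fakeAPI(token string) page {
	if token == "" {
		return page{prices: map[string]float64{"t2.small": 0.0069}, nextToken: "p2"}
	}
	return page{prices: map[string]float64{"t2.medium": 0.0139}}
}

// firstPageOnly mimics a non-paginated call that ignores NextToken.
func firstPageOnly() map[string]float64 {
	return fakeAPI("").prices
}

// allPages follows NextToken until the results are exhausted.
func allPages() map[string]float64 {
	out := map[string]float64{}
	token := ""
	for {
		p := fakeAPI(token)
		for k, v := range p.prices {
			out[k] = v
		}
		if p.nextToken == "" {
			return out
		}
		token = p.nextToken
	}
}

func main() {
	// Missing key in a Go map reads as the zero value: the "price 0"
	// that gets reported as "Unavailable in this Availability Zone".
	fmt.Println(firstPageOnly()["t2.medium"])
	fmt.Println(allPages()["t2.medium"])
}
```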

Regardless, I'll have a fix up shortly. Just adding some tests now.

Awesome, huge thanks!

See #408. Not sure if it will fix it, but worth a shot.

Haven't had any issues since upgrading to #408. I think we can close for now and reopen if it pops up again.

Great to hear that, thanks!